It is true that you can only squeeze 100% of the maximum possible useful compute out of a NUMA system with methods like the ones the article's author suggests. The less coordination there is between cores, the less cross-core or cross-socket communication is needed, and all of that communication is overhead.
Caveat: if a bunch of independent processes are each working on independent data, they'll increase cache thrashing at L2 and higher cache levels. Synchronised threads running the same code more or less in lockstep over the same areas of the data can benefit from cache sharing in ways that independent processes can't. In some scenarios this can be a huge speedup -- just ask a GPU programmer!
Where the process-per-core approach definitely stops being a good idea is when you start to consider latency.
Literally just this week, I had to help someone working on a Node.js app that needs to pre-cache a bunch of very expensive computations (map tiles rendered over data that changes on an interval).
Because this is CPU-heavy and Node.js is single-threaded, it kills the user experience while it is running. Interactive responses get interleaved with batch actions, and users complain.
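A minimal sketch of that failure mode, with a hypothetical renderTile function standing in for the real tile computation: while the synchronous pre-caching loop runs, the HTTP server below cannot answer a single request, because both share the one event-loop thread.

```ts
import http from "node:http";

// Hypothetical stand-in for the expensive per-tile computation.
function renderTile(x: number, y: number): Buffer {
  let acc = 0;
  for (let i = 0; i < 5_000_000; i++) acc += Math.sqrt(i + x + y);
  return Buffer.from([acc & 0xff]);
}

const cache = new Map<string, Buffer>();

// Interactive traffic: should respond in milliseconds.
http.createServer((req, res) => res.end("ok")).listen(3000);

// Batch pre-caching: runs on the same single event-loop thread,
// so every interactive request queued behind it just waits.
setInterval(() => {
  for (let x = 0; x < 50; x++) {
    for (let y = 0; y < 50; y++) {
      cache.set(`${x}/${y}`, renderTile(x, y));
    }
  }
}, 60_000);
```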
This is not a problem with ASP.NET, where this kind of work can simply run on a background thread and populate the cache without interfering with user queries!
Someone familiar with this please confirm: if you have a Node.js app, which is truly single-threaded, do people run multiple copies per physical CPU, or do they just max out one core and leave the rest idle? Or do they lease the cores separately from their hosting provider by running a bunch of one-core container instances or something?
Node supports two ways of getting additional event-loop threads (the event loop being the "single-threaded" part people talk about with Node, often without understanding much about its internals, since the Node process itself spawns many threads in the background).
The first mode is child processes: the main process forks an entirely separate instance of Node with its own event loop, which you communicate with over IPC or a network socket.
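A sketch of the first mode, split across two hypothetical files; the child is a full Node process with its own event loop, and the parent talks to it over the built-in IPC channel:

```ts
// parent.ts
import { fork } from "node:child_process";

// Fork an entirely separate Node instance (the compiled worker script)
// with its own event loop.
const child = fork("./worker.js");

child.on("message", (msg) => console.log("result from child:", msg));

// Hand the batch job to the child; the parent's event loop stays free.
child.send({ cmd: "precache", zoom: 12 });
```

```ts
// worker.ts (runs in the forked child process)
process.on("message", (msg) => {
  // ...do the CPU-heavy work here...
  if (process.send) process.send({ done: true, job: msg });
});
```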
The second mode (introduced fairly recently) is the ability to spin off worker threads, which have their own event loop but share the main process's worker thread pool. I think there is a way to share memory between these threads via a special type of buffer (SharedArrayBuffer), but I have never used them.
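And a sketch of the second mode, kept in one file via the isMainThread check; the shared memory is a SharedArrayBuffer handed over through workerData (this assumes a CommonJS build so __filename is available):

```ts
import {
  Worker, isMainThread, parentPort, workerData,
} from "node:worker_threads";

if (isMainThread) {
  // Memory shared (not copied) between the main thread and the worker.
  const shared = new SharedArrayBuffer(4);
  const tilesDone = new Int32Array(shared);

  const worker = new Worker(__filename, { workerData: shared });
  worker.on("message", () => {
    console.log("tiles rendered:", Atomics.load(tilesDone, 0));
  });
} else {
  const tilesDone = new Int32Array(workerData as SharedArrayBuffer);
  // The CPU-heavy loop runs on this thread, not the main event loop.
  for (let i = 0; i < 1_000; i++) {
    Atomics.add(tilesDone, 0, 1); // pretend each iteration renders one tile
  }
  parentPort?.postMessage("done");
}
```

For the map-tile case above, this second mode is the closer analogue to the ASP.NET background thread: the pre-caching loop moves off the request-serving thread entirely.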
The first mode maps directly to the idea of microservices, just running on the same machine. AFAIK this is why it is not really used in modern cloud-based apps, which use single-core microservice instances instead. That approach has a higher latency cost but allows cheaper instances and much simpler services; it very much depends on the use case whether that is the correct choice or not.
I mentally translate “modern web development” to: proxies for the proxies and layers of load balancers upon load balancers to make the envoys work with the ingress, all through an API management layer for good measure… and then a CDN.
It’s a common architecture antipattern to try to run non-interactive tasks on the same resources as interactive ones.
Even if you avoid the initial problems by punching the CPU priority through the floor, someone will eventually introduce a bug or feature that drastically increases IOPS or memory usage, or find some other way to cause priority inversion. You will think you can deploy this code any time you want, and discover that you can't.
And having survived the first and second crises, you will arrive at the final one: a low-priority process is not a no-priority process. People will expect it to complete "on time". As the other processes saturate the available resources, whether through code bloat or right-sizing, your background task stops completing the way people have come to expect. It can no longer run "for free" as a parasite on spare capacity and has to have its own hardware.
For similar reasons (long-running work sharing a single thread with interactive requests), Redis replacements that use multi-threading have far lower tail latencies: https://microsoft.github.io/garnet/