Apple said the Max's CPU is up to 2.5x faster and the Ultra's CPU is up to 3.8x faster than whatever Intel CPU is in the iMac Pro. That works out to roughly 52% more CPU performance (3.8 / 2.5 ≈ 1.52) for the Ultra's doubling of CPU cores over the Max, so it's definitely hitting some scaling limitations from the interconnect rather than scaling linearly.
I don't work in the relevant space, but what makes coding for multi-CPU systems substantially harder than programming for multiple cores? Is it just having to manage separate memory for each CPU?
When you work with multiple cores, they'll likely share the same L2 or at least L3 cache. With multiple CPUs, you often pay the cost of copying that L3 cache over, or you need an L4, or in the worst case you go back to system RAM.
Each level further out can drastically reduce performance, so you need to try to stay on nearby cores where possible.
Most devs don't account for that, since dual-CPU machines are in the vast minority.
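As a rough illustration of "staying on nearby cores": on Linux you can pin a worker thread to a specific core so the scheduler doesn't migrate it and throw away its warm caches. This is just a minimal sketch; the choice of core 2 is an arbitrary assumption, and which cores actually share a cache depends on the machine.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Toy worker: in a real program this would be a hot loop over
       data that benefits from staying in one core's L1/L2. */
    static void *worker(void *arg) {
        (void)arg;
        return NULL;
    }

    int main(void) {
        pthread_t t;
        cpu_set_t set;

        pthread_create(&t, NULL, worker, NULL);

        /* Pin the thread to core 2 (assumed here to be "nearby" to the
           data it works on) so it keeps hitting the same caches. */
        CPU_ZERO(&set);
        CPU_SET(2, &set);
        pthread_setaffinity_np(t, sizeof(set), &set);

        pthread_join(t, NULL);
        return 0;
    }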
Yes, as with all NUMA machines, one CPU can access all memory, both local (to the CPU) and remote (through the interconnect). The problem is that there is a significant latency cost when a CPU accesses non-local memory (limitations of the interconnect). So the HPC people writing their algorithms make sure this happens as little as possible, by ensuring that the data each CPU uses is allocated as locally as possible (e.g. by using the affinity controls provided by libnuma).
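For example, a minimal libnuma sketch (Linux only; node 0 and the buffer size are arbitrary assumptions) allocates the working set on one node and runs the thread on that same node, so accesses stay local instead of crossing the interconnect:

    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        /* libnuma only works where the kernel exposes NUMA support. */
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA not available on this system\n");
            return 1;
        }

        size_t len = 64 * 1024 * 1024;

        /* Allocate the buffer from node 0's local memory ... */
        double *buf = numa_alloc_onnode(len, 0);
        if (!buf) return 1;

        /* ... and keep the current thread on node 0 as well, so every
           access is local rather than going over the interconnect. */
        numa_run_on_node(0);

        for (size_t i = 0; i < len / sizeof(double); i++)
            buf[i] = (double)i;

        numa_free(buf, len);
        return 0;
    }

(Compile with -lnuma.)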
I was just curious if these kinds of optimizations are possible in the M1 Ultra.
The way Apple presented it, it sounded more like the chips communicate at a lower level, as if it were all built as one physical chip, rather than two normal chips joined by an interconnect fabric.
Someone will figure it out with benchmarks or something.