Interesting! Fingers crossed someone who's looking for your skillset finds your post.
What is your process to turn Mac minis into a cluster? Is there any special hardware involved? And you can get 100x tok/s vs others on comparable hardware, what do you do differently - hardware, software, something else?
>What is your process to turn Mac minis into a cluster
1) Apply science. Benchmark everyting until you understand if its memory bound, i/o bound or compute bound [1].
2) Rewrite software from scratch in a parallel form with message passing.
3) Reverse engineer native instruction sets of CPU, GPU and ANE or TPU. Same for NVIDIA (don't use CUDA).
No special hardware needed but adding FPGA's for optimizing the network between machines might help.
So you analyse the software and hardware, then restructure it by reprogramming and rewireing and adaptive compilers.
Then you benchmark again and you find what hardware runs the algorithm fastest for less $ using less energy and weigh that against the extra cost for reprogramming.
I discussed all the points you ask about in my HN postings last month, but never in enough detail so you must ask me to specify and that's when people hire me.
As you can see from this comments thread, most people, especially programmers, lack the knowledge we computer scientist, parallel programmers and chip or hardware designers have.
>What is your process
Science. To measure is to know, my prof always said.
To answer your questions in detail, email me.
You first need to be specific. The problem is not how to turn Mac minis into a cluster, with or without custom hardware ( I do both) on code X or Y. Or how to optimize software or rewrite it from scratch (which its often cheaper).
First find the problem. In this case the problem is find the lowest OPEX and Capex to do the stated compute load versus changing the compute load. Turns out in a simulation or a cruder spreadsheet calculation it becomes clear that the energy cost dominates of hardware choice, it trumps the cost of programming, the cost of off the shelf hardware and the difference if you add custom hardware. M4's are lower power, lower OPEX and lower CAPEX especially if you rewrite your (Nvidia GPU) software. The problem is the ignorance of the managers and their employee programmers.
You can repurpose the 2 x 10 Gbps USB-C, the 10 Gbps Ethernet and the three 32 Gbps PCIe ports or Thunderbolts but you have to use better drivers. You need to weigh if double the 960 Gbps 16 GB unified memory for 2 x $400 is faster than 2 Tbps memory at 1.23 times the cost versus 3 x 4 x 32 Gbps PCIe 4.0 versus 3 x 120 Gbps unidirectionally is better for this particular algorithm and wheat changes if you uses both the 10 CPU cores, 10 x 400 GPU corses and 16 Neural Engine cores (at 38 trillion 16 bit OPS) will work batter than just the CUDA cores. Ususally the answers is: rewrite the alogoritm and use an adaptive compiler and then a cluster of smaller 'sweet spot' off the shelf hardware will outperform the most fancy high end hardware if the network is balanced. This varies at runtime so you'll only know if you now how to code. As Akan Kay said and Steve Jobs quoted: if your serious about software you should do your own hardware. If you can't, then you can approach the hardware with commodity components if that turns out to be cheaper. I estimate for $42K labour I can save you a few hundred $k.
What is your process to turn Mac minis into a cluster? Is there any special hardware involved? And you can get 100x tok/s vs others on comparable hardware, what do you do differently - hardware, software, something else?