Running a local model is not an apples-to-apples comparison. Yes, if you run a small model 24/7, don't care about output latency, and your utilization is completely static with no bursts, it can look cheap. But most people want output now, not in 10 hours. They want it from the best models, with large context windows. Combine that with serving millions of users and it gets complicated and expensive.
Yes, but usage is not uniform even when you have millions of users. More users smooth out the random noise, yet the absolute peaks and troughs grow with the user base, because everyone's schedule is correlated. At 3am, US usage drops to effectively zero. Maybe you can repurpose that compute for customers in Asia, but then you're competing with local compute that has far better latency.
Then you have seasonal peaks/troughs, such as the school year vs summer.
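The correlated-demand point above can be sketched numerically. This is a toy model, not real traffic data: each user is assumed to follow the same hypothetical diurnal curve plus independent noise. The noise averages out as users grow, but the shared daily cycle does not, so the absolute peak-to-trough gap you must provision for scales linearly with the user count.

```python
import math
import random

random.seed(0)

def aggregate_demand(n_users, hour):
    """Hypothetical aggregate demand at a given hour (0-23).

    The diurnal component is identical across users (correlated),
    so it scales with n_users; the independent per-user noise is
    approximated via the CLT and only grows like sqrt(n_users).
    """
    diurnal = max(0.0, math.sin(math.pi * (hour - 6) / 12))  # peaks midday, ~0 overnight
    noise = random.gauss(0.0, 0.1 * math.sqrt(n_users))      # summed per-user noise
    return max(n_users * diurnal + noise, 0.0)

for n in (1_000, 1_000_000):
    demands = [aggregate_demand(n, h) for h in range(24)]
    gap = max(demands) - min(demands)
    # The relative daily shape is the same, but the idle capacity
    # implied by the trough grows with the user count.
    print(f"{n:>9} users: peak-to-trough gap ~ {gap:,.0f}")
```

With 1,000x the users, the gap is roughly 1,000x larger: scale smooths the jitter, not the daily cycle.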
When you want four nines of uptime and good latency, you either overprovision hardware and eat the idle cost, or rent compute and pay a provider's overhead. Both cost a lot.
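A back-of-envelope sketch of that trade-off. Every number here is a hypothetical placeholder (GPU counts, hourly rates, utilization), chosen only to show the shape of the comparison, not real pricing: owning means paying for the full peak fleet around the clock, renting means paying a markup only on the hours you actually use.

```python
# All figures are hypothetical placeholders, not real GPU prices.
OWNED_COST_PER_GPU_HOUR = 2.0    # amortized hardware + power + ops
RENTED_COST_PER_GPU_HOUR = 3.5   # on-demand rate incl. provider margin
PEAK_GPUS = 1_000                # fleet size needed to hold latency at peak
AVG_UTILIZATION = 0.40           # average fraction of peak capacity in use

HOURS_PER_YEAR = 24 * 365

# Option A: own enough hardware for peak; the idle 60% still costs money.
own_cost = PEAK_GPUS * OWNED_COST_PER_GPU_HOUR * HOURS_PER_YEAR

# Option B: rent only what the load requires, paying per-hour overhead.
rent_cost = PEAK_GPUS * AVG_UTILIZATION * RENTED_COST_PER_GPU_HOUR * HOURS_PER_YEAR

print(f"own:  ${own_cost:,.0f}/yr")
print(f"rent: ${rent_cost:,.0f}/yr")
```

Under these made-up numbers both land in the eight figures per year; renting wins only while utilization stays low, and the crossover moves with the rental markup, which is the point: neither option makes the cost go away.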