I'm gonna guess just switching from round-robin to leastconn (most balancers offer that option) would solve that just fine. You can then dynamically tune server weights if you have servers of unequal size or some other issues.
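For concreteness, here's a minimal sketch of what that looks like in HAProxy; the backend name, addresses, and weights are all made up for illustration:

```
# Hypothetical HAProxy backend using leastconn instead of round-robin,
# with static weights for servers of unequal size.
backend app
    balance leastconn
    server app1 10.0.0.1:8080 weight 100 check
    server app2 10.0.0.2:8080 weight 50 check   # half-size box gets half the traffic
```

Weights can also be adjusted at runtime via the stats socket (`set server app/app2 weight 75`) without reloading the config.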
Yeah, I really don't understand why they went this direction, as it builds considerable additional complexity directly into the application to solve a problem with an external component.
I would probably have approached this by implementing a fix for the misbehaving part of k8s, though since there isn't a default LoadBalancer in k8s, I can't really speculate further as to the root cause of the initial problem. But most CNI or cloud providers that implement LB do have a way to take feedback from an external metric. I'd be curious why doing it this way wasn't considered, at least.
Yeah, that can work. Just yesterday I benchmarked load balancing of LLM workloads across 2 GPUs using a simple least_conn from nginx. The total tokens/sec scaled as expected (2 GPUs => 2x tokens/sec), and GPU utilization reached 100% on both as I increased concurrency from 1 to 128 simultaneous generations.
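The nginx side of that setup is tiny. A rough sketch of what I mean (ports and backend addresses here are assumptions, one inference server per GPU):

```
# Minimal nginx least_conn fan-out over two per-GPU inference servers.
upstream llm_backends {
    least_conn;                  # route to the backend with the fewest active connections
    server 127.0.0.1:8000;      # inference server pinned to GPU 0
    server 127.0.0.1:8001;      # inference server pinned to GPU 1
}

server {
    listen 80;
    location / {
        proxy_pass http://llm_backends;
        proxy_read_timeout 300s;   # generations can stream for a while
        proxy_buffering off;       # don't buffer streamed tokens
    }
}
```

least_conn works well here because generation requests have wildly varying durations, so "fewest in-flight requests" tracks actual GPU load much better than round-robin does.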