I'm gonna guess just switching from round-robin to leastconn (most balancers offer that option) would solve that just fine. You can then dynamically tune server weights if you have servers of unequal size or some other issues.
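For concreteness, here's a minimal sketch of what that looks like in HAProxy; the backend name, addresses, and weights are all made up for illustration:

```
# Hypothetical HAProxy backend using leastconn instead of round-robin,
# with static weights for servers of unequal size.
backend app
    balance leastconn
    server app1 10.0.0.1:8080 weight 100 check
    server app2 10.0.0.2:8080 weight 50 check   # half-size box gets half the traffic
```

Weights can also be adjusted at runtime via the stats socket (`set server app/app2 weight 75`) without reloading the config.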
Yeah, I really don't understand why they went this direction, as it builds considerable additional complexity directly into the application to solve a problem with an external component.
I would probably have approached this by implementing a fix for the misbehaving part of k8s, though since there isn't a default LoadBalancer in k8s, I can't really speculate further as to the root cause of the initial problem. But most CNI or cloud providers that implement LB do have a way to take feedback from an external metric. I'd be curious why doing it this way wasn't considered, at least.
Yeah, that can work. Just yesterday I benchmarked load balancing of LLM workloads across 2 GPUs using a simple least_conn from nginx. The total tokens/sec scaled as expected (2 GPUs => 2x tokens/sec), and GPU utilization reached 100% on both as I increased concurrency from 1 to 128 simultaneous generations.
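The nginx side of that setup is tiny. A rough sketch of what I mean (ports and backend addresses here are assumptions, one inference server per GPU):

```
# Minimal nginx least_conn fan-out over two per-GPU inference servers.
upstream llm_backends {
    least_conn;                  # route to the backend with the fewest active connections
    server 127.0.0.1:8000;      # inference server pinned to GPU 0
    server 127.0.0.1:8001;      # inference server pinned to GPU 1
}

server {
    listen 80;
    location / {
        proxy_pass http://llm_backends;
        proxy_read_timeout 300s;   # generations can stream for a while
        proxy_buffering off;       # don't buffer streamed tokens
    }
}
```

least_conn works well here because generation requests have wildly varying durations, so "fewest in-flight requests" tracks actual GPU load much better than round-robin does.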