On my 2x 3090s I am running glm4.5 air q1 and it runs at ~300pp and 20/30 tk/s w...

imiric · 2025-09-23T11:19:47 1758626387

Thanks, but I find it hard to believe that a Q1 model would produce decent results.

I see that the Q2 version is around 42GB, which might be doable on 2x 3090s, even if some of it spills over to CPU/RAM. Have you tried Q2?

elsombrero · 2025-09-23T12:33:37 1758630817

well, I tried it and it works for me. llm output is hard to properly evaluate without actually using it.

I read a lot of good comments on r/localllama, with most people suggesting qwen3 coder 30ba3b, but I never got it to work as well as GLM 4.5 air Q1.

As for using Q2, it will fit in vram, but with very small context or spill over to RAM, but with quite an impact on speed depending on your setup. I have slow ddr4 ram and going for Q1 has been a good compromise for me, but YMMV.

ericdotlee · 2025-09-23T13:37:07 1758634627

What is llama-swap?

Been looking for more details about software configs on https://llamabuilds.ai

elsombrero · 2025-09-23T15:33:48 1758641628

https://github.com/mostlygeek/llama-swap

it's a transparent proxy that automatically launches your selected model with your preferred inference server so that you don't need to manually start/stop the server when you want to switch model

so, let's say I have configured roo code to use qwen3 30ba3b as the orchestrator and glm4.5 air as coder, roo code would call the proxy server with model "qwen3" when using orchestrator mode and then kill llama.cpp with qwen3 and restart it with "glm4.5air"