For anyone who’s already running this locally: what’s the simplest setup right n...

johndough · 2026-01-19T18:18:25 1768846705

I've been running it with llama-server from llama.cpp (compiled for CUDA backend, but there are also prebuilt binaries and instructions for other backends in the README) using the Q4_K_M quant from ngxson on Lubuntu with an RTX 3090:

https://github.com/ggml-org/llama.cpp/releases

https://huggingface.co/ngxson/GLM-4.7-Flash-GGUF/blob/main/G...

https://github.com/ggml-org/llama.cpp?tab=readme-ov-file#sup...

    llama-server -ngl 999 --ctx-size 32768 -m GLM-4.7-Flash-Q4_K_M.gguf

You can then chat with it at http://127.0.0.1:8080 or use the OpenAI-compatible API at http://127.0.0.1:8080/v1/chat/completions

Seems to work okay, but there usually are subtle bugs in the implementation or chat template when a new model is released, so it might be worthwhile to update both model and server in a few days.

mistercheph · 2026-01-19T18:36:40 1768847800

I think the recently introduced -fit option which is on by default means it's no longer necesary to -ngl, can also probably drop -c which is "0" by default and reads metadata from the gguf to get the model's advertised context size

johndough · 2026-01-19T21:00:20 1768856420

I had already removed three parameters which were no longer needed, but I hadn't yet heard that the other two had also become superfluous. Thank you for the update! llama.cpp sure develops quickly.

ljouhet · 2026-01-19T18:59:31 1768849171

Something like

    ollama run hf.co/ngxson/GLM-4.7-Flash-GGUF:Q4_K_M

It's really fast! But, for now it outputs garbage because there is no (good) template. So I'll wait for a model/template on ollama.com

jmorgan · 2026-01-19T22:28:31 1768861711

It's available (with tool parsing, etc.): https://ollama.com/library/glm-4.7-flash but requires 0.14.3 which is in pre-release (and available on Ollama's GitHub repo)

zackify · 2026-01-19T19:25:59 1768850759

LM Studio Search for 4.7-flash and install from mlx community

pixelmelt · 2026-01-19T17:25:43 1768843543

I would look into running a 4 bit quant using llama cpp (or any of its wrappers)