What kind of hardware does HN recommend or like to run these models?

xienze · 2026-02-28T21:55:30 1772315730

It's less than you'd think. I'm using the 35B-A3B model on an A5000, which is something like a slightly faster 3080 with 24GB VRAM. I'm able to fit the entire Q4 model in memory with 128K context (and I think I would probably be able to do 256K since I still have like 4GB of VRAM free). Prompt processing runs at something like 1K tokens/second and generation at around 100 tokens/second. Plenty fast for agentic use via Opencode.

rahimnathwani · 2026-02-28T22:06:21 1772316381

There seem to be a lot of different Q4s of this model: https://www.reddit.com/r/LocalLLaMA/s/kHUnFWZXom

I'm curious which one you're using.

suprjami · 2026-02-28T22:11:35 1772316695

Unsloth Dynamic. Don't bother with anything else.

rahimnathwani · 2026-03-01T15:34:36 1772379276

For anyone else trying to run this on a Mac with 32GB unified RAM, this is what worked for me:

First, make sure enough memory is allocated to the GPU:

  sudo sysctl -w iogpu.wired_limit_mb=24000

Then run llama.cpp but reduce RAM needs by limiting the context window and turning off vision support. (And turn off reasoning for now as it's not needed for simple queries.)

  llama-server \
    -hf unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL \
    --jinja \
    --no-mmproj \
    --no-warmup \
    -np 1 \
    -c 8192 \
    -b 512 \
    --chat-template-kwargs '{"enable_thinking": false}'

You can also enable/disable thinking on a per-request basis:

  curl 'http://localhost:8080/v1/chat/completions' \
  --data-raw '{"messages":[{"role":"user","content":"hello"}],"stream":false,"return_progress":false,"reasoning_format":"auto","temperature":0.8,"max_tokens":-1,"dynatemp_range":0,"dynatemp_exponent":1,"top_k":40,"top_p":0.95,"min_p":0.05,"xtc_probability":0,"xtc_threshold":0.1,"typ_p":1,"repeat_last_n":64,"repeat_penalty":1,"presence_penalty":0,"frequency_penalty":0,"dry_multiplier":0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":-1,"samplers":["penalties","dry","top_n_sigma","top_k","typ_p","top_p","min_p","xtc","temperature"],"chat_template_kwargs": { "enable_thinking": true }}'|jq .
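For anyone scripting this instead of using curl, here is a minimal Python sketch of the same per-request toggle using only the standard library. It assumes the server from the command above is listening on localhost:8080. `build_payload` and `chat` are hypothetical helper names, not part of llama.cpp. Only the endpoint and the `chat_template_kwargs` field are taken from the curl command; the sampler fields are left out on the assumption that the server defaults are acceptable.

```python
import json
import urllib.request

# Assumed address of the llama-server instance started above.
URL = "http://localhost:8080/v1/chat/completions"


def build_payload(prompt: str, thinking: bool) -> dict:
    """Build a chat request body that toggles thinking for this request only."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        # Same field as in the curl command above.
        "chat_template_kwargs": {"enable_thinking": thinking},
    }


def chat(prompt: str, thinking: bool = False, url: str = URL) -> dict:
    """POST the request and return the parsed JSON response."""
    body = json.dumps(build_payload(prompt, thinking)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With the server running, `chat("hello", thinking=True)` returns the parsed response, and the reply text should be under `choices[0].message.content` as in other OpenAI-style responses.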

If anyone has any better suggestions, please comment :)

suprjami · 2026-03-02T04:59:03 1772427543

Shouldn't you be using MLX because it's optimised for Apple Silicon?

Many user benchmarks report up to 30% better memory usage and up to 50% higher token generation speed:

https://reddit.com/r/LocalLLaMA/comments/1fz6z79/lm_studio_s...

As the post says, LM Studio has an MLX backend which makes it easy to use.

If you still want to stick with llama-server and GGUF, look at llama-swap which allows you to run one frontend which provides a list of models and dynamically starts a llama-server process with the right model:

https://github.com/mostlygeek/llama-swap

(actually you could run any OpenAI-compatible server process with llama-swap)

rahimnathwani · 2026-03-02T05:04:44 1772427884

I didn't know about llama-swap until yesterday. Apparently you can set it up so that it offers different 'model' choices which are the same model with different parameters. So, e.g. you can have 'thinking high', 'thinking medium' and 'no reasoning' versions of the same model, but only one copy of the model weights would be loaded into llama-server's RAM.

Regarding mlx, I haven't tried it with this model. Does it work with unsloth dynamic quantization? I looked at mlx-community and found this one, but I'm not sure how it was quantized. The weights are about the same size as unsloth's 4-bit XL model: https://huggingface.co/mlx-community/Qwen3.5-35B-A3B-4bit/tr...

suprjami · 2026-03-02T08:40:51 1772440851

Yes that's right. The config is described by the developer here:

https://www.reddit.com/r/LocalLLaMA/comments/1rhohqk/comment...

And is in the sample config too:

https://github.com/mostlygeek/llama-swap/blob/main/config.ex...

iiuc MLX quants are not GGUFs for llama.cpp. They are a different file format which you use with the MLX inference server. LM Studio abstracts all that away so you can just pick an MLX quant and it does all the hard work for you. I don't have a Mac so I have not looked into this in detail.

BoredomIsFun · 2026-03-01T10:21:50 1772360510

FYI UD quants of 3.5-35BA3B are broken, use bartowski or AesSedai ones.

regularfry · 2026-03-01T14:15:42 1772374542

They've uploaded the fix. If those are still broken something bad has happened.

rahimnathwani · 2026-02-28T22:36:08 1772318168

UD-Q4_K_XL?

msuniverse2026 · 2026-02-28T22:13:41 1772316821

I've had an AMD card for the last 5 years, so I kinda just tuned out of local LLM releases because AMD seemed to abandon rocm for my card (6900xt) - Is AMD capable of anything these days?

pja · 2026-02-28T23:07:41 1772320061

> I've had an AMD card for the last 5 years, so I kinda just tuned out of local LLM releases because AMD seemed to abandon rocm for my card (6900xt) - Is AMD capable of anything these days?

Sure. Llama.cpp will happily run these kinds of LLMs using either HIP or Vulkan.

Vulkan is easier to get going using the Mesa OSS drivers under Linux, HIP might give you slightly better performance.

wirybeige · 2026-02-28T22:32:46 1772317966

The Vulkan backend for llama.cpp isn't that far behind ROCm for prompt processing and token generation speeds

mmis1000 · 2026-03-03T12:58:08 1772542688

I think AMD just added ROCm support for RDNA2 recently? I can run torch and aisudio with it just fine.

They also finally fixed building all the AI-related stuff on Windows, so you are no longer limited to Linux for these.

suprjami · 2026-02-28T22:01:50 1772316110

The cheapest option is two 3060 12G cards. You'll be able to fit the Q4 of the 27B or 35B with an okay context window.

If you want to spend twice as much for more speed, get a 3090/4090/5090.

If you want long context, get two of them.

If you have enough spare cash to buy a car, get an RTX Ada with 96G VRAM.

barrkel · 2026-02-28T22:34:33 1772318073

RTX 6000 Pro Blackwell, not Ada, for 96GB.

suprjami · 2026-03-01T13:17:22 1772371042

Ah thanks.

The names are so good and not repetitious.

No not the RTX 6000. No not the A6000...

chr15m · 2026-03-01T03:36:12 1772336172

Thanks this is a great summary of the tradeoffs!

dajonker · 2026-02-28T22:13:14 1772316794

Radeon R9700 with 32 GB VRAM is relatively affordable for the amount of RAM and with llama.cpp it runs fast enough for most things. These are workstation cards with blower fans and they are LOUD. Otherwise if you have the money to burn get a 5090 for speeeed and relatively low noise, especially if you limit power usage.

cyberax · 2026-03-01T01:20:34 1772328034

I have a pair of Radeon AI PRO R9700 with 32 GB, and so far they have been a pleasure to use. Drivers work out-of-the-box, and they are completely quiet when unused. They are capped at 300W power, so even at 100% utilization they are not too loud.

I was thinking about adding after-market liquid cooling for them, but they're fine without it.

rubiquity · 2026-03-01T07:03:03 1772348583

This is great to hear! Out of curiosity, which brand did you go with? I tend to stick to Sapphire but the prices are within $200 of each other.

cyberax · 2026-03-02T02:13:11 1772417591

I got Sapphires because they were the ones available at the time of purchase :)

CamperBob2 · 2026-02-28T22:24:29 1772317469

I think the 27B dense model at full precision and 122B MoE at 4- or 6-bit quantization are legitimate killer apps for the 96 GB RTX 6000 Pro Blackwell, if the budget supports it.

I imagine any 24 GB card can run the lower quants at a reasonable rate, though, and those are still very good models.
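The rough arithmetic behind those sizes, as a sketch: weight memory is approximately parameter count times bits per weight, divided by 8. The function name is hypothetical, "full precision" is assumed here to mean 16-bit weights, and the estimate covers weights only. It ignores KV cache and activations, and real quant files mix bit widths, so they usually come out somewhat larger than the nominal figure.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * bits_per_weight / 8


for name, params, bits in [
    ("27B dense, 16-bit", 27, 16),
    ("122B MoE, 4-bit", 122, 4),
    ("122B MoE, 6-bit", 122, 6),
    ("35B MoE, 4-bit", 35, 4),
]:
    print(f"{name}: ~{weight_gb(params, bits):.1f} GB")
```

On these assumptions the 16-bit 27B and the 4-bit 122B leave real headroom on 96 GB, the 6-bit 122B is borderline at best, and a nominal 4-bit 35B leaves a few GB for context on a 24 GB card.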

Big fan of Qwen 3.5. It actually delivers on some of the hype that the previous wave of open models never lived up to.

MarsIronPI · 2026-02-28T22:34:07 1772318047

I've had good experience with GLM-4.7 and GLM-5.0. How would you compare them with Qwen 3.5? (If you have any experience with them.)

CamperBob2 · 2026-02-28T23:48:08 1772322488

No experience with 5 and not much with 4.7, but they both have quite a few advocates over on /r/localllama.

Unsloth's GLM-4.7-Flash-BF16.gguf is quite fast on the 6000, at around 100 t/s, but definitely not as smart as the Qwen 3.5 MoE or dense models of similar size. As far as I'm concerned Qwen 3.5 renders most other open models short of perhaps Kimi 2.5 obsolete for general queries, although other models are still said to be better for local agentic use. That, I haven't tried.

andsoitis · 2026-02-28T22:20:28 1772317228

For fast inference, you’d be hard pressed to beat an Nvidia RTX 5090 GPU.

Check out the HP Omen 45L Max: https://www.hp.com/us-en/shop/pdp/omen-max-45l-gaming-dt-gt2...

laweijfmvo · 2026-02-28T22:37:10 1772318230

I never would have guessed that in 2026, data centers would be measured in Watts and desktop PCs measured in liters.

andsoitis · 2026-02-28T23:41:45 1772322105

The Omen was nigh.

zozbot234 · 2026-02-28T22:25:23 1772317523

It depends. How much are you willing to wait for an answer? Also, how far are you willing to push quantization, given the risk of degraded answers at more extreme quantization levels?

throwdbaaway · 2026-03-01T03:47:23 1772336843

For 27B, just get a used 3090 and hop on to r/LocalLLaMA. You can run a 4bpw quant at full context with Q8 KV cache.
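To see why the KV cache format matters at long context, here is a rough sketch of the usual size formula for a standard attention transformer: 2 (K and V) × layers × KV heads × head dimension × context length × bytes per element. The architecture numbers below are made-up placeholders, not the real 27B config (check the model's config.json), and models with non-standard attention layers won't follow this formula exactly. The 8.5 bits per element assumes llama.cpp's Q8_0 layout (32 int8 values plus a 16-bit scale per block); other runtimes may store a Q8 cache differently.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_ctx, bits_per_elem):
    """Approximate KV cache size in GiB for standard (grouped-query) attention."""
    elems = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # K and V
    return elems * bits_per_elem / 8 / 2**30


# Hypothetical architecture values, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM, CTX = 32, 4, 128, 131072

f16 = kv_cache_gib(LAYERS, KV_HEADS, HEAD_DIM, CTX, 16)
q8 = kv_cache_gib(LAYERS, KV_HEADS, HEAD_DIM, CTX, 8.5)
print(f"f16 KV cache:  {f16:.2f} GiB")
print(f"q8_0 KV cache: {q8:.2f} GiB")
```

Whatever the real architecture numbers are, the ratio holds: a Q8_0 cache takes a bit over half the memory of an f16 one (8.5 vs 16 bits per element).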

elorant · 2026-02-28T22:25:32 1772317532

Macs or a Strix Halo, unless you want to go lower than 8-bit quantization, in which case any GPU with 24 GB of VRAM would probably run it.