
> Wonder what am I doing wrong?

You're comparing 100B-parameter open models running on a consumer laptop vs. private models with at the very least 1T parameters running on racks of bleeding-edge professional GPUs.

Local agentic coding is closer to "shit me the boilerplate for an Android app" than to "deep research questions", especially on your machine.
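
Rough back-of-the-envelope for why that comparison is lopsided (a sketch, assuming weight memory is just params × bytes per weight, ignoring KV cache and overhead):

    # Approximate weight memory: params * bytes per weight (ignores KV cache,
    # activations and runtime overhead).
    def weight_gb(params_billions: float, bytes_per_weight: float) -> float:
        return params_billions * bytes_per_weight  # the 1e9 factors cancel out

    # ~100B open model quantized to ~4.5 bits/weight fits in a 64-128 GB laptop:
    print(f"100B @ ~q4: ~{weight_gb(100, 0.56):.0f} GB")   # ~56 GB

    # ~1T private model served in bf16 needs ~2 TB of weights, i.e. racks of GPUs:
    print(f"1T @ bf16: ~{weight_gb(1000, 2.0):.0f} GB")    # ~2000 GB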


Looks at the headline: Qwen3.5 122B and 35B models offer Sonnet 4.5 performance on local computers

Yes, and Devstral 2 24B Q4 is supposed to be 90% as good, but it can't even reliably write to a file on my machine.

There are the benchmarks, the promises, and then there's what everybody can try at home.


maybe a harness problem?

Having tried Mistral Vibe, the harness that was supposedly designed for Devstral: that thing is abysmal. I feel sorry for whatever they did to that model; it didn't deserve it.

The thing I noticed most was asking it for help with configuring local MCP servers in Mistral Vibe (something it supports; it literally shows how many MCP servers are connected on the startup screen), whereupon it began scanning my local machine for servers running the "Minecraft Protocol".

I want Mistral to do well, and I use their Voxtral Transcribe 2; that one has been useful. I'd even like a well-made Mistral Vibe (c'mon, "oui oui baguette" is a hilarious replacement for "thinking"). But Mistral are so far behind, and they don't even seem to know or accept that they are.


The hardware difference explains runtime performance differences, not task performance.

Speculation is that the frontier models are all below 200B parameters, but a 2x size difference wouldn't fully explain the task performance differences.


> Speculation is that the frontier models are all below 200B parameters

Some versions of some of the models are around that size, which you might hit, for example, with the ChatGPT auto-router.

But the frontier models are all over 1T parameters. Source: watch interviews with people who have left one of the big three labs and now work at the Chinese labs, talking about how to train 1T+ models.


> The hardware difference explains runtime performance differences, not task performance.

Yes it does.


Care to elaborate?

Certainly not Opus. That beast feels very heavy: the coherence of longer-form prose is usually a good marker, and it can spit out coherent 4,000-word short stories in a single shot.

He's running a 35B-parameter model. Frontier models are well over a trillion parameters at this point. Parameters = smarts. There are 1T+ open source models (e.g. GLM5), and they're actually getting to the point of being comparable with the closed source models, but you cannot remotely run them on any hardware available to us.

Core speed/count and memory bandwidth determine your performance. Memory size determines your model size, which determines your smarts. Broadly speaking.
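
To put rough numbers on that (a sketch, assuming single-stream decoding is memory-bandwidth-bound, so tokens/s ≈ bandwidth / bytes read per token):

    # Decode-speed estimate under a memory-bandwidth-bound assumption.
    def tokens_per_second(bandwidth_gb_s: float, active_params_b: float,
                          bytes_per_weight: float) -> float:
        gb_read_per_token = active_params_b * bytes_per_weight
        return bandwidth_gb_s / gb_read_per_token

    # Consumer laptop: ~100 GB/s RAM, 35B dense model at ~4.5 bits/weight:
    print(f"{tokens_per_second(100, 35, 0.56):.1f} tok/s")        # ~5 tok/s

    # GPU node: 8 HBM stacks at ~3.3 TB/s each, MoE with ~40B active params in fp8:
    print(f"{tokens_per_second(8 * 3300, 40, 1.0):.0f} tok/s")    # several hundred tok/s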


The architecture is also important: there's a trade-off for MoE. There used to be a rough rule of thumb that a 35B-total/3B-active model would be equivalent in smarts to an 11B dense model, give or take, but that hasn't been accurate for a while.
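
The old heuristic was the geometric mean of total and active parameters, which is where the ~11B figure came from:

    # MoE "dense-equivalent" rule of thumb: sqrt(total_params * active_params).
    import math
    total_b, active_b = 35, 3
    print(f"~{math.sqrt(total_b * active_b):.1f}B dense-equivalent")  # ~10.2B, i.e. roughly 11B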

> There are 1T+ open source models (e.g. GLM5),

GLM-5 is a ~750B model.


Who would have thought AI labs with billions upon billions in R&D budget would have better models than a free alternative.

I'll add that AI labs put a lot of resources into allowing the AI to search the web; that makes a big difference.

I use search as well, via Open WebUI + SearXNG.
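
The plumbing is simple enough to reproduce by hand. A minimal sketch (assuming a local SearXNG instance at localhost:8080 with JSON output enabled in its settings.yml, and an OpenAI-compatible local endpoint such as Ollama at localhost:11434; hostnames, ports and model name are illustrative, not my exact setup):

    # Query a local SearXNG instance and feed the top hits to a local model.
    import requests

    def web_search(query: str, n: int = 5) -> str:
        r = requests.get("http://localhost:8080/search",
                         params={"q": query, "format": "json"}, timeout=10)
        r.raise_for_status()
        hits = r.json().get("results", [])[:n]
        return "\n".join(f"- {h['title']}: {h.get('content', '')} ({h['url']})"
                         for h in hits)

    def answer_with_search(question: str) -> str:
        context = web_search(question)
        r = requests.post("http://localhost:11434/v1/chat/completions", json={
            "model": "qwen3:32b",  # illustrative model name
            "messages": [
                {"role": "system",
                 "content": "Answer the question using these search results:\n" + context},
                {"role": "user", "content": question},
            ],
        }, timeout=120)
        r.raise_for_status()
        return r.json()["choices"][0]["message"]["content"]

    print(answer_with_search("What did the Qwen3.5 release announce?"))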


