
> At the moment it feels like most people "reviewing" models depend on their beliefs and agenda, and there are no objective ways to evaluate and compare models

I think you’ll always have some disagreement in life generally, but especially for things like this. Code has a level of subjectivity: good variable names, the right amount of abstraction, verbosity, overcomplexity, and so on are at least partly matters of opinion. That makes benchmarking something so subjective tough. Furthermore, LLMs aren’t deterministic, and sometimes you just get a bad seed from the RNG.
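
To make the RNG point concrete, here’s a minimal sketch (assuming the OpenAI Python client; the model name and seed are arbitrary) of pinning the obvious sources of variance before comparing outputs:

    # A minimal sketch, assuming the OpenAI Python client. Pinning
    # temperature and seed reduces (but does not eliminate) run-to-run
    # variance, which is one reason single-run "reviews" are shaky.
    from openai import OpenAI

    client = OpenAI()

    def ask(prompt: str, seed: int) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # any chat model
            messages=[{"role": "user", "content": prompt}],
            temperature=0,        # greedy-ish sampling
            seed=seed,            # best-effort determinism only
        )
        return resp.choices[0].message.content

    # Even with the same seed, outputs can differ across runs or backends,
    # so a fair comparison needs many samples per model, not one.
    answers = [ask("Write a Go function that reverses a slice.", seed=7)
               for _ in range(5)]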

Not only that, but the harness and the prompt used to guide the model make a difference. Claude responds to the word “ultrathink”, but if GPT-5 responds to “think harder” instead, then what should go in the prompt?
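
A harness that wants to compare models fairly ends up encoding those quirks somewhere. A rough sketch of the idea (the keyword map and build_prompt helper below are hypothetical, not any tool’s real API):

    # Hypothetical sketch: per-model "think harder" phrasing baked into a
    # harness. The mapping is illustrative, not an official spec.
    REASONING_HINTS = {
        "claude-sonnet": "ultrathink",
        "gpt-5": "think harder",
    }

    def build_prompt(model: str, task: str) -> str:
        hint = REASONING_HINTS.get(model, "")
        # Each model gets its own nudge, which already makes "the same
        # prompt" an apples-to-oranges comparison.
        return f"{hint}\n\n{task}".strip()

    print(build_prompt("gpt-5", "Refactor this Go package for clarity."))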

Anecdotally, I’ve had the best luck with agentic coding when using Claude Code with Sonnet: better than Sonnet in other tools, and better than Claude Code with other models. But I mostly use Go and Dart, and I aggressively manage the context. I’ve found the GPT models can’t write Zig at all, while Gemini can, though both write Python excellently. All that said, if I didn’t like an answer, I’d prompt again; but if I liked the answer, I never tried again with a different model to see whether I’d like its answer even more. So it’s hard to know what could’ve been.

I’ve used a ton of models and harnesses. Cursor is good too, and I’ve been impressed with more models in Cursor. I don’t get the hype around Qwen, though, because I’ve found it makes lots of small(er) changes in a loop, and that’s noisy and expensive. Gemini is also very smart but worse at following my instructions, though I never took the time to experiment with prompting it.


