
Purely anecdotal, but GPT 5.4 has been better than Opus 4.6 for me in the week or so since it came out. It's interesting to see it rank fairly low on that table. Opus "talks" better and produces nicer output (or at least it renders better Markdown in OpenCode) than 5.4.


Chatbot Arena is notoriously unreliable for several reasons. First, it's (at least in theory) based on ordinary human feedback, and judging by current voting trends, ordinary people clearly are not very good at identifying experts or even remotely correct statements. Second, the leaderboards are gamed hard by the big companies; even ARC-AGI has entered the actively gamed stage by now. Sure, the current generation of models is certainly better than the last, and if two models are vastly different on the leaderboards there may be something fundamental behind it, but there is hardly any reason to use these kinds of comparison tables to distinguish among the latest models.


Not in my experience. Quoting my tweet:

Gave the same prompt to GPT 5.4 (high) and Opus 4.6 (high).

GPT 5.4 implemented the feature, refactored the code (was not asked to), removed comments that were not added in that session, made the code less readable, and introduced a bug. "Undo All".

Opus 4.6 correctly recognized that the feature is already implemented in the current code (yeah, lol) and proposed implementing tests and updating the docs.

Opus 4.6 is still the best coding agent.

So yeah, GPT 5.4 (high) didn't even check if the feature was already implemented.

Tried other tasks, tried "medium" reasoning - disappointment.


I have ChatGPT and Claude code-review each other's outputs. ChatGPT thinks its solutions are better than what Claude produces. What was more surprising to me is that Claude, more often than not, prefers ChatGPT's responses too.

I am not sure one can really extrapolate much from that, but I do find it interesting nonetheless.

I think language is also an important factor. I have a hard time deciding which of the two LLMs is worse at Swift, for example. They both seem equally great and awful in different ways.


I do the same (I have both review a piece of code), and Codex tends to produce more nitpicky feedback. Opus usually agrees with it on around half the feedback, but says the other half is too nitpicky to implement. I generally agree with Opus' assessment, and do agree that Codex nitpicks a lot.

I can't even use Codex for planning because it goes down deep design rabbit holes, whereas Opus is great at staying at the proper, high level.


Is this a sample size of one task, or a consistent finding across many tasks?



