This looks cherry-picked. For example, Claude Opus had a higher score on SWE-bench Verified, so they conveniently left it out. Also, GDPval is literally a benchmark made by OpenAI.


And who believes that the difference between 91.9% and 92.4% is significant on these benchmarks? They clearly have margins of error that get swept under the rug.
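For a sense of scale, here's a minimal back-of-the-envelope sketch (Python, assuming ~500 tasks, which happens to be SWE-bench Verified's size, and treating each task as an independent pass/fail trial; the scores are the ones quoted above, not from any official eval):

    import math

    def accuracy_ci(p: float, n: int, z: float = 1.96) -> float:
        # Half-width of a normal-approximation 95% CI for a pass rate
        return z * math.sqrt(p * (1 - p) / n)

    n = 500              # assumed benchmark size
    p_a, p_b = 0.919, 0.924

    for p in (p_a, p_b):
        print(f"{p:.1%} +/- {accuracy_ci(p, n):.1%}")  # roughly +/- 2.4 points each

    # Two-proportion z-test on the 0.5-point gap
    p_pool = (p_a + p_b) / 2
    se = math.sqrt(2 * p_pool * (1 - p_pool) / n)
    print(f"z = {(p_b - p_a) / se:.2f}")  # ~0.29, far below the 1.96 needed for p < 0.05

Under those assumptions each score carries roughly +/- 2.4 points of sampling noise, so a 0.5-point gap tells you essentially nothing, and that's before counting run-to-run variance from sampling temperature and harness differences.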


agreed.


The fact that the post compares their reasoning model against Gemini 3 Pro (the "non-reasoning" model) rather than Gemini 3 Pro Deep Think (the reasoning one) is quite nasty. If you compare GPT-5.2 Thinking to Gemini 3 Pro Deep Think, the scores are quite similar (sometimes one is better, sometimes the other).


uh oh, where did SWE-bench go :D


maybe they will release it with gpt-5.2-codex



