This is what I don't get: how can GPT-5 ace obscure AIME problems while simultaneously falling into the most common fallacy about airfoils (despite copious training data calling it out as a fallacy)? And I believe you that in some context it failed to understand this simple rearrangement of terms; there's basic stuff I ask it that it sometimes fails at too.
It still can't actually reason; LLMs are still fundamentally madlib generators that produce output that statistically looks like reasoning.
And if it was trained on both sides of the airfoil fallacy, it doesn't "know" which side is the fallacy; it'll just regurgitate whichever side of the argument better fits your prompt.
Since the first reasoning preview came out (o1?) a year ago, I've benchmarked a lot of these newest AI models on private problems that require only insight, no clever techniques.
The common theme I've seen is that the AI will just throw "clever tricks" at the problem and call it a day.
For example, a common game theory setting that involves xor is Nim. Give it a game theory problem that involves xor but doesn't relate to Nim at all, and it will throw a bunch of "clever" Nim tricks at the problem, tricks that are "well known" in the literature but don't remotely apply, and then make up a headcanon about why they're correct.
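(For context, the "well known" Nim trick that keeps getting misapplied is the nim-sum rule: the player to move loses exactly when the xor of the pile sizes is 0. A minimal sketch, with made-up pile sizes purely for illustration:)

```python
from functools import reduce
from operator import xor

def nim_sum(piles):
    # Nim-sum: xor of all pile sizes. In standard Nim, the position is a
    # loss for the player to move exactly when this is 0.
    return reduce(xor, piles, 0)

def first_player_wins(piles):
    return nim_sum(piles) != 0

print(first_player_wins([3, 4, 5]))  # True:  3 ^ 4 ^ 5 = 2
print(first_player_wins([1, 2, 3]))  # False: 1 ^ 2 ^ 3 = 0
```

The point is that this rule is specific to Nim-like impartial games; applying it to an unrelated xor problem is exactly the kind of pattern-matching I keep seeing.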
It seems like AI has maybe the actual reasoning of a 5th grader, but the knowledge of a PhD student. A toddler with a large hammer.
Also, keep in mind that it's not stated whether GPT-5 has access to Python, Google, etc. while doing these benchmarks, which would certainly make them easier. A lot of these problems are gated by the fact that you only have ~12 minutes to solve each one, while the AI can work through many candidate solutions at once.
No matter what benchmarks it passes, even the IMO (and I've been in the maths community for a long time), I will maintain the position that none of your benchmarks matter to me until it can actually replace my workflow and creative insights. Trust your own eyes and experience, not whatever hype marketing there is.
Because reading the different ideas about airfoils and actually deciding which is more accurate requires a level of reasoning about the situation that isn't really present at training or inference time. A raw LLM will tend to just go with the popular option; an RLHF'd one might be biased towards the more authoritative-sounding one. (I think a lot of people have a contrarian bias here: I frequently hear people reject an idea entirely because they've seen it get 'debunked', even when it's not actually as wrong as they assume.)
Genuine question: are these companies just including those "obscure" problems in their training data and overfitting to answer them well, in order to pump up their benchmark scores?
o3-pro, gpt5-pro, gemini 2.5-pro, etc. still can't solve very basic first-principles math problems that rely only on raw thinking, no special tricks. Personally, I think it's because those problems aren't in their training data: if I inspect their CoT/reasoning, it's clear to me that they're just running around in circles, applying "well known" techniques and hoping one of them fits, without actually logically verifying that it does. It's a very inhuman reasoning style (and ultimately incorrect). It's like somebody who was taught a bunch of PhD-level tricks but has the underlying reasoning of a toddler.
I wonder how well their GPT-5 IMO research model would do on some of my benchmark problems.
Context matters a lot here: it may fail on this problem within a particular context (whatever the original commenter was working on), but then be able to solve it when presented with the question in isolation. The way you phrase the question may also hint the model towards the answer.
Yesterday, Claude Opus 4.1 failed to figure out that `-(1-alpha)`, i.e. `-1+alpha`, is the same as `alpha-1`.
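(The rearrangement is just distributing a minus sign; a quick sympy check, assuming the same symbolic expression, confirms the identity:)

```python
import sympy as sp

alpha = sp.Symbol("alpha")
print(sp.expand(-(1 - alpha)))                  # alpha - 1
print(sp.simplify(-(1 - alpha) - (alpha - 1)))  # 0, so the two forms are equal
```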
We are still a little bit away from AGI.