When people claim that there is such a thing as "X% accuracy in reasoning", it's really hard to take anything else they say seriously, no matter how impressive.
AI (and humans!) aside, whether there could be an oracle that can "answer all questions" is a settled question: such a thing cannot exist.
But this is already going too deep IMO.
When people start talking about percentages or benchmark scores, there has to be some denominator.
And there can be no such bias-free denominator for
- trivia questions
- mathematical questions (oh, maybe I'm wrong here, intuitively I'd say it's impossible for various reasons: varying "hardness", undecidable problems etc)
- historical or political questions
I wanted to include "software development tasks", but it would be a distraction. Maybe there will be a good benchmark for this, I'm aware there are plenty already. Maybe AI will be capable of being a better software developer than me in some capacity, so I don't want to include this part here. That also maps pretty well to "the better the problem description, the better the output", which doesn't seem to work so neatly with the other categories of tasks and questions.
Even if the whole body of questions/tasks/prompts were very constrained and covered only a single domain, it seems impossible to guarantee that such a benchmark is "bias-free" (I know AGI folks love this word).
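To make the denominator point concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the per-category success rates, the category names and the two question mixes are hypothetical and not taken from any real model or benchmark. It only shows that a single "X% accuracy" number moves with the mix of questions you choose to count, even when the model itself does not change.

```python
# Hypothetical per-category success rates for one imaginary model.
# These numbers are invented for illustration only.
per_category_accuracy = {"trivia": 0.90, "math": 0.40, "history": 0.70}


def overall_accuracy(question_counts):
    """Headline score: expected correct answers divided by total questions."""
    total = sum(question_counts.values())
    correct = sum(
        per_category_accuracy[category] * count
        for category, count in question_counts.items()
    )
    return correct / total


# Two hypothetical benchmarks of 100 questions each, differing only in their mix.
benchmark_a = {"trivia": 80, "math": 10, "history": 10}
benchmark_b = {"trivia": 10, "math": 80, "history": 10}

print(f"benchmark A: {overall_accuracy(benchmark_a):.0%}")
print(f"benchmark B: {overall_accuracy(benchmark_b):.0%}")
```

Same imaginary model, same per-category behaviour, and the headline number is 83% on one mix and 48% on the other. Whoever picks the denominator picks the score.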
Maybe in some interesting special cases? For example, very constrained and clearly defined classes of questions, at which point, the "language" part of LLMs seems to become less important and more of a distraction. Sure, AI is not just LLMs, and LLMs are not just assistants, and Neural Networks are not just LLMs...
That's where the problem begins, to be honest: I don't even know how to align the "benchmark" claims with the kind of AI they are examining and the kinds I know exist.
Sure, it's possible to benchmark how well an AI decides whether, for example, a picture shows a rabbit.
Even then: for some pictures, it's gotta be undecidable, no matter how good the training data is?
I'm just a complete layman commenting on this; I'm not even fluent in the absolute basics of artificial neural networks like perceptrons, gradient descent, backpropagation and typical non-LLM CNNs that are used today, GANs etc.
I am and was impressed by AI and deep learning, but to this day I am thoroughly disappointed by the hubris of snake-oil salespeople who think it's valuable and meaningful to "benchmark" machines on "general reasoning".
I mean, it's already a thing in humans. There are IQ tests for the non-trivia parts. And even these have plenty of discussion revolving around them, for good reason.
Is there some "AI benchmark" that exclusively focuses on doing recent IQ tests on models, preferably editions that were published after the particular knowledge cutoff of the respective models? I found (for example) this study [1], but to be honest, I'm not the kind of person who is able to get the core insights presented in such a paper by skimming through it.
Because I think there are impressive results; it's just becoming very hard to see through the bullshit as an average person.
I would also love to understand more about the current state of the research on the "LLMs as compression" topic [2][3].
[1] https://arxiv.org/pdf/2507.20208
[2] https://www.mattmahoney.net/dc/text.html
[3] https://arxiv.org/abs/2410.21352