I think the case for "axis must always go to 0" is overblown. Zero isn't always meaningful; for instance, chance performance, or the performance of trivial algorithms, is likely above 0%. And sometimes forcing the axis to zero hides small changes: if you plot world population 2014-2024 on an axis going to zero, you won't be able to tell whether we are growing or shrinking.
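To make that concrete, here's a rough matplotlib sketch of the two axis choices; the population figures are approximate (roughly 7.3B in 2014 to 8.1B in 2024, interpolated linearly), so treat it as an illustration, not data:

```python
# Same data, two axis choices: forced-to-zero vs tight limits.
import matplotlib.pyplot as plt

years = list(range(2014, 2025))
pop_billions = [7.30 + 0.073 * i for i in range(11)]  # crude linear interpolation

fig, (ax_zero, ax_tight) = plt.subplots(1, 2, figsize=(10, 4))

ax_zero.plot(years, pop_billions)
ax_zero.set_ylim(0, 9)               # axis forced to zero: trend is a near-flat line
ax_zero.set_title("Axis from 0: growth barely visible")

ax_tight.plot(years, pop_billions)   # default tight limits: growth is obvious
ax_tight.set_title("Tight axis: growth obvious")

for ax in (ax_zero, ax_tight):
    ax.set_ylabel("World population (billions)")

plt.tight_layout()
plt.show()
```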
Even granting the start at 30%, the MMLU graph is false. All four bars are at the wrong heights. Even their own 73.7% is not drawn at the right height, and the Mixtral bar, labeled 71.4%, tops out below the 70% mark of the axis.
This is really the kind of marketing trick that makes me avoid a provider / publisher. I can't build trust this way.
I believe they are including the percentage labels as part of the height of the bars! I thought I'd seen every way someone could do dataviz wrong (particularly with a bar chart), but this one is new to me.
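For contrast, the usual way to attach those labels keeps them out of the bar geometry entirely. A minimal matplotlib sketch, using only the two scores quoted above (everything else is made up):

```python
# Labels attached as annotations (ax.bar_label), so they cannot change bar height.
import matplotlib.pyplot as plt

models = ["Theirs", "Mixtral"]      # just the two scores quoted upthread
scores = [73.7, 71.4]

fig, ax = plt.subplots()
bars = ax.bar(models, scores)
ax.bar_label(bars, fmt="%.1f%%")    # text floats above the bar; height stays = score
ax.set_ylim(0, 100)
ax.set_ylabel("MMLU (%)")
plt.show()
```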
That's really strange and incredibly frustrating - but slightly less so if it's consistent with all of the bars (including their own).
I take issue with their choice of bar ordering - they placed the lowest-performing model directly next to theirs to make the gap as visible as possible, and shoved the second-best model (Grok-1) as far from theirs as possible. Seems intentional to me. The more marketing tricks you pile up in a dataviz, the less trust I place in your product for sure.
Interesting! It is probably one of the worst tricks I have seen in a while for a bar graph. Never seen this one before. Trust vanishes instantly when faced with that kind of dataviz.
Wow, that is indeed a novel approach haha, it took me a moment to even understand what you described, since I would never have imagined someone plotting a bar chart like that.
MMLU is not a good benchmark and needs to stop being used.
I can't find the section, but at the end of one of https://www.youtube.com/@aiexplained-official/videos he does a deep dive into the questions and answers in MMLU, and there are so many typos, omissions, and errors in both the questions and the answers that it should no longer be used.
It’s an honest mistake in scaling the bars, and it’s getting fixed soon. The percentages are correct, though; in the process of converting the Excel chart to pretty graphs for the blog, the scale got messed up.
Certainly a bar chart might not be the best choice to convey the data you have. But if you choose to have a bar chart and have it not start at zero, what do the bars help you convey?
For world population you could see whether it is increasing or decreasing, which is good, but it would be hard to evaluate the rate at which the population is changing.
I believe it's a reasonable range for the scores. If a model gets things wrong half the time (coin-flip territory), it's not a useful model at all. So every model below a certain threshold is trash, and there's no need to get granular about how trash it is.
An alternative visualization that could be less triggering to an "all y-axes must have zero" guy would be to plot (1 - value), that is, the % short of a perfect score. You could do this without truncating the axis and get the same level of differentiation between the bars; see the sketch below.
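Something like this, assuming the two scores quoted upthread (the other bars are omitted):

```python
# Plot "% short of a perfect score" instead of the raw score: the axis starts
# at a true zero and the gap between models is just as visible.
import matplotlib.pyplot as plt

models = ["Theirs", "Mixtral"]
scores = [73.7, 71.4]
error_rates = [100 - s for s in scores]   # 26.3 vs 28.6

fig, ax = plt.subplots()
ax.bar(models, error_rates)
ax.set_ylabel("MMLU error rate (%)")      # lower is better
ax.set_ylim(0, 35)
plt.show()
```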
MMLU questions have four options, so two coin flips would give a 25% baseline. HumanEval checks generated code against tests, so a 100-byte program generated by coin flips would have an O(2^-800) baseline (maybe not quite that bad, since infinitely many programs produce the same output). GSM-8K has numerical answers, so an average 3-digit answer generated by coin flips would have roughly an O(2^-9) chance of being correct.
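A quick sanity check of those baselines in Python (taking a uniformly random 3-digit answer for GSM-8K, which comes out closer to 2^-10):

```python
# Back-of-the-envelope chance baselines for the three evals.
mmlu = 1 / 4                # 4 options ~ two fair coin flips -> 25%
humaneval = 2.0 ** -800     # 100 bytes = 800 random bits
gsm8k = 1 / 1000            # uniform over ~1000 three-digit answers, ~2**-10

print(f"MMLU baseline:      {mmlu:.0%}")
print(f"HumanEval baseline: {humaneval:.1e}")    # ~1.5e-241
print(f"GSM-8K baseline:    {gsm8k:.1%} (~2**-10)")
```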
Moreover, using the same axis and scale across unrelated evals makes no sense. 0-100 is the only scale that's meaningful because 0 and 100 being the min/max is the only shared property across all evals. The reason for choosing 30 is that it's the minimum across all (model, eval) pairs, which is a completely arbitrary choice. A good rule of thumb to test this is to ask if the graph would still be relevant 5 years later.
> less triggering to an "all y-axes must have zero" guy
Ever read 'How to Lie with Statistics'? This is an example of exaggerating a smaller difference to make it look more significant. Dismissing it as just being 'triggered' is a bad idea.
In this case I would call it triggered (for lack of a better word) since, as I described earlier, a chart plotting "difference from 100%" would look exactly the same and satisfy the zero-bound requirement, while not being any more or less dishonest.
The point is less about using bad/wrong math; it's about presenting technically correct charts that nonetheless imply wrong conclusions. In this case, by chopping off the bottom of the chart, the visual impression of the ratio between the bars changes. That's the lie.
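To put a number on it, taking the 30% floor and the two scores quoted upthread, the drawn gap comes out inflated by roughly 1.7x; a quick check:

```python
# How much a 30%-floor axis inflates the apparent gap between two bars.
theirs, mixtral, floor = 73.7, 71.4, 30.0

true_ratio = theirs / mixtral                        # what the data says: ~1.032
drawn_ratio = (theirs - floor) / (mixtral - floor)   # what the bars show: ~1.056

lie_factor = (drawn_ratio - 1) / (true_ratio - 1)    # relative-gap inflation: ~1.7x
print(f"true {true_ratio:.3f}, drawn {drawn_ratio:.3f}, inflation {lie_factor:.2f}x")
```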
It does not feel obviously unreasonable/unfair/fake to place the select models in the margins for a relative comparison. In fact, this might be the most concise way to display what I would consider the most interesting information in this context.
Yeah, this is why I ask climate scientists to use a proper 0 K graph but they always zoom it in to exaggerate climate change. Display correctly with 0 included and you’ll see that climate change isn’t a big deal.
The scale should be chosen to allow the reader to correctly infer meaningful differences. If a 1° difference is meaningful in terms of the standard error/CI, and a 1° unit has substantive consequences, then that should be emphasized.
Manager: "looks ok, but can you make our numbers pop? just make the LLaMa bar smaller"