> Comparing models, or even different versions of the same model, is a pseudo-scientific mess.

Reminder that in most cases it's impossible to know whether the test sets of public benchmarks have leaked into a model's training data, because most LLMs are not truly open source: the training data usually isn't released, so we can't replicate them. So arguably it's worse than pseudo-science in some cases, pretty much fraud if you account for the VC money pouring in. This is even more evident with little-known models from lesser-known institutes, such as those from the UAE.