
I'm not generally inclined toward the "they are cheating cheaters" mindset, but I'll point out that fine-tuning is not the same as retraining. It can be done cheaply and quickly.

Models getting 5X better at things all the time is at least as easy to interpret as evidence of task-specific tuning as of breakthroughs in general ability, especially when the things being improved on are published evals with history.



The Google team said it was outside the training window, FWIW:

https://x.com/jack_w_rae/status/1907454713563426883



