Hacker News

Because loss != quality. This was one of the most counterintuitive discoveries in ML for me. People treat the two as interchangeable, and to a certain extent — a controlled extent — they are.

But if your dataset doesn’t include a word about Captain Picard, no amount of training will get it to know about the USS Enterprise. Yet your loss metrics will still reach that magical 2.1 value with time. (2.1 is pretty much “excellent” quality; below that, you’re probably overfitting and need a bigger dataset.)
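To make the point concrete (this is a toy sketch, not anything from a real training run): cross-entropy loss only sums over the tokens that actually appear in your data, so a fact that's missing from the corpus simply never enters the objective. The loss can look great while the model knows nothing about Picard.

```python
import math

def cross_entropy(predicted_probs):
    """Mean negative log-likelihood (in nats) over the actual next tokens.

    Note what this measures: only how much probability the model
    assigned to tokens that occur in the training data. Anything
    absent from the corpus contributes nothing to the sum.
    """
    return -sum(math.log(p) for p in predicted_probs) / len(predicted_probs)

# Hypothetical probabilities the model assigned to each true next token.
probs_seen_in_corpus = [0.5, 0.25, 0.125, 0.125]
loss = cross_entropy(probs_seen_in_corpus)
print(f"loss = {loss:.3f} nats")  # ~1.560
```

Drive the per-token probabilities up and the loss falls, regardless of whether the corpus ever mentioned the thing you care about.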

Thanks for the comment, friendo. I wasn’t sure if this would get any attention at all, but that made it worth it. Be sure to DM me on Twitter if you’d like to chat about anything ML related: basic questions are some of my favorite things to help with, so feel free.



This isn't really correct.

Loss is a training-time measurement based on performance on the training objective.

The training objective is rarely the same as the end-user task that is being benchmarked.

For example, language models are classically trained on next-token prediction. The closest benchmark for that is perplexity[1], often reported on the WikiText-103 dataset.
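For anyone unfamiliar with the connection: perplexity is just the exponential of the mean per-token cross-entropy loss (in nats), which is why the two track each other so closely. A quick sketch:

```python
import math

def perplexity(mean_nll: float) -> float:
    """Perplexity from mean negative log-likelihood in nats."""
    return math.exp(mean_nll)

# The "2.1" loss mentioned upthread corresponds to a perplexity of ~8.17:
# on average the model is as uncertain as a uniform choice over ~8 tokens.
print(f"perplexity at loss 2.1: {perplexity(2.1):.2f}")  # ~8.17
```

(If your loss is reported in bits rather than nats, use 2**loss instead of exp.)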

Until around 2019 this was often reported, but since then most large-language-model papers have moved on to more useful benchmarks, such as question-answering performance or embedding quality.

Unfortunately there aren't great benchmarks (yet?) for generative tasks. Quality is quite hard to measure here in a systematic way (see, e.g., the issues with BLEU in summarization benchmarks).

[1] https://en.wikipedia.org/wiki/Perplexity



