Hacker News

Because loss != quality. This was one of the most counterintuitive discoveries in ML for me. People treat the two as interchangeable, and to a certain extent — a controlled extent — they are.

But if your dataset doesn’t include a word about Captain Picard, no amount of training will get it to know about the USS Enterprise. Yet your loss metrics will still reach that magical 2.1 value with time. (2.1 is pretty much “excellent” quality; below that, you’re probably overfitting and need a bigger dataset.)
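To make the point concrete (this is a toy sketch, not anything from a real training run): cross-entropy loss only sums over the tokens that actually appear in your data, so a fact that's missing from the corpus simply never enters the objective. The loss can look great while the model knows nothing about Picard.

```python
import math

def cross_entropy(predicted_probs):
    """Mean negative log-likelihood (in nats) over the actual next tokens.

    Note what this measures: only how much probability the model
    assigned to tokens that occur in the training data. Anything
    absent from the corpus contributes nothing to the sum.
    """
    return -sum(math.log(p) for p in predicted_probs) / len(predicted_probs)

# Hypothetical probabilities the model assigned to each true next token.
probs_seen_in_corpus = [0.5, 0.25, 0.125, 0.125]
loss = cross_entropy(probs_seen_in_corpus)
print(f"loss = {loss:.3f} nats")  # ~1.560
```

Drive the per-token probabilities up and the loss falls, regardless of whether the corpus ever mentioned the thing you care about.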

Thanks for the comment, friendo. I wasn’t sure if this would get any attention at all, but that made it worth it. Be sure to DM me on Twitter if you’d like to chat about anything ML related: basic questions are some of my favorite things to help with, so feel free.



This isn't really correct.

Loss is a training-time measurement based on performance on the training objective.

The training objective is rarely the same as the end-user task that is being benchmarked.

For example, language models are classically trained on next-token prediction. The closest benchmark for that is perplexity[1], often reported on the WikiText-103 dataset.
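For anyone unfamiliar with the connection: perplexity is just the exponential of the mean per-token cross-entropy loss (in nats), which is why the two track each other so closely. A quick sketch:

```python
import math

def perplexity(mean_nll: float) -> float:
    """Perplexity from mean negative log-likelihood in nats."""
    return math.exp(mean_nll)

# The "2.1" loss mentioned upthread corresponds to a perplexity of ~8.17:
# on average the model is as uncertain as a uniform choice over ~8 tokens.
print(f"perplexity at loss 2.1: {perplexity(2.1):.2f}")  # ~8.17
```

(If your loss is reported in bits rather than nats, use 2**loss instead of exp.)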

Until around 2019 this was often reported, but since then most large-language-model papers have moved on to more useful benchmarks, such as question-answering performance or embedding quality.

Unfortunately there aren't great benchmarks (yet?) for generative tasks. Quality is quite hard to measure here in a systematic way (see, e.g., the issues with BLEU in summarization benchmarks).

[1] https://en.wikipedia.org/wiki/Perplexity



