
The problem is the methodology they use to hold them out. For a truly independent validation set, they need to hold out the material before augmentation, not after. If you hold out after augmentation, the validation set already reflects biases from the training regimen, which artificially boosts the model's measured performance. That is not sufficient to demonstrate the model is generalizing properly.

By analogy: instead of taking leaves from different trees, they are taking leaves from different branches of the same tree.
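
A minimal sketch of the distinction, assuming a generic NumPy/scikit-learn setup; the data, the `augment` helper, and its parameters are all hypothetical placeholders, not anything from the work under discussion:

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Hypothetical stand-in data: X holds the original
    # (pre-augmentation) samples, y their labels.
    X = np.random.rand(1000, 32)
    y = np.random.randint(0, 2, size=1000)

    # Correct: split the ORIGINAL samples first...
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # ...then augment only the training split. `augment` is a
    # placeholder for whatever transformation pipeline is used.
    def augment(batch, copies=3, noise_scale=0.01):
        noisy = [batch + np.random.normal(0.0, noise_scale, batch.shape)
                 for _ in range(copies)]
        return np.concatenate([batch] + noisy, axis=0)

    X_train_aug = augment(X_train)
    y_train_aug = np.tile(y_train, 4)  # original + 3 augmented copies

    # Wrong (the parent comment's point): augmenting X first and
    # splitting afterwards puts augmented variants of the same original
    # sample into both train and validation, so the validation score
    # partly measures memorization of augmentation artifacts rather
    # than generalization.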



That would definitely make the evaluation more robust. My fear is that, with LLMs at hand, people have become allergic to preparing good human-labelled evaluation sets and will always, to some degree, use an LLM as a crutch.


I would agree with that.



