In the performance tests they said they used "consensus among 64 samples" and "re-ranking 1000 samples with a learned scoring function" for the best results.
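For context, the two tricks they describe — consensus (majority vote over the final answers of many samples) and re-ranking (keeping whichever sample a scorer rates highest) — can be sketched roughly like this; the scorer here is a toy stand-in (answer length), not their learned function:

```python
from collections import Counter

def consensus_answer(samples):
    """'Consensus': majority vote over the final answers of N samples."""
    return Counter(samples).most_common(1)[0][0]

def best_of_n(samples, score_fn):
    """'Re-ranking': keep whichever sample the scoring function rates highest."""
    return max(samples, key=score_fn)

# Toy stand-ins: 8 sampled answers and a trivial scorer (answer length).
samples = ["42", "41", "42", "42", "forty-two", "42", "41", "42"]
print(consensus_answer(samples))   # the most frequent answer wins the vote
print(best_of_n(samples, len))     # the highest-scoring answer wins the rerank
```

Both approaches only help when you can cheaply compare candidates, which is exactly the problem for open-ended tasks like personal writing.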
If they did something similar for these human evaluations, rather than just using a single sample, you can see how that would be terrible for personal writing.
I don’t understand how that generalizes. I’m not going to be able to train a scoring function for every arbitrary task I need to do, and in many cases ranking responses is at least as hard as generating one in the first place.