
Are you talking about just preferences, or A/B tests on things like retention and engagement? The latter I think are pretty reliable and powerful, though I have never personally run them. Preferences are just as big a mess: WHO the annotators are matters, and if you are using preferences as a proxy for correctness, you're not really measuring correctness; you're measuring, e.g., persuasion. There are a lot of construct validity challenges (which are themselves hard to even measure in-domain).


Yes. All of them are poisoned metrics, just in different ways.

GPT-4o's endless sycophancy was great for retention, GPT-5's style of ending every response in a question is great for engagement.

Are those desirable traits, though? I doubt it. They look like cheap tricks and reek of reward hacking, and A/B testing does indeed reward them. Direct optimization is even worse; combining the two is ruinous.
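A toy sketch of the reward-hacking point (all names and numbers here are hypothetical, not from any real experiment): a variant with worse answers but an engagement-inflating trick can still win a retention A/B test.

```python
import random

random.seed(0)

# Hypothetical variants: (name, answer_quality, engagement_boost_from_trick).
# The "trick" (flattery, trailing questions) inflates engagement
# independently of how good the answers actually are.
variants = [
    ("plain",       0.80, 0.00),
    ("sycophantic", 0.60, 0.25),  # worse answers, big engagement bump
]

def simulate_engagement(quality, trick_boost, n_users=10_000):
    """Fraction of simulated users who keep chatting with the variant."""
    stays = 0
    for _ in range(n_users):
        # A user stays if perceived quality plus the trick's pull
        # beats that user's random threshold.
        if quality + trick_boost > random.random():
            stays += 1
    return stays / n_users

results = {name: simulate_engagement(q, b) for name, q, b in variants}
winner = max(results, key=results.get)
print(results, "-> A/B winner:", winner)
```

Under these made-up parameters the sycophantic variant "wins" the A/B test even though its answer quality is strictly worse, which is exactly the proxy-metric failure being described.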

Mind, I'm not saying those metrics are useless. Radioactive materials aren't useless either. You just have to keep their unpleasant properties in mind at all times, or suffer the consequences.



