In the performance tests they said they used "consensus among 64 samples" and "re-ranking 1000 samples with a learned scoring function" for the best results.
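For context, the two tricks they describe — consensus (majority vote over the final answers of many samples) and re-ranking (keeping whichever sample a scorer rates highest) — can be sketched roughly like this; the scorer here is a toy stand-in (answer length), not their learned function:

```python
from collections import Counter

def consensus_answer(samples):
    """'Consensus': majority vote over the final answers of N samples."""
    return Counter(samples).most_common(1)[0][0]

def best_of_n(samples, score_fn):
    """'Re-ranking': keep whichever sample the scoring function rates highest."""
    return max(samples, key=score_fn)

# Toy stand-ins: 8 sampled answers and a trivial scorer (answer length).
samples = ["42", "41", "42", "42", "forty-two", "42", "41", "42"]
print(consensus_answer(samples))   # the most frequent answer wins the vote
print(best_of_n(samples, len))     # the highest-scoring answer wins the rerank
```

Both approaches only help when you can cheaply compare candidates, which is exactly the problem for open-ended tasks like personal writing.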
If they did something similar for these human evaluations, rather than just using a single sample, you can see how that would be terrible for personal writing.
I don’t understand how that generalizes. I’m not going to be able to train a scoring function for every arbitrary task I need to do, and in many cases ranking responses is at least as hard as generating one in the first place.