If it was just "noisy", you could compensate with scale. It's worse than that.
"Human preference" is incredibly fucking entangled, and we have no way to disentangle it and get rid of all the unwanted confounders. A lot of the recent "extreme LLM sycophancy" cases is downstream from that.
Ai is not a hype. We have started to actually do something with all the data and this process will not stop soon.
Aline the RL what is now happening through human feedback alone (thumbs up/down) is massive.