Some studies have shown that direct feedback loops do cause collapse, but many researchers argue that it's not a risk at real-world data scales.
In fact, a lot of the recent advances in the open-weight model space have come from training on synthetic data. At least 33% of the data used to train NVIDIA's recent Nemotron 3 Nano model was synthetic; they use it as a way to get high-quality agent capabilities without doing tons of manual work.
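To make the feedback-loop point concrete, here's a toy sketch (my own illustration, not from any of the studies mentioned) of what "collapse" means in the simplest possible setting: a Gaussian repeatedly refit to samples drawn from its own previous fit. With tiny training sets the spread tends to shrink toward zero within a few hundred generations; with much larger samples the same loop degrades far more slowly, which is roughly the real-world-scale counter-argument.

```python
# Toy illustration of a direct feedback loop: each "generation" is trained
# only on samples produced by the previous generation's fitted model.
# All numbers here are arbitrary and chosen to make the effect visible.
import numpy as np

rng = np.random.default_rng(42)

mu, sigma = 0.0, 1.0      # generation 0: the "real" data distribution
n_per_generation = 20     # deliberately tiny training set per generation

for gen in range(201):
    samples = rng.normal(mu, sigma, n_per_generation)  # synthetic training data
    mu, sigma = samples.mean(), samples.std()          # refit the next generation
    if gen % 40 == 0:
        print(f"generation {gen:3d}: mean={mu:+.4f}  std={sigma:.6f}")
```

With n_per_generation in the thousands instead of 20, the printed spread barely moves over the same number of generations, which is the scale argument in a nutshell.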
That's not quite the same thing, I think. The risk here is that the sources of training information vanish as well, not necessarily the feedback-loop aspect.
For example, all the information on the web could be said to be a distillation of human experience, and much of it ended up online because of discussions that happened during problem solving. Questions were asked of humans, and they answered with knowledge from the real world and years of experience.
If no one asks humans anymore and they just ask LLMs instead, then no new discussions between humans occur online, and that experience never gets syndicated in a way models can train on.
That is essentially the entirety of Stack Overflow's existence until now. You can pretty confidently predict that no new software experience will be put into Stack Overflow from now on. So what about new programming languages or technologies and all the nuances within them? Docs never have all the answers, so models will simply lack that nuanced information.
Sorry, I guess it's not very clear from my post: the data points aren't what's missing, it's the insights. An insight comes from the melding of a lifetime of experiences; you can't just stick a bunch of sensors on humans and reach the same insights. You could think of it as the expression of the latent space in our own brains. We're also not little one-way input boxes; we run world sims in our brains all day long.
If you were to frame human brains as their own world models, Stack Overflow was a very lossy distillation of those insights from our brains.
I don't think you reach those insights by simply piping in data from the world (also, that sounds expensive to do at any worthwhile scale).