> We will run out of additional material to train on This sounds a bit silly. Mo...

_0ffh · 2026-01-21T00:51:36 1768956696

You'd be surprised how quickly improvement of autoregressive language models levels off with epoch count (though, admittedly, one epoch is a LOT). Diffusion language models otoh indeed keep profiting for much longer, fwiw.

zozbot234 · 2026-01-21T09:33:54 1768988034

Does this also apply to LLM training at scale? I would be a bit surprised if it does, fwiw.

_0ffh · 2026-01-21T12:48:40 1768999720

Yup, as soon as data is the bottleneck and not compute, diffusion wins. Tested following the Chinchilla scaling strategy from 7M to 2.5B parameters.

https://arxiv.org/abs/2507.15857
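
For a rough feel of when data rather than compute becomes the bottleneck, here is a back-of-the-envelope sketch. It assumes the common Chinchilla-style rule of thumb of about 20 training tokens per parameter; the corpus size and the `epochs_needed` helper are made up for illustration and are not taken from the paper.

```python
# Back-of-the-envelope sketch, not the paper's experimental setup.
# ASSUMPTION: ~20 training tokens per parameter (Chinchilla-style rule of thumb).
TOKENS_PER_PARAM = 20

# ASSUMPTION: hypothetical fixed corpus of 1B unique tokens.
UNIQUE_TOKENS = 1e9


def epochs_needed(n_params: float,
                  unique_tokens: float,
                  tokens_per_param: float = TOKENS_PER_PARAM) -> float:
    """Passes over a fixed corpus needed to reach the rule-of-thumb token budget."""
    token_budget = n_params * tokens_per_param
    return token_budget / unique_tokens


# Model sizes spanning the 7M to 2.5B range mentioned above.
for n_params in (7e6, 1e8, 1e9, 2.5e9):
    epochs = epochs_needed(n_params, UNIQUE_TOKENS)
    print(f"{n_params:.1e} params -> {epochs:.2f} epochs over the corpus")
```

Once the epoch count climbs well past 1, the model is mostly seeing repeated data, which is the regime this comparison is about.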

pvab3 · 2026-01-20T23:31:13 1768951873

I'm just talking about text generated by human beings. You can keep retraining with more parameters on the same corpus.

https://proceedings.mlr.press/v235/villalobos24a.html

x-complexity · 2026-01-21T01:35:16 1768959316

> I'm just talking about text generated by human beings.

That in itself is a goalpost shift from

> > We will run out of additional material to train on

Where it is implied "additional material" === "all data, human + synthetic"

------

There's still some headroom left in the synthetic data playground, as cited in the paper linked:

https://proceedings.mlr.press/v235/villalobos24a.html ( https://openreview.net/pdf?id=ViZcgDQjyG )

"On the other hand, training on synthetic data has shown much promise in domains where model outputs are relatively easy to verify, such as mathematics, programming, and games (Yang et al., 2023; Liu et al., 2023; Haluptzok et al., 2023)."

With the caveat that translating this success outside of these domains is hit-or-miss:

"What is less clear is whether the usefulness of synthetic data will generalize to domains where output verification is more challenging, such as natural language."

The main bottleneck in this neck of the woods will be (X := how many additional domains can be made easily verifiable). So long as (the growth rate of X) >> (the rate at which training absorbs the resulting data), the road can be extended for a while longer.