> Diffusion models work differently. Instead of predicting text directly, they learn to generate outputs by refining noise, step-by-step. This means they can iterate on a solution very quickly and error correct during the generation process. This helps them excel at tasks like editing, including in the context of math and code.
This is deliberately unhelpful, as it raises the obvious question: why hasn't anyone else made a good text diffusion model in the years the technology has been available?
The answer is that, unlike latent diffusion for images, where intermediate states can stay fuzzy and imprecise right up until the final image is produced, text is made of discrete tokens, so every output has to be exact. Google is evidently using some secret sauce to work around that limitation, and is keeping it annoyingly close to its chest.
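For anyone wanting intuition for what "refining noise step-by-step" means with discrete tokens, here's a minimal toy sketch of one common approach (masked/discrete diffusion sampling), not Google's actual method, which isn't public. `toy_denoiser`, the vocabulary, and the schedule below are made-up placeholders; a real system uses a trained network and a learned noise schedule:

```python
import random

VOCAB = ["the", "cat", "sat", "on", "a", "mat", "dog", "ran"]
MASK = "<mask>"

def toy_denoiser(tokens):
    """Stand-in for a trained denoising model: propose a token and a
    confidence score for every position. A real model would condition
    on the prompt and on the tokens already filled in."""
    proposals = []
    for tok in tokens:
        if tok == MASK:
            proposals.append((random.choice(VOCAB), random.random()))
        else:
            proposals.append((tok, 1.0))  # keep tokens that are already committed
    return proposals

def sample(length=8, steps=4):
    tokens = [MASK] * length  # start from pure "noise": an all-mask sequence
    for step in range(steps):
        proposals = toy_denoiser(tokens)
        # Commit a growing fraction of the most confident proposals each pass;
        # low-confidence positions are re-masked and revised on later passes.
        keep = length * (step + 1) // steps
        ranked = sorted(range(length), key=lambda i: -proposals[i][1])
        tokens = [MASK] * length
        for i in ranked[:keep]:
            tokens[i] = proposals[i][0]
        print(f"step {step + 1}: {' '.join(tokens)}")
    return tokens

if __name__ == "__main__":
    sample()
```

The discreteness problem shows up in the commit step: unlike an image latent, a half-denoised token sequence can't be "slightly wrong", so the sampler has to decide which tokens to lock in at each pass.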
Thank you for sharing this. I'm amazed! Are there any known emergent abilities? I ran my evals and it seems to struggle in very similar ways to smaller transformer-based LLMs.
I heard it's been a difficult project to justify spending the research/compute time on at scale, because the models use an equivalent amount of compute for training and inference, but the work is more parallelizable. So it can require 5 times more compute units, which then get the work done 5 times faster. At Google scale, that meant the hard internal sell of justifying burning through $25 million worth of compute units in 1 day instead of $5 million each day for 5 days. Something like that.
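To make the arithmetic in that (admittedly second-hand) claim concrete, here's a quick sketch using the same hypothetical figures; the total spend comes out identical, and only the burn rate and wall-clock time change:

```python
# Illustrative arithmetic only, using the commenter's hypothetical figures.
daily_cost_serial = 5_000_000   # $5M of compute units per day
serial_days = 5

parallel_factor = 5                                         # 5x the units at once
parallel_days = serial_days / parallel_factor               # 1 day of wall-clock time
daily_cost_parallel = daily_cost_serial * parallel_factor   # $25M burned in that day

total_serial = daily_cost_serial * serial_days              # 25,000,000
total_parallel = daily_cost_parallel * parallel_days        # 25,000,000.0

print(total_serial, total_parallel)  # same total cost, very different burn rate
```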