> Diffusion models work differently. Instead of predicting text directly, they learn to generate outputs by refining noise, step-by-step. This means they can iterate on a solution very quickly and error correct during the generation process. This helps them excel at tasks like editing, including in the context of math and code.
This is deliberately unhelpful, as it raises the obvious question: why hasn't anyone else made a good text diffusion model in the years the technology has been available?
The answer is that, unlike latent diffusion for images, where intermediate states can stay fuzzy and imprecise right up until the final image is produced, text is made of discrete tokens, so every output has to be exact. Google is evidently using some secret sauce to work around that limitation, and is keeping it annoyingly close to its chest.
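For anyone wanting intuition for what "refining noise step-by-step" means with discrete tokens, here's a minimal toy sketch of one common approach (masked/discrete diffusion sampling), not Google's actual method, which isn't public. `toy_denoiser`, the vocabulary, and the schedule below are made-up placeholders; a real system uses a trained network and a learned noise schedule:

```python
import random

VOCAB = ["the", "cat", "sat", "on", "a", "mat", "dog", "ran"]
MASK = "<mask>"

def toy_denoiser(tokens):
    """Stand-in for a trained denoising model: propose a token and a
    confidence score for every position. A real model would condition
    on the prompt and on the tokens already filled in."""
    proposals = []
    for tok in tokens:
        if tok == MASK:
            proposals.append((random.choice(VOCAB), random.random()))
        else:
            proposals.append((tok, 1.0))  # keep tokens that are already committed
    return proposals

def sample(length=8, steps=4):
    tokens = [MASK] * length  # start from pure "noise": an all-mask sequence
    for step in range(steps):
        proposals = toy_denoiser(tokens)
        # Commit a growing fraction of the most confident proposals each pass;
        # low-confidence positions are re-masked and revised on later passes.
        keep = length * (step + 1) // steps
        ranked = sorted(range(length), key=lambda i: -proposals[i][1])
        tokens = [MASK] * length
        for i in ranked[:keep]:
            tokens[i] = proposals[i][0]
        print(f"step {step + 1}: {' '.join(tokens)}")
    return tokens

if __name__ == "__main__":
    sample()
```

The discreteness problem shows up in the commit step: unlike an image latent, a half-denoised token sequence can't be "slightly wrong", so the sampler has to decide which tokens to lock in at each pass.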
Thank you for sharing this. I'm amazed! Are there any known emergent abilities? I ran my evals and it seems to struggle in very similar ways to smaller transformer-based LLMs.
I heard it's been a difficult project to justify spending the research/compute time on at scale, because the models use an equivalent amount of compute for training and inference, but the work is more parallelizable. So it can require 5 times more compute units, which then get the work done 5 times faster. At Google scale, that meant the hard internal sell of justifying burning through $25 million worth of compute units in 1 day instead of $5 million each day for 5 days. Something like that.
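To make the arithmetic in that (admittedly second-hand) claim concrete, here's a quick sketch using the same hypothetical figures; the total spend comes out identical, and only the burn rate and wall-clock time change:

```python
# Illustrative arithmetic only, using the commenter's hypothetical figures.
daily_cost_serial = 5_000_000   # $5M of compute units per day
serial_days = 5

parallel_factor = 5                                         # 5x the units at once
parallel_days = serial_days / parallel_factor               # 1 day of wall-clock time
daily_cost_parallel = daily_cost_serial * parallel_factor   # $25M burned in that day

total_serial = daily_cost_serial * serial_days              # 25,000,000
total_parallel = daily_cost_parallel * parallel_days        # 25,000,000.0

print(total_serial, total_parallel)  # same total cost, very different burn rate
```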