
>Ones that the frontier labs have spent a lot of AI-specialized data, compute, labor and hours of R&D work on.

Granted, that's time and money, but it's an absolutely minuscule amount of human hours compared to the scraped data.

We know this for a fact because of parallelization: the scraped data is the work of hundreds of millions of people, versus the 20-100 people on a lab's team. Even if OpenAI's current team worked for their entire lifetimes, plus the lifetimes of their offspring and their offspring's offspring, they still wouldn't have made a dent in recreating that initial scraped training data.



This is like trying to apply "labor theory of value" to datasets. It doesn't work any better there than it does in economics in general.

It doesn't matter how many human hours went into making a Twitter shitpost. What matters is how much value it adds to a pre-training run, and how easily it can be replaced by another data source.

"Cheap data" has low training value and is easy to replace. Twitter shitposts are worthless except in aggregate. "Expensive data" is what has high training value and is hard to replace. Things like SFT traces, domain expert RLHF guidance, RLVR bits - that's what the "moat" is.



