
Couldn't you use the same argument to reach the absurd conclusion that the 7zip source code contains the vast majority of Harry Potter?

A decent control would be to compare it to similar prose that you know for a fact is not in the training data (e.g. because it was written afterwards).



I think the same argument would have to compare 7zip's compression to some other compression algorithm. Then we can say things like "7zip is a better/worse model of human writing". And that's probably a better way to talk about this as well.

You're right that a better baseline could be made using books not in the training set, to understand how much the model is learning prose in general and how much it is memorizing a specific book.
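A rough sketch of that baseline comparison in Python, using only the standard library. The function names and the two sample strings are hypothetical placeholders, not real book text, and `lzma` only stands in for 7zip's default LZMA method. The idea is to measure bits per character on a target text and on a control text with each compressor and compare them.

```python
import lzma
import zlib


def bits_per_char(text: str, compress) -> float:
    """Compressed size in bits divided by the number of characters."""
    if not text:
        raise ValueError("text must be non-empty")
    data = text.encode("utf-8")
    return 8 * len(compress(data)) / len(text)


def compare(target: str, control: str) -> dict:
    """Bits per character for each compressor on target and control text."""
    compressors = {"zlib": zlib.compress, "lzma": lzma.compress}
    return {
        name: {
            "target": bits_per_char(target, fn),
            "control": bits_per_char(control, fn),
        }
        for name, fn in compressors.items()
    }


# Placeholder inputs: swap in the book under test and a control text
# written after the training cutoff.
target = "The boy looked at the owl on the garden wall. " * 200
control = "A different story about a cat and a late train. " * 200

for name, scores in compare(target, control).items():
    print(name, scores)
```

A general-purpose compressor has never seen either text, so its target and control numbers should land close together for similar prose. If a model's per-character cost on the target is far below its cost on the control, that gap is the part that points to the specific book rather than to prose in general.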



