I did an interesting thing and looked at how well the Llama2 models could compress text. For example, I took the first chapter of the first Harry Potter book and, at each step, recorded the index of the 'correct' token in the model's ranked predictions. The original text compresses with 7zip (LZMA?) to about 14 kB; the Llama2-encoded indices compress to less than 1 kB. Then, of course, I can send that 1 kB file around and decode the original text. (Unless the model behaves differently on different hardware, which it probably does.)
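For the curious, the encode/decode loop was roughly the sketch below (using the Hugging Face transformers API; the model name and the details of the rank handling are illustrative, not exactly what I ran):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL = "meta-llama/Llama-2-7b-hf"  # the real run used 70B; 7B shown for illustration
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16)
    model.eval()

    def encode_ranks(text):
        # For each position, record where the true next token lands in the
        # model's probability-sorted vocabulary (0 = the model's top guess).
        ids = tokenizer(text, return_tensors="pt").input_ids[0]
        with torch.no_grad():
            logits = model(ids.unsqueeze(0)).logits[0]
        ranks = []
        for i in range(1, len(ids)):
            order = torch.argsort(logits[i - 1], descending=True)
            ranks.append((order == ids[i]).nonzero().item())
        return ids[0].item(), ranks

    def decode_ranks(first_id, ranks):
        # Reverse: at each step, take the token sitting at the recorded rank.
        ids = [first_id]
        for r in ranks:
            with torch.no_grad():
                logits = model(torch.tensor([ids])).logits[0, -1]
            order = torch.argsort(logits, descending=True)
            ids.append(order[r].item())
        return tokenizer.decode(ids)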
What I get from this is that Llama2 70B contains 93% of Harry Potter Chapter 1 within it. It's not 100% (which would mean no need to share the encoded indices at all), but it's still pretty significant. I want to repeat this with the entire text of some books; the example I picked isn't representative, since the text is available online on the official website.
While I don't disagree that these models seem to contain the ability to recreate copyrighted text, I don't think your conclusion holds. How well does zstd compress Harry Potter with a dictionary based on English prose? I think you'll get some impressive ratios, and I also think there's nothing infringing in this case.
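For concreteness, roughly what I mean (a sketch with the python-zstandard bindings; the prose corpus paths are placeholders for whatever English text you train the dictionary on):

    import zstandard

    # Train a shared dictionary on unrelated English prose (placeholder paths).
    prose_paths = ["corpus/austen.txt", "corpus/dickens.txt", "corpus/twain.txt"]
    samples = [open(p, "rb").read() for p in prose_paths]
    dict_data = zstandard.train_dictionary(110 * 1024, samples)  # ~110 kB codebook

    # Compress the target text against that dictionary.
    cctx = zstandard.ZstdCompressor(dict_data=dict_data, level=19)
    compressed = cctx.compress(open("hp_chapter1.txt", "rb").read())
    print(len(compressed))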
What it tells you is that 93% of the information is sufficiently shared with the rest of the English language such that it can be pulled out into a shared codebook. LZMA doesn't have a codebook, not really.
In other words, it's not that Llama2 contains 93% of Chapter 1; it's that only 7% of Chapter 1 is different enough from anything else to be worth encoding in its own right.
Couldn't you use the same argument to reach the absurd conclusion that the 7zip source code contains the vast majority of Harry Potter?
A decent control would be to compare it to similar prose that you know for a fact is not in the training data (e.g. because it was written afterwards).
I think by the same argument you'd have to compare 7zip's compression to some other compression algorithm. Then we can say things like "7zip is a better/worse model of human writing". And that's probably a better way to talk about this as well.
You're right that a better baseline could be made using books not in the training set, to understand how much the model is learning prose in general and how much it is learning a specific book.
This is a little confusing. You turned the text into indices? So numbers? Then compressed that? Or is the text as numbers, without any extra compression, only 1 kB?
The tokenizer the models use (SentencePiece) is more or less based on one way to do compression (BPE). It's not really clear what you're testing.
My reading is that at each generation step they ordered all possible next words by the probability assigned to them by the model and recorded the index of the true next word (so if the model was very good at predicting Harry Potter their indices would mostly be 0, 0, 0, ...).
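If that's right, it also explains the sub-1 kB result: the rank stream is mostly zeros, and long runs of zeros compress to nearly nothing. A toy illustration with Python's built-in lzma (the numbers are made up, just to show the shape of the data):

    import lzma

    # Made-up stand-in for per-token ranks: mostly top-1 hits, occasional misses.
    ranks = [0] * 930 + [1, 2, 5, 17, 1, 0, 3] * 10
    raw = bytes(min(r, 255) for r in ranks)   # crude one-byte-per-rank packing
    print(len(raw), len(lzma.compress(raw)))  # the long zero runs collapse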
I wonder what the loss would be for 'translated into Finnish'? Translations between just about any human languages will contain less than 100% of the original.