I did an interesting thing and looked at how well the Llama2 models could compress text. For example, I took the first chapter of the first Harry Potter book and, at each step, recorded the index of the 'correct' token in the model's ranked predictions. The original text compresses with 7zip (LZMA?) to about 14 kB; the Llama2-encoded indices compress to less than 1 kB. Then, of course, I can send that 1 kB file around and decode the original text. (Unless the model behaves differently on different hardware, which it probably does.)
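For the curious, the encode/decode loop was roughly the sketch below (using the Hugging Face transformers API; the model name and the details of the rank handling are illustrative, not exactly what I ran):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL = "meta-llama/Llama-2-7b-hf"  # the real run used 70B; 7B shown for illustration
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16)
    model.eval()

    def encode_ranks(text):
        # For each position, record where the true next token lands in the
        # model's probability-sorted vocabulary (0 = the model's top guess).
        ids = tokenizer(text, return_tensors="pt").input_ids[0]
        with torch.no_grad():
            logits = model(ids.unsqueeze(0)).logits[0]
        ranks = []
        for i in range(1, len(ids)):
            order = torch.argsort(logits[i - 1], descending=True)
            ranks.append((order == ids[i]).nonzero().item())
        return ids[0].item(), ranks

    def decode_ranks(first_id, ranks):
        # Reverse: at each step, take the token sitting at the recorded rank.
        ids = [first_id]
        for r in ranks:
            with torch.no_grad():
                logits = model(torch.tensor([ids])).logits[0, -1]
            order = torch.argsort(logits, descending=True)
            ids.append(order[r].item())
        return tokenizer.decode(ids)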
What I get from this is that Llama2 70B contains 93% of Harry Potter Chapter 1 within it. It's not 100% (which would mean no need to share the encoded indices at all), but it's still pretty significant. I want to repeat this with the entire text of some books; the example I picked isn't representative, since the text is available online on the official website.
While I don't disagree that these models seem to contain the ability to recreate copyrighted text, I don't think your conclusion holds. How well does zstd compress Harry Potter with a dictionary based on English prose? I think you'll get some impressive ratios, and I also think there's nothing infringing in this case.
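For concreteness, roughly what I mean (a sketch with the python-zstandard bindings; the prose corpus paths are placeholders for whatever English text you train the dictionary on):

    import zstandard

    # Train a shared dictionary on unrelated English prose (placeholder paths).
    prose_paths = ["corpus/austen.txt", "corpus/dickens.txt", "corpus/twain.txt"]
    samples = [open(p, "rb").read() for p in prose_paths]
    dict_data = zstandard.train_dictionary(110 * 1024, samples)  # ~110 kB codebook

    # Compress the target text against that dictionary.
    cctx = zstandard.ZstdCompressor(dict_data=dict_data, level=19)
    compressed = cctx.compress(open("hp_chapter1.txt", "rb").read())
    print(len(compressed))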
What it tells you is that 93% of the information is sufficiently shared with the rest of the English language such that it can be pulled out into a shared codebook. LZMA doesn't have a codebook, not really.
In other words, it's not that Llama2 contains 93% of Chapter 1; it's that only 7% of Chapter 1 is different enough from anything else to be worth encoding in its own right.
Couldn't you use the same argument to reach the absurd conclusion that the 7zip source code contains the vast majority of Harry Potter?
A decent control would be to compare it to similar prose that you know for a fact is not in the training data (e.g. because it was written afterwards).
I think by the same argument you'd have to compare 7zip's compression to some other compression algorithm. Then we can say things like "7zip is a better/worse model of human writing". And that's probably a better way to talk about this as well.
You're right that a better baseline could be made using books not in the training set, to understand how much the model is learning prose in general and how much it is learning a specific book.
This is a little confusing. You turned the text into indices? So numbers? Then compressed that? Or is the text as numbers, without any extra compression, only 1 kB?
The tokenizer the models use (SentencePiece) is more or less based on one way to do compression (BPE). It's not really clear what you're testing.
My reading is that at each generation step they ordered all possible next words by the probability assigned to them by the model and recorded the index of the true next word (so if the model was very good at predicting Harry Potter their indices would mostly be 0, 0, 0, ...).
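If that's right, it also explains the sub-1 kB result: the rank stream is mostly zeros, and long runs of zeros compress to nearly nothing. A toy illustration with Python's built-in lzma (the numbers are made up, just to show the shape of the data):

    import lzma

    # Made-up stand-in for per-token ranks: mostly top-1 hits, occasional misses.
    ranks = [0] * 930 + [1, 2, 5, 17, 1, 0, 3] * 10
    raw = bytes(min(r, 255) for r in ranks)   # crude one-byte-per-rank packing
    print(len(raw), len(lzma.compress(raw)))  # the long zero runs collapse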
I wonder what the loss would be for 'translated into Finnish'? Translations between just about any human languages will contain less than 100% of the original.