Ask HN: Do you think ML model weights are copyrightable?
22 points by ronsor on March 12, 2023 | 59 comments
With the leaking of Facebook's LLaMA model and the seemingly endless copyright drama surrounding Stable Diffusion (but oddly not DALL-E), I pose two questions:

1. Do you think model weights can actually be copyrighted?

2. Should they be?



1. No. The weights are not a work of authorship. Nobody takes any intentional action to make the weights be what they are. Only works of authorship are copyrightable. And a copyright comes into being as the property of an author... which, as I said, the weights don't have.

The closest thing to authorship that happens is curation of the training data, but that's too tenuous to support a copyright. For one thing, the curation choices aren't directly targeted to shape the weights themselves, only the output when a separate program runs inference on those weights. And even for the output, there's probably not enough direct control there to support much of a claim of authorship.

And the people building the models had better hope that's true, because if that curation is strong enough to support a copyright, then it's really hard to claim that the contributions from the training data aren't strong enough to make the model a derivative work of every single training item.

2. No. The last thing we need is more ways for people with a bit of capital to get vast sweeping IPR claims.

Also, isn't it about time HN got a real Markdown engine?


The idea of copyright is to protect the author’s specific personality in the expression of an idea while, at the same time, allowing for the free circulation of the idea itself.

An ML model has no personality. It is not a legal entity. Also, its outputs are in no way the direct expression of the personality of those who trained it.

It is precisely this that differentiates ML from regular code: with regular code, it’s the programmer who solves the problem, in a way that is just as personal as someone writing a novel or an essay. But with ML, the specific implementation is found by the software itself.

You can’t have it both ways: not having to write the solution yourself AND claiming personal ownership of the results.

(But of course, we’ll see what the courts actually say.)


Model weights are the result of a human creative process, and in no way less an expression of an idea than a coded program. Creative choices are made with regard to the training data and in writing and running the implementation.

Is the output of an ML model a creative work? I would argue that it also is. The real question, however, is whose work. Or rather, what proportion of the work was created by the original artists, and what proportion was created by the ML developers.

The question at the heart of this is similar to the one society had about samples in music. I don't think anyone today would deny that sampling can be a legitimate creative process.


Stable Diffusion is trained on 2.3 billion images. It would take a human ~220 years to decide whether each image should be included in the training set, assuming a decision takes 1 second and the human works 8 hours a day, every day of the year.
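
A quick back-of-the-envelope check of that figure (a Python sketch; the 1-second-per-image rate and the 8-hour day are the assumptions stated above):

  images = 2_300_000_000          # training set size cited above
  seconds_per_decision = 1        # assumed review time per image
  seconds_per_workday = 8 * 3600  # an 8-hour day
  workdays = images * seconds_per_decision / seconds_per_workday
  years = workdays / 365          # working every day of the year
  print(round(years))             # -> 219, i.e. roughly 220 years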

GPT is even worse in that regard: it was literally trained on the whole web plus other sources. I'd be surprised if a judge would follow the argument that the training data was assembled by creative choice rather than by raw crawling power. I'd argue the sheer size of the training data makes it impossible to claim a creative selection process for the data.


That’s precisely the funny thing with copyright:

If you create a work where you can clearly tell what the source of your inspiration was, because you stole from another source, it’s a violation of copyright. But if you create a work where you can’t tell what the sources of your inspiration are, because you stole from so many different sources, not only is it not a violation of copyright, it’s actually the creation of a new copyrighted work in its own right.

ML is short-circuiting this legal framework, because stealing from thousands of different authors, in a way that makes it impossible to tell the sources, can now be done with the press of a button.


It's a thing. Someone wrote it. Raw crawling power was created. Someone made the choice to use raw crawling power. This is all a creative act.

Different models do different things. If there was no creation involved, they would not.


> The idea of copyright is to protect the author’s specific personality in the expression of an idea while, at the same time, allowing for the free circulation of the idea itself.

Is that really the idea of copyright? I’m not sure it is.

The purpose of copyright is to allow Creators to create works and then make a living from them without others making copies and reselling them for their own profit.

For physical goods you don’t need anything special to ensure widgets aren’t resold infinitely, because resale of one leaves the seller without the item, and making new physical copies requires a lot of effort in materials, time, building factories, etc. so there’s a natural pressure/market forces to keep that kind of activity in check (and indeed we also have patent law to place other restrictions on this as well).

But with non-physical works that can be easily duplicated, the reseller can keep the original and sell effectively infinite copies.

With non-physical goods (i.e. ideas), you can’t remove them from someone’s mind after they’ve consumed them, so that’s where fair use and other balancing principles come in. Allowances are made so that as long as a new work made by another Creator isn’t significantly close to the original, copyright allows it, even if there’s room to argue that one is based off the other, because if you didn’t allow that, every part of human knowledge would be off-limits simply because someone else thought of it first.

So what’s the difference with MLs vs. a human brain? The dataset used to train the ML can be curated by a human to ensure restricted data doesn’t get in, and the data can be removed from the ML after training (such as by reverting to a previous version if bad data was added).

It then becomes a question of where the original (training) data came from. If the human made the choice to wholesale scrape the Internet, then there are clearly copyright violations, since copyright (in the US) automatically grants Creators “all rights” unless otherwise waived (such as by adding a CC or Open Source license), and the human made the choice to ignore that distinction. The fact that most tech companies treat everything posted on the Internet as free to use is going to bite them here. Only companies (like FB) that have full control of the data, and user agreements where they clearly state this, would be in the clear.

Ultimately this is going to be a collision of current practices where everyone treats online data like the Wild West, and is only getting away with the current state of things because bad behavior is so widespread that enforcement simply cannot keep up with the enormous scale of it.

The only code generating MLs that are safe to use are the ones trained using nothing but Open Source code, and even then the nuances of each type of OSS license need to be adhered to.


> The purpose of copyright is to allow Creators to create works and then make a living from them without others making copies and reselling them for their own profit.

Well, actually the stated purpose, in the US and many other places, is to do that as an instrumental subgoal. The underlying main goal is to encourage people to create and publish works, because those works existing is seen as good for the public.

If you say copyright should be extended to ML models for that reason (because it definitely doesn't cover them now), then that amounts to saying that public policy should be to encourage there being more of the models.

From what we've seen of how they're used and can be used, I'd say that it's not exactly obvious that creating more is a good policy to pursue.

> For physical goods you don’t need anything special to ensure widgets aren’t resold infinitely, because resale of one leaves the seller without the item,

Traditionally an argument against copyright...

> With non-physical goods (i.e. ideas), you can’t remove them from someone’s mind after they’ve consumed them, so that’s where fair use and other balancing principles come in.

Um, no. That sort of consideration might affect the definition people use for derivative works, but it's not the point behind things like fair use, and it had basically no role in the development of the concept. Fair use, like most limitations on copyright, is about the boundaries of what it's appropriate to let somebody control, and how tightly they should be allowed to control it, for whatever random reason. In the US, part of fair use is about trying to keep copyright law from stomping on freedom of speech. That's where all the stuff about transformative use and parody come from.

A lot of the rest of fair use is about what rights you as a copyright holder should yield when you sell somebody a copy of a work. There've been plenty of fair use controversies about photocopying pages directly out of books. And the result was that such direct copying can often be fair use.

But nobody, ever, has brought up the question of having to forget what you know in relation to fair use. That's more like the question of what constitutes a derivative work, but even there I don't think it's ever been a major argument.

Also, please stop Capitalizing Random Words like "Creator". We're not talking about gods here. We do not capitalize common nouns in English, and the ONLY place you're breaking that rule is for that specific word.


In 2016, we asked an actual IP lawyer the other question that has been fervently debated of late, “does the IP status of the training data taint the weights?”

He said “we have no idea, and we won’t until someone with a boatload of cash burns it in a lawsuit.”

So we might find out soon!


Did he give any sense of the best arguments that might be deployed on each side?

The non-copyright people will point to recipes. The weights are ideas, not expressions. A “simple set of directions” is uncopyrightable.

The copyright people will point to the ability to extract significant chunks of copyrighted material from the weights.

I find the non-copyright argument stronger.


All the model's weights are just a compressed representation of data it was trained on. Keep in mind it's open data, created by millions of people for free and placed on Stack Exchange, Reddit, Facebook, or Wikipedia. Thus, it is nonsense to speak about any kind of copyright wrt neural networks.


Weights aren't a compressed representation of the data it was trained on because 99.99% of it can't be recovered verbatim. Also, the fact that a piece of data is delivered to you free of cost doesn't mean its creators surrendered their copyright to you. This basically settles nothing.


> Weights aren't a compressed representation of the data it was trained on because 99.99% of it can't be recovered verbatim.

Cool. So a low quality JPEG isn't a compressed representation?


The law isn't settled on this, but whatever the final answer is, it will clearly not be resolved by relying on a poor analogy. A model is NOTHING like a JPEG. It's also not much like a human being learning. We are going to need to be adults and pass actual legislation speaking to this.


You completely missed the point of my comment.

The commenter was claiming that unless you can recover data "verbatim" an encoding cannot be considered compression.

My example was to illustrate the silliness of that point, not to draw a straight line comparison between model weights and a lossy image compression format.

As for legislation, I completely agree with you. There's real danger that, absent that, you'll just end up with legislation from the bench where judges attempt to shoehorn these issues into existing legal frameworks, and I suspect no one will be satisfied with the outcome.


Good luck suing me if I resized your photo to 1x1 resolution and posted it on my website. That's how little info is extracted from each image on average.
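
For a sense of scale, a rough back-of-the-envelope estimate (a Python sketch; the ~1 billion parameter count and the 32-bit storage are my assumptions for a Stable-Diffusion-class model, not figures from this thread):

  params = 1_000_000_000      # assumed rough parameter count of an SD-class model
  bytes_per_param = 4         # assuming weights stored as 32-bit floats
  images = 2_300_000_000      # training set size cited upthread
  bytes_per_image = params * bytes_per_param / images
  print(bytes_per_image)      # ~1.7 bytes per image - less than one uncompressed RGB pixel (3 bytes)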


Cool, can I also make up statistics?

How about we use actual examples:

Getty is suing Stability because the model has captured so much information about its training set that it reproduces Getty watermarks in its results:

https://arstechnica.com/tech-policy/2023/02/getty-sues-stabi...

Github's Co-pilot can be coaxed into reproducing entire chunks of copyrighted code:

https://twitter.com/docsparse/status/1581461734665367554

But sure, go ahead, keep trying to convince me these models (particularly once they get large enough) aren't compressed representations of information represented in the underlying training data.


  All the model's weights are just a compressed representation of data it was trained on
That's not remotely accurate. I'd argue it's not even understood what the weights are or how they relate to the training data, and what is understood is not at all trivial.


That's... just not true.

We know these models capture aspects of the training dataset, otherwise training wouldn't be a thing. That's the entire point.

We also know these models create a very compact representation of the data they are derived from.

So it is perfectly reasonable to think of these things as a compressed representation. Is it lossy? Yes, of course. But so what?


A valid model is f(all inputs imaginable)=2.

No weights involved. You can apply as much training as you like, but it's still just the value "2" as an output. There is no representation of the data.

ML models are a mix of two things:

1. Patterns defined by the individuals performing the model training. Commonly this is referred to as model architecture, but ultimately it's the cobbling together of various items to create a particular way of getting a signal from the data. (As a side note, thanks Statisticians and Computer Scientists for leading the way here!)

2. Model weights to minimize the uncertainty in those patterns. How that uncertainty is measured is part of (1).
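
A minimal sketch of that split (plain Python; the tiny linear "architecture" and the made-up data are my own illustration, not anything from the thread):

  # (1) A pattern chosen by a human: model the data as y = w*x + b
  # (2) Weights w, b chosen by minimizing squared error on the data
  xs = [0.0, 1.0, 2.0, 3.0]
  ys = [1.1, 2.9, 5.2, 6.8]
  w, b = 0.0, 0.0
  lr = 0.01
  for _ in range(5000):  # plain gradient descent
      dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
      db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
      w, b = w - lr * dw, b - lr * db
  print(w, b)  # the "weights": the loop, not a person, picked these exact numbers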


>Is it lossy? Yes, of course. But so what?

"But so what" is missing the entire point of intelligence. You as an intelligent creature only work because of this lossiness. Reality presents far too much information to process, so brains, especially in higher lifeforms, have evolved to filter out as much useless information as possible in the most energy-efficient way, while not losing important context.


Right, there's over/underfitting. Train SD on the same data too many times and it over-reproduces that data in its results. Just because they tailor it to get more nebulous results doesn't mean the data doesn't exist in some form.


Looking for "meaning" in mathematical representations is often not going to be productive. Even for really simple things this is true. Imagine you give me a series of points, I do a quadratic regression, and give you back f(x) = 3.74x^2 - 8.31x + 13.33 as a means of approximating your points and even predicting future points!

What does the 3.74 "mean"? For something this simple you can still create some smart-sounding explanations related to the curve, but even that quickly becomes impossible as you bump up the complexity a bit. The reality is that it doesn't really have any "meaning" - certainly not in terms of the input data. It's simply the coefficient on the squared term that gives an approximation of your input set. That's all. And it's exactly the same with the model weights.
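
To make that concrete, here is a minimal sketch of the kind of fit described above (Python with NumPy; the data points are invented purely for illustration):

  import numpy as np
  # A handful of made-up (x, y) points to approximate
  x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
  y = np.array([13.1, 8.9, 12.2, 22.7, 46.0])
  # Least-squares fit of a degree-2 polynomial: returns [a, b, c] for a*x^2 + b*x + c
  a, b, c = np.polyfit(x, y, deg=2)
  print(a, b, c)
  # The coefficients jointly approximate the whole point set; no single
  # coefficient "means" anything about any single input point.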


If there are fewer bits of weights than original data, and the weights + an algorithm can be used to reproduce the original data (even lossily), then it's absolutely a compression in the information theory sense. The degree of lossiness needed before it's not considered a copyright violation is what humans determine case-by-case in court, but that doesn't make it any less of a "compression".

Just like your neurons "compressing" the gazillions of atoms in the world you experience down to smaller representative models in your head.


Can’t wait until you learn that you can copyright a database. As in, a collection of data and its relations fixed in a tangible medium.

So yes, by current definition model weights are absolutely eligible for copyright.


The key is that it's a fixed representation. But if I pull your weights and create my own representation, it wouldn't surprise me if that was not considered a derivative work, in the same way I can publish someone else's recipe so long as I rewrite it.

In the end we'll see, but I wouldn't be at all surprised if courts ruled that the data itself was not protected.


Recipes are not databases. You can repost a recipe straight up. The recipe itself is not copyrightable. The blog post/cookbook/etc which has a recipe embedded could be. (Copyrighted works can contain uncopyrightable elements)


That's not at all true.

If I take a picture of a recipe from a cookbook and post it, that is absolutely a copyright violation, as the specific presentation is a copyrighted work.

This is because you can't copyright facts or procedures, but you can copyright specific representations/presentations of those facts or procedures.

If I take that recipe, type it out, and post my own version? Yeah, that's totally fine.

But I can't verbatim copy yours.


> If I take that recipes, type it out, and post my own version?

You probably need to make sure that you change any elements that aren't either common standards of how recipes are written ("preheat oven to 350 degrees"), or absolutely dictated by the requirement to convey the procedure. Which for a recipe probably just amounts to a paraphrase at most. But I don't think you can get away with copying every recipe absolutely verbatim.

The real problem, of course, is that people with money, and adherents of certain ideologies presently popular among people with both money and power, are pressing for the widest possible definition of what's creative in every possible area, because they fundamentally think that the best policy is that absolutely everything should be ownable somehow.

I suspect that the actual practical result will be that if, as should happen and very well may happen in some countries, courts decide that copyright law doesn't cover ML models, legislators will just extend copyright law until it does.


You didn’t “copy the recipe”

You copied the fixed representation of that recipe and reproduced it. Yes, we agree. That is copyright infringement. You do not have that right.

Edit: I see my use of “repost” was the source of the confusion. Yes, that isn’t allowed. You can “reproduce the recipe”. When I said repost I meant copy and paste the text.


Be careful: even copying and pasting the text can get you in trouble, depending on what's in that text (I suspect this is one of the reasons modern recipe websites include so much damn narrative crap, as it makes it easier to assert copyright ownership over the work). But I grant I'm kinda nitpicking at this point; I agree, I think we're on the same page. :)


I think whether you can "copyright a database" has a somewhat more complicated answer than you suggest, especially in the USA.

Was the "selection and arrangements" of the weights made by a human using at least some minimal level of creativity? I'm not really sure, it's an odd question. I'd lean toward no.

https://sco.library.emory.edu/research-data-management/publi...


The trouble with anyone trying to copyright their model weights is that it's likely to make legal arguments their model isn't a derivative work of copyrighted inputs harder to sustain. Successfully copyrighting model weights (like databases or maps or other non-creative representations of factual inputted data) would be a very favourable outcome for creators of NNs who exclusively use licensed or public domain content, but not so much for a lot of the existing state of the art ML models...


Sure. And there are companies now that are training models on licensed data and those models are def 100% copyrightable without question (it’s a database)


Not a lawyer.

1. EU yes / US not so easy. Under EU law, the content of a database can be copyrighted. I'd argue the weights of an ML model are like the content in a database. Under US law, the arrangement of data needs to be creative / original, and creative / original assumes a human expressing himself. Maybe the design of the ML model, the selection of training material, and the selection of hyperparameters justify a copyright claim on a database of model weights.

2. Yes, as resources were used to create it.


I think this has interesting overlap with the concept of illegal numbers [1].

[1] https://en.wikipedia.org/wiki/Illegal_number


The troll in me wants to get a tattoo with one of these numbers on my forehead just to see what the consequences could feasibly be.


Difficulty finding a job, getting dates, etc, etc


I was thinking more about legal consequences. Could I realistically be forced to undergo skin grafts? Legally - at least in my country - medical procedures without consent require the patient to be a danger to themselves or others, or it has to be an emergency. And even if the court said yes, would any medical professionals agree to do it? I kinda doubt it.

Would be an interesting court case though.


They're really long numbers.


No problem, just use a higher base or some smaller expression which computes to it.


Curious why it would be copyright vs patent. Weights are in essence part of an invention - I think the specific numbers would be pretty weird to copyright, but patenting the system that includes the model architecture and training process would offer a much more robust protection that's more along the lines of past physical inventions.

I'd be interested to look and see if there are any old analog computer patents, and how they are worded: whether they specify a list of resistor values, or focus more on architecture and tuning. I think that's a possible precedent to look to.


How much wiggle room is there with weights? Tweak a single weight, is that copyright infringement or a new weight set?


For small changes, lots!


> Do you think model weights can actually be copyrighted?

Under US law, the answer is clearly "no." US requires there to be, at minimum, a "spark of creativity." (Also, as far as legal definition is concerned, machines cannot possess any creativity.) If you try to start doing some creative legal argument to work around that, you're going to run into the issue that the same lines of legal thinking will conclude that you are infringing all of your training set... and trying to argue that infringing more or less everything ever made is somehow fair use is a truly uphill task.

You might be able to make a case for it under jurisdictions that follow the sweat-of-the-brow doctrine, but I know too little about jurisprudence in such jurisdictions to make any confident predictions.


The answer to (1) is plainly yes. Copyright is a human process. Things you think shouldn’t be copyrighted get copyright and vice versa.

Lawyers will consider how experts make decisions that affect and shape the (sometimes far-) downstream representation of a valuable thing: thus an mp3 recording, a compiled executable, and, yes, a neural network’s model weights are all going to be in scope of copyright discussions.

As for “should”…I don’t know. I feel like the model I train to synthesise my own voice should be copyrightable. Meanwhile I’m wary of titanic foundation models destroying entire markets in part because of a copyright moat.


An argument can be made that model weights are a "creative work", in the same way that a list of pixel intensities and RGB values is a creative work. Who knows whether the argument will hold up or not.

As with all IP, it's one thing to be granted intellectual property rights, and a completely different thing to actually go about enforcing those rights. Without access to the infringer's source code, it's challenging to make the case.


A list of pixel intensities and RGB values is only a creative work because it is a picture; it's the picture that's the creative work.

Lists of numbers as such are not generally copyrightable in the USA.

Note that "facts" aren't copyrightable in the US, regardless of how much effort or money was spent assembling them. Neither are algorithms or procedures as such.


I really love this question. I think most of us will answer seeing "through a glass, darkly", as it were.

I can't answer (1) as my opinion is just noise until it works through the legal system. I suspect they will be.

For (2), I think they should. You can't copyright the algorithm (recipe) but you can copyright the output (expression) of the recipe. On that note, this allows for open sourcing of models as well.


My opinion is that they should not be if they were created using public data that is itself not owned by the creator/trainer. Otherwise we have a new model of for profit piracy where huge companies exploit their capital advantage to dragnet the collective work of humanity and then claim ownership of the result.

If a model is trained on data that one owns the copyright to, it should be copyrightable.


1. If synthetically crafted DNA can be patented, then why wouldn't weights be eligible for the same protection? Both are kind of similar in function, I think…

2. Definitely no. Copyright should only apply to things entirely crafted by hand IMO.


Any information can be represented as a single number (albeit possibly a very large number). You can turn a movie into a number. Model weights are just numbers too. So yes, copyright laws should apply.
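
That mapping is easy to make explicit, for what it's worth (a Python sketch; the byte string below is just a stand-in for any file's contents):

  data = b"stand-in for a movie, a novel, or a file of model weights"
  as_number = int.from_bytes(data, "big")        # the whole thing as one (very large) integer
  restored = as_number.to_bytes(len(data), "big")
  assert restored == data                        # and back again, losslessly
  print(as_number)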


1. No

2. NO

Absolutely not. Software should not be copyrightable because math equations are not copyrightable. But besides that, ML Models end up as polynomials at the end of the day, and that literally cannot be copyrighted.


Any work of fiction is also just a number. See Gödel numbering, as used in Gödel's incompleteness theorems.


hprotagonist probably has the most correct answer, but IMO no to both. Mostly because I’m looking at it from the spirit of copyright law, not the letter. The spirit is to encourage the creation of art/books/etc. by allowing an artist/writer to make a living from their craft.

Not to mechanically create an index of facts about other art/writing, no matter its usefulness.


The weights don't do much without the model right? So shouldn't weights + model be considered together?


The weights are the model.

The small bits of standardish code to utilize them are not that material


If we could encode every permutation of weights, then it's like the Library of Babel for ML?

I guess we cannot copyright (GPU, CPU, algorithmic) effort in that case.

https://libraryofbabel.info/


I'd say that we'd need some more law to be certain, either some precedent ruling or explicit new legislation, but as of now some of the relevant and irrelevant arguments would be:

1) Purely mechanistic transformations aren't copyrightable, so if a model is determined to be that (it may be debated if it is or isn't), then it has no independent protection from copyright law other than any potential restrictions imposed by fragments of other copyrighted works included in that particular model.

2. On the other hand, even the tiniest bit of creativity is sufficient to make something copyrightable. Some work on ML definitely has some creativity; however, is the "creative part" fully within the code used for training, or is part of that creativity also embedded in the weights output by that code?

3) Facts about a copyrighted work - e.g. the word frequency statistics of a copyrighted book - do not violate the work's copyright and also are not copyrightable themselves, so at least some language models - the very simplest word n-gram models, which were used both not that long ago for statistical machine translation and also long, long ago in pre-computer dictionary work - aren't covered by copyright, just as any other collection of facts. Databases may have specific ('sui generis') extra protection granted by law, but that's a separate debate from general copyright.

4) In general, there are different things which often are conflated but in this complex situation the difference may be meaningful and even critical - e.g. (a) being a derived work of the training data; (b) violating the (limited, explicitly enumerated) exclusive rights of the author of that training data; and (c) being copyrightable itself - those are three separate things that might have different answers for the same type of model.

5) The "sweat of the brow" doctrine is rejected (at least in the USA): it doesn't matter how much effort and money it took to make something, so the effort/expense of training a model is not relevant to it 'deserving' protection and being copyrightable.

6) Not all creative creation is protected by copyright - the law lists very many protected types, but the rest are not protected unless courts get convinced that they fall under the umbrella of something explicitly listed in copyright law; for example, copyright applies to software because (and only because) program code is considered to be a type of literary work, which is one of the things listed in https://www.copyright.gov/title17/92chap1.html#102 or equivalent legislation elsewhere. Are model weights a type of literary work? That's a tricky question, and my opinion on it doesn't really matter; we'd need to hear a precedent-setting judge assert whether they are or not.

... and there's probably many more arguments that should be considered before even trying to answer the original questions. And I think that it's illustrative to see what all the megacorps with megapaid megalawyers are doing; it seems that they're writing all their legal documents from a perspective where models might not be copyrightable and trying to assert whatever conditions they want under some type of contract, since conditions properly agreed under contract law will apply no matter how the copyright law will be interpreted.



