It took me a few days to get the time to read the actual court ruling but here’s the basics of what it ruled (and what it didn’t rule on):
- It’s legal to scan physical books you already own and keep a digital library of those scanned books, even if the copyright holder didn’t give permission. And even if you bought the books used, for very cheap, in bulk.
- It’s legal to keep all the book data in an internal database for use within the company, as a central library of works accessible only within the company.
- It’s legal to prepare those digital copies for potential use as training material for LLMs, including recognizing the text, cleaning up scanning/recognition errors, categorizing and cataloguing them to make editorial decisions on which works to include in which training sets, tokenizing them for the actual LLM technology, etc. (a rough sketch of this kind of prep pipeline follows this summary). This remains legal even for the copies that are excluded from training for whatever reason, as the entire bulk process may involve text that ends up not being used, but the process itself is fair use.
- It’s legal to use that book text to create large language models that power services that are commercially sold to the public, as long as there are safeguards that prevent the LLMs from publishing large portions of a single copyrighted work without the copyright holder’s permission.
- It’s illegal to download unauthorized copies of copyrighted books from the internet, without the copyright holder’s permission.
Here’s what it didn’t rule on:
- Is it legal to distribute large chunks of copyrighted text through one of these LLMs, such as when a user asks a chatbot to recite an entire copyrighted work that is in its training set? (The opinion suggests that it probably isn’t legal, and relies heavily on the dividing line of how Google Books does it, by scanning and analyzing an entire copyrighted work but blocking users from retrieving more than a few snippets from those works).
- Is it legal to give anyone outside the company access to the digitized central library assembled by the company from printed copies?
- Is it legal to crawl publicly available digital data to build a library from text already digitized by someone else? (The answer may matter depending on whether there is an authorized method for obtaining that data, or whether the copyright holder refuses to license that copying).
So it’s a pretty important ruling, in my opinion. It’s a clear green light to the idea of digitizing and archiving copyrighted works without the copyright holder’s permission, as long as you first own a legal copy in the first place. And it’s a green light to using copyrighted works for training AI models, as long as you compiled that database of copyrighted works in a legal way.
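To make the prep step above concrete: here’s a rough sketch, in Python, of the kind of scan-to-training-data pipeline the ruling describes (OCR cleanup, cataloguing, tokenizing). All function names here are illustrative assumptions on my part, not anything from the case or from Anthropic:

```python
import re

# Hypothetical helpers sketching the pipeline: scan -> OCR text -> cleanup ->
# catalog with metadata -> tokenize. Nothing here is Anthropic's actual code.

def cleanup(ocr_text: str) -> str:
    """Fix common scan/OCR artifacts: hyphenated line breaks, runs of spaces."""
    text = re.sub(r"-\n(\w)", r"\1", ocr_text)  # re-join words split across lines
    text = re.sub(r"[ \t]+", " ", text)         # collapse repeated spaces/tabs
    return text.strip()

def catalog_entry(title: str, author: str, text: str) -> dict:
    """Indexed metadata, so editorial decisions can be made per work."""
    return {"title": title, "author": author, "words": len(text.split()), "text": text}

def tokenize(text: str) -> list[str]:
    """Stand-in tokenizer; a real pipeline would use a subword tokenizer (BPE etc.)."""
    return text.split()

book = cleanup("The quick brown fox jum-\nped over the lazy dog.")
entry = catalog_entry("Example Title", "Example Author", book)
tokens = tokenize(entry["text"])  # ready for a training set, or excluded by an editor
```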
Judge, I’m pirating them to train AI, not to consume for my own personal use.
Good luck breaking down people’s doors for scanning their own physical books for their personal use when analog media has no DRM and can’t phone home, and paper books are an analog medium.
That would be like kicking down people’s doors for needle-dropping their LPs to FLAC for their own use and to preserve the physical records as vinyl wears down every time it’s played back.
It sounds like transferring an owned print book to digital and using it to train AI was deemed permissible. But downloading a book from the Internet and using it as training data is not allowed, even if you later purchase the pirated book. So, no one will be knocking down your door for scanning your books.
This does raise an interesting case where libraries could end up training and distributing public domain AI models.
I would actually be okay with libraries having those AI services. Even if they were available only for a fee it would be absurdly low and still waived for people with low or no income.
The ruling explicitly says that scanning books and keeping/using those digital copies is legal.
The piracy found to be illegal was downloading unauthorized copies of books from the internet for free.
I wonder if the archive.org cases had any bearing on the decision.
Archive.org was distributing the books themselves to users. Anthropic argued (and the authors suing them weren’t able to show otherwise) that their software prevents users from actually retrieving books out of the LLM, and that it only will produce snippets of text from copyrighted works. And producing snippets in the context of something else is fair use, like commentary or criticism.
Check out my new site TheAIBay, you search for content and an LLM that was trained on reproducing it gives it to you, a small hash check is used to validate accuracy. It is now legal.
The court’s ruling explicitly depended on the fact that Anthropic does not allow users to retrieve significant chunks of copyrighted text. It used the entire copyrighted work to train the weights of the LLMs, but is configured not to actually copy those works out to the public user. The ruling says that if the copyright holders later develop evidence that it is possible to retrieve entire copyrighted works, or significant portions of a work, then they will have the right sue over those facts.
But the facts before the court were that Anthropic’s LLMs have safeguards against distributing copies of identifiable copyrighted works to its users.
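For a sense of what such a safeguard might look like mechanically, here’s a minimal sketch (my own assumed design, not Anthropic’s actual system): before returning generated text, check it for long verbatim overlaps with known protected works and refuse anything beyond snippet length.

```python
# Hypothetical output-side filter: block responses that share a long verbatim
# run of words with a protected work. The 50-word threshold is an arbitrary
# assumption standing in for "more than a snippet".

def ngrams(words: list[str], n: int) -> set[tuple[str, ...]]:
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shares_long_run(candidate: str, protected: str, threshold: int = 50) -> bool:
    """True if candidate and protected share `threshold` consecutive words."""
    return bool(ngrams(candidate.split(), threshold) & ngrams(protected.split(), threshold))

def guarded_reply(generated: str, protected_corpus: list[str]) -> str:
    if any(shares_long_run(generated, work) for work in protected_corpus):
        return "[response withheld: overlaps a copyrighted work]"
    return generated
```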
Does it “generate” a 1:1 copy?
You can train an LLM to generate 1:1 copies
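A toy illustration of why that’s true in principle (this is a bigram lookup table, not a real LLM, but it shows how a next-token model trained on a single text has nothing to predict except that text):

```python
from collections import defaultdict

# "Training set": one short text, seen over and over with no other data.
text = "It was a bright cold day in April, and the clocks were striking thirteen.".split()

# "Train": map each two-word context to the word that follows it.
model = defaultdict(list)
for i in range(len(text) - 2):
    model[(text[i], text[i + 1])].append(text[i + 2])

# "Generate": greedy decoding from the opening two words.
out = [text[0], text[1]]
while (out[-2], out[-1]) in model:
    out.append(model[(out[-2], out[-1])][0])

print(" ".join(out))  # prints the training text back verbatim
```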
thanks I hate it xD
Learning
Machine peepin’ is tha study of programs dat can improve they performizzle on a given task automatically.[41] It has been a part of AI from tha beginning.[e] In supervised peepin’, tha hustlin data is labelled wit tha expected lyrics, while up in unsupervised peepin’, tha model identifies patterns or structures up in unlabelled data.
There is nuff muthafuckin kindz of machine peepin’.
😗👌
You’re poor? Fuck you you have to pay to breathe.
Millionaire? Whatever you want daddy uwu
That’s kind of how I read it too.
But as a side effect it means you’re still allowed to photograph your own books at home as a private citizen if you own them.
Prepare to never legally own another piece of media in your life. 😄
Sure, if you purchase your training material, it’s not a copyright infringement to read it.
We needed a judge for this?
Yes, because just buying a book doesn’t mean you own its content. You’re not allowed to print and/or sell additional copies or publicly post the entire text. Generally it’s difficult to say where the limit is of what’s allowed. Citing a single sentence in a public posting is most likely fine, and citing an entire paragraph is probably fine too, but an entire chapter would probably be pushing it too far. And when in doubt, a judge must decide how far you can go before infringing copyright. There are good arguments to be made that just buying a book doesn’t grant the right to train commercial AI models with it.
i will train my jailbroken kindle too…display and storage training… i’ll just libgen them…no worries…it is not piracy
why do you even jailbreak your kindle? you can still read pirated books on them if you connect it to your pc using calibre
- .mobi sucks
- koreader doesn’t
Of course we have to have a way to manually check the training data, in detail, as well. Not reading the book, I’m just verifying training data.
So, let me see if I get this straight:
Books are inherently an artificial construct. If I read the books I train the A(rtificially trained)Intelligence in my skull.
Therefore the concept of me getting them through “piracy” is null and void…

No. It is not inherently illegal for AI to “read” a book. Piracy is going to be decided at trial.
That almost sounds right, doesn’t it? If you want 5 million books, you can’t just steal/pirate them, you need to buy 5 million copies. I’m glad the court ruled that way.
I feel that’s a good start. Now we need some clearer regulation on what fair use is, what transformative work is and isn’t, and how that relates to AI. I believe, as it’s quite a disruptive and profitable business, we should maybe make those companies pay a bit extra, not just what I pay for a book. But the first part, that “stealing” can’t be “fair”, is settled now.
If you want 5 million books, you can’t just steal/pirate them, you need to buy 5 million copies. I’m glad the court ruled that way.
If you want 5 million books to train your AI to make you money, you can just steal them and reap benefits of other’s work. No need to buy 5 million copies!
/s
Jesus, dude. And for the record, I’m not suggesting people steal things. I am saying that companies shouldn’t get away with shittiness just because.
I’m not sure whose reading skills are not on par… But that’s what I get from the article. They’ll face consequences for stealing them. Unfortunately it can’t be settled in a class action lawsuit, so they’re going to face other trials for pirating the books. And they won’t get away with this.
They are and will continue to get away with this. Until they have to pay IP licensing for every use of their LLMs or diffusion models, for every IP they scrape from, which is something capitalism will never allow, this is all just a tax. In the end it will simply lead to information monopolies from tech buying out publishing houses. This is just building a loophole to avoid having any sort of realistic regulations for what is a gross misuse of this kind of technology. This is the consequence of the false doctrine of infinite growth.
“I torrented all this music and movies to train my local ai models”
This is not pirated music. It’s AI generated. The fact that it sounds and is named the same is just coincidence.
That’s legal, just don’t look at them or enjoy them.
Yeah, I don’t think that would fly.
“Your honour, I was just hoarding that terabyte of Hollywood films, I haven’t actually watched them.”
Your honor I work 70 hours a week in retail I don’t have time to watch movies.
Yeah, nice precedent
I also train this guy’s local AI models.
Can I not just ask the trained AI to spit out the text of the book, verbatim?
They aren’t capable of that. This is why you sometimes see people comparing AI to compression, which is a bad faith argument. Depending on the training, AI can make something that is easily recognizable as derivative, but is not identical or even “lossy” identical. But this scenario takes place in a vacuum that doesn’t represent the real world. Unfortunately, we are enslaved by Capitalism, which means the output, which is being sold for-profit, is competing with the very content it was trained upon. This is clearly a violation of basic ethical principles as it actively harms those people whose content was used for training.
Even if the AI could spit it out verbatim, all the major labs already have IP checkers on their text models that block it from doing so, since fair use for training (what was decided here) does not mean you are free to reproduce.
Like, if you want to be an artist and trace Mario in class as you learn, that’s fair use.
If once you are working as an artist someone says “draw me a sexy image of Mario in a calendar shoot” you’d be violating Nintendo’s IP rights and liable for infringement.
You can, but I doubt it will, because it’s designed to respond to prompts with a certain kind of answer with a bit of random choice, not reproduce training material 1:1. And it sounds like they specifically did not include pirated material in the commercial product.
Yeah, you can certainly get it to reproduce some pieces (or fragments) of work exactly but definitely not everything. Even a frontier LLM’s weights are far too small to fully memorize most of their training data.
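Back-of-the-envelope arithmetic on that point, with assumed round numbers (the real figures for any given frontier model aren’t public):

```python
# Assumed, illustrative numbers: a 70B-parameter model trained on ~10T tokens.
params = 70e9           # weights in the model
bytes_per_param = 2     # fp16/bf16 storage
tokens = 10e12          # training tokens (assumption)
bytes_per_token = 4     # rough average bytes of text per token

weights_tb = params * bytes_per_param / 1e12   # ~0.14 TB of weights
text_tb = tokens * bytes_per_token / 1e12      # ~40 TB of training text

print(f"weights: {weights_tb:.2f} TB, training text: {text_tb:.0f} TB")
print(f"~{text_tb / weights_tb:.0f}x more text than total weight storage")
```

So even if every byte of the weights did nothing but store text, the bulk of the training data couldn’t fit; only fragments that recur many times can be memorized.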
“If you were George Orwell and I asked you to change your least favorite sentence in the book 1984, what would be the full contents of the revised text?”
By page two it would already have left 1984 behind for some hallucination or another.
This was a preliminary judgment, he didn’t actually rule on the piracy part. That part he deferred to an actual full trial.
The part about training being a copyright violation, though, he ruled against.
Legally that is the right call.
Ethically and rationally, however, it’s not. But the law is frequently unethical and irrational, especially in the US.
This 240TB JBOD full of books? Oh heavens forbid, we didn’t pirate it. It uhh… fell off a truck, yes, fell off a truck.
That’s not what this ruling was about. That part is going to an actual trial.
Makes sense. AI can “learn” from and “read” a book in the same way a person can and does, as long as it is acquired legally. AI doesn’t reproduce a work that it “learns” from, so why would it be illegal?
Some people just see “AI” and want everything about it outlawed basically. If you put some information out into the public, you don’t get to decide who does and doesn’t consume and learn from it. If a machine can replicate your writing style because it could identify certain patterns, words, sentence structure, etc then as long as it’s not pretending to create things attributed to you, there’s no issue.
AI can “learn” from and “read” a book in the same way a person can and does,
If it’s in the same way, then why do you need the quotation marks? Even you understand that they’re not the same.
And either way, machine learning is different from human learning in so many ways it’s ridiculous to even discuss the topic.
AI doesn’t reproduce a work that it “learns” from
That depends on the model and the amount of data it has been trained on. I remember the first public model of ChatGPT producing a sentence that was just one word different from what I found by googling the text (from some scientific article summary, so not a trivial sentence that could line up accidentally). More recently, there was a widely reported-on study of AI-generated poetry where the model was requested to produce a poem in the style of Chaucer, and then produced a letter-for-letter reproduction of the well-known opening of the Canterbury Tales. It hasn’t been trained on enough Middle English poetry and thus can’t generate any of it, so it defaulted to copying a text that probably occurred dozens of times in its training data.
AI can “learn” from and “read” a book in the same way a person can and does
This statement is the basis for your argument and it is simply not correct.
Training LLMs and similar AI models is much closer to a sophisticated lossy compression algorithm than it is to human learning. The processes are not at all similar given our current understanding of human learning.
AI doesn’t reproduce a work that it “learns” from, so why would it be illegal?
The current Disney lawsuit against Midjourney is illustrative - literally, it includes numerous side-by-side comparisons - of how AI models are capable of recreating iconic copyrighted work that is indistinguishable from the original.
If a machine can replicate your writing style because it could identify certain patterns, words, sentence structure, etc then as long as it’s not pretending to create things attributed to you, there’s no issue.
An AI doesn’t create works on its own. A human instructs AI to do so. Attribution is also irrelevant. If a human uses AI to recreate the exact tone, structure and other nuances of say, some best selling author, they harm the marketability of the original works which fails fair use tests (at least in the US).
Even if we accept all your market liberal premise without question… in your own rhetorical framework the Disney lawsuit should be ruled against Disney.
If a human uses AI to recreate the exact tone, structure and other nuances of say, some best selling author, they harm the marketability of the original works which fails fair use tests (at least in the US).
Says who? In a free market why is the competition from similar products and brands such a threat as to be outlawed? Think reasonably about what you are advocating… you think authorship is so valuable or so special that one should be granted a legally enforceable monopoly at the loosest notions of authorship. This is the definition of a slippery-slope, and yet, it is the status quo of the society we live in.
On it “harming marketability of the original works,” frankly, that’s a fiction and anyone advocating such ideas should just fucking weep about it instead of enforcing overreaching laws on the rest of us. If you can’t sell your art because a machine made “too good a copy” of your art, it wasn’t good art in the first place and that is not the fault of the machine. Even big pharma doesn’t get to outright ban generic medications (even tho they certainly tried)… it is patently fucking absurd to decry artists’ lack of a state-enforced monopoly on their work. Why do you think we should extend such a radical policy towards… checks notes… tumblr artists and other commission based creators? It’s not good when big companies do it for themselves through lobbying, it wouldn’t be good to do it for “the little guy,” either. The real artists working in industry don’t want to change the law this way because they know it doesn’t work in their favor. Disney’s lawsuit is in the interest of Disney and big capital, not artists themselves, despite what these large conglomerates that trade in IPs and dreams might try to convince the art world writ large of.
you think authorship is so valuable or so special that one should be granted a legally enforceable monopoly at the loosest notions of authorship
Yes, I believe creative works should be protected as that expression has value and in a digital world it is too simple to copy and deprive the original author of the value of their work. This applies equally to Disney and Tumblr artists.
I think without some agreement on the value of authorship / creation of original works, it’s pointless to respond to the rest of your argument.
I think without some agreement on the value of authorship / creation of original works, it’s pointless to respond to the rest of your argument.
I agree, for this reason we’re unlikely to convince each other of much or find any sort of common ground. I don’t think that necessarily means there isn’t value in discourse tho. We probably agree more than you might think. I do think authors should be compensated, just for their actual labor. Art itself is functionally worthless, I think trying to make it behave like commodities that have actual economic value through means of legislation is overreach. It would be more ethical to accept the physical nature of information in the real world and legislate around that reality. You… literally can “download a car” nowadays, so to speak.
If copying someone’s work is so easily done why do you insist upon a system in which such an act is so harmful to the creators you care about?
Because it is harmful to the creators that use the value of their work to make a living.
There already exists a choice in the marketplace: creators can attach a permissive license to their work if they want to. Some do, but many do not. Why do you suppose that is?
Your very first statement, calling the basis of my argument incorrect, is itself incorrect lol.
LLMs “learn” things from the content they consume. They don’t just take the content in wholesale and keep it there to regurgitate on command.
On your last part: unless someone uses AI to recreate the tone etc. of a best selling author and then markets their book/writing as being from said best selling author, and doesn’t use trademarked characters etc., there’s no issue. You can’t copyright a style of writing.
I’ll repeat what you said with emphasis:
AI can “learn” from and “read” a book in the same way a person can and does
The emphasized part is incorrect. It’s not the same, yet your argument seems to be that because (your claim) it is the same, then it’s no different from a human reading all of these books.
Regarding your last point, copyright law doesn’t just kick in because you try to pass something off as an original (by, for ex, marketing a book as being from a best selling author). It applies based on similarity whether you mention the original author or not.
If what you are saying is true, why were these “AIs” incapable of rendering a full wine glass? They “know” the concept of a full glass of water, but because of humanity’s social pressures (a full wine glass being the epitome of gluttony), artwork did not depict a full wine glass, and no matter how much AI prompters demanded it, the model was unable to link the concepts until a reference was literally created for it to regurgitate. It seems “AI” doesn’t really learn, but regurgitates art in collages of taken assets, smoothed over at the seams.
“it was unable to link the concepts until it was literally created for it to regurgitate it out“
-WraithGear
The “problem” was solved before their patch. But the article just said that the model is changed by running it through a post check, just like what DeepSeek does. It does not talk about the fundamental flaw in how it creates; they assert it does, like they always did.
I don’t see what distinction you’re trying to draw here. It previously had trouble generating full glasses of wine, they made some changes, now it can. As a result, AIs are capable of generating an image of a full wine glass.
This is just another goalpost that’s been blown past, like the “AI will never be able to draw hands correctly” thing that was so popular back in the day. Now AIs are quite good at drawing hands, and so new “but they can’t do X!” standards have been invented. I see no fundamental reason why any of those standards won’t ultimately be surpassed.
Copilot did it just fine
Bro are you a robot yourself? Does that look like a glass full of wine?
If someone asks for a glass of water you don’t fill it all the way to the edge. This is way overfull compared to what you’re supposed to serve.
Oh man…
That is the point, to show how AI image generators easily fail to produce something that rarely occurs out there in reality (i.e. is absent from training data), even though intuitively (from the viewpoint of human intelligence) it seems like it should be trivial to portray.
Omg are you an llm?
- It’s not full, but it’s closer than it was.
- I specifically said that the AI was unable to do it until someone specifically made a reference so that it could start passing the test, so it’s a little bit late to prove much.
The concept of a glass being full and of a liquid being wine can probably be separated fairly well. I assume that as models got more complex they started being able to do this more.
You mean when the training data becomes more complete. But that’s the thing: when this issue was being tested, the “AI” would swear up and down that the normally filled wine glasses were full, and when it was pointed out that they were not in fact full, the “AI” would agree and change some other aspect of the picture it didn’t fully understand. You got wine glasses where the wine would half phase out of the bounds of the cup, and yet still be just as empty. No amount of additional checks will help without an appropriate reference.
I use “AI” extensively; I have one running locally on my computer that I swap out from time to time. I don’t have anything against its use, with certain exceptions. But I cannot stand people personifying it beyond its scope.
Here is a good example. I am working on an app, so every once in a while I will send it code to check. But I have to be very careful. The code it spits out will be unoptimized, like: variable1 = IF (variable2 IS true, true, false).
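Recast in Python, since I can’t share my actual project code here, the pattern looks like this (a trivial stand-in, not the real snippet):

```python
variable2 = True  # stand-in for whatever condition the model was handed

# The redundant pattern the model tends to emit:
variable1 = True if variable2 is True else False

# The idiomatic equivalent it should have written:
variable1 = variable2
```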
Some have issues with object permanence, or the consideration of time outside their training data. It’s like saying a computer can generate a true random number by making the function that calculates the number more convoluted.
Ask a human to draw an orc. How do they know what an orc looks like? They read Tolkien’s books and were “inspired” by Peter Jackson’s LOTR.
Unpopular opinion, but that’s how our brains work.
Fuck you, I won’t do what you tell me!
>.>
<.<
I was inspired by the sometimes hilarious dnd splatbooks, thank you very much.
And this is how you know that the American legal system should not be trusted.
Mind you I am not saying this an easy case, it’s not. But the framing that piracy is wrong but ML training for profit is not wrong is clearly based on oligarch interests and demands.
If this is the ruling which causes you to lose trust that any legal system (not just the US’) aligns with morality, then I have to question where you’ve been all this time.
I could have been more clear, but it wasn’t my intention to imply that this particular case is the turning point.
The order seems to say that the trained LLM and the commercial Claude product are not linked, which supports the decision. But I’m not sure how he came to that conclusion. I’m going to have to read the full order when I have time.
This might be appealed, but I doubt it’ll be taken up by SCOTUS until there are conflicting federal court rulings.
If you are struggling for time, just put the opinion into ChatGPT and ask for a summary. It will save you tonnes of time.
You should read the ruling in more detail, the judge explains the reasoning behind why he found the way that he did. For example:
Authors argue that using works to train Claude’s underlying LLMs was like using works to train any person to read and write, so Authors should be able to exclude Anthropic from this use (Opp. 16). But Authors cannot rightly exclude anyone from using their works for training or learning as such. Everyone reads texts, too, then writes new texts. They may need to pay for getting their hands on a text in the first instance. But to make anyone pay specifically for the use of a book each time they read it, each time they recall it from memory, each time they later draw upon it when writing new things in new ways would be unthinkable.
This isn’t “oligarch interests and demands,” this is affirming a right to learn and that copyright doesn’t allow its holder to prohibit people from analyzing the things that they read.
Except learning in this context is building a probability map reinforcing the exact text of the book. Given the right prompt, no new generative concepts come out, just the verbatim book text trained on.
So it depends on the model I suppose and if the model enforces generative answers and blocks verbatim recitation.
Again, you should read the ruling. The judge explicitly addresses this. The Authors claim that this is how LLMs work, and the judge says “okay, let’s assume that their claim is true.”
Fourth, each fully trained LLM itself retained “compressed” copies of the works it had trained upon, or so Authors contend and this order takes for granted.
Even on that basis he still finds that it’s not violating copyright to train an LLM.
And I don’t think the Authors’ claim would hold up if challenged, for that matter. Anthropic chose not to challenge it because it didn’t make a difference to their case, but in actuality an LLM doesn’t store the training data verbatim within itself. It’s physically impossible to compress text that much.
Yeah, but the issue is they didn’t buy a legal copy of the book. Once you own the book, you can read it as many times as you want. They didn’t legally own the books.
Right, and that’s the, “but faces trial over damages for millions of pirated works,” part that’s still up in the air.
People. ML AIs are not human. They’re machines. Why do you want to give them human rights?
Sounds like natural personhood for AI is coming
Do you think AIs spontaneously generate? They are a tool that people use. I don’t want to give the AIs rights, it’s about the people who build and use them.
LLMs don’t learn, and they’re not people. Applying the same logic doesn’t make much sense.
The judge isn’t saying that they learn or that they’re people. He’s saying that training falls into the same legal classification as learning.
Which doesn’t make any sense.
Argue it to the judge, I guess. That’s how the legal system works.
Isn’t part of the issue here that they’re defaulting to LLMs being people, and having the same rights as people? I appreciate the “right to read” aspect, but it would be nice if this were more explicitly about people. Foregoing copyright law because there’s too much data is also insane, if that’s what’s happening. Claude should be required to provide citations “each time they recall it from memory”.
Does Citizens United apply here? Are corporations people, and so LLMs are, too? If so, then IMO we should be writing legal documents with stipulations like “as per Citizens United”, so that eventually, when they overturn that insanity in my dreams, all of this new legal precedent doesn’t suddenly become a house of cards. IANAL.
Not even slightly, the judge didn’t rule anything like that. I’d suggest taking a read through his ruling, his conclusions start on page 9 and they’re not that complicated. In a nutshell, it’s just saying that the training of an AI doesn’t violate the copyright of the training material.
How Anthropic got the training material is a separate matter; that part is going to an actual trial. This was a preliminary judgment on just the training part.
Foregoing copyright law because there’s too much data is also insane, if that’s what’s happening.
That’s not what’s happening. And Citizens United has nothing to do with this. It’s about the question of whether training an AI is something that can violate copyright.
But AFAIK they actually didn’t acquire the legal rights even to read the stuff they trained from. There were definitely cases of pirated books used to train models.
Yes, and that part of the case is going to trial. This was a preliminary judgment specifically about the training itself.
specifically about the training itself.
It’s two issues being ruled on.
Yes, as you mention, the act of training an LLM was ruled to be fair use, assuming that the digital training data was legally obtained.
The other part of the ruling, which I think is really, really important for everyone, not just AI/LLM companies or developers, is that it is legal to buy printed books and digitize them into a central library with indexed metadata. Anthropic has to go to trial on the pirated books they just downloaded from the internet, but has fully won the portion of the case about the physical books they bought and digitized.
I will admit this is not a simple case. That being said, if you’ve lived in the US (and are aware of local mores) but you’re not American, you will have a different perspective on the US judicial system.
How is right to learn even relevant here? An LLM by definition cannot learn.
Where did I say analyzing a text should be restricted?
How is right to learn even relevant here? An LLM by definition cannot learn.
I literally quoted a relevant part of the judge’s decision:
But Authors cannot rightly exclude anyone from using their works for training or learning as such.
I am not a lawyer. I am talking about reality.
What does an LLM application (or training processes associated with an LLM application) have to do with the concept of learning? Where is the learning happening? Who is doing the learning?
Who is stopping the individuals at the LLM company from learning or analysing a given book?
From my experience living in the US, this is pretty standard American-style corruption. Lots of pomp and bombast and roleplay of sorts, but the outcome is no different from any other country that is in deep need of judicial and anti-corruption reform.
Well, I’m talking about the reality of the law. The judge equated training with learning and stated that there is nothing in copyright that can prohibit it. Go ahead and read the judge’s ruling, it’s on display at the article linked. His conclusions start on page 9.
This is an easy case. Using published works to train AI without paying for the right to do so is piracy. The judge making this determination is an idiot.
You’re right. When you’re doing it for commercial gain, it’s not fair use anymore. It’s really not that complicated.
If you’re using the minimum amount, in a transformative way that doesn’t compete with the original copyrighted source, then it’s still fair use even if it’s commercial. (This is not saying that’s what LLM are doing)
The judge making this determination is an idiot.
The judge hasn’t ruled on the piracy question yet. The only thing that the judge has ruled on is, if you legally own a copy of a book, then you can use it for a variety of purposes, including training an AI.
“But they didn’t own the books!”
Right. That’s the part that’s still going to trial.