Meta Admits Use of 'Pirated' Book Dataset to Train AI

aprnu@feddit.ch · 10 months ago

howrar@lemmy.ca · 10 months ago

I’m pretty sure “admits” implies an attempt to hide it. They’ve explicitly said in the model’s initial publication that the training set includes Books3.

msgraves@lemmy.dbzer0.com · 10 months ago

ohno my copyright!!! How will the publisher megacorps now make a record quarter??? Think of the shareholders!

Whom@midwest.social · 10 months ago

That’s fine, just let the rest of us do the same.

Fushuan [he/him]@lemm.ee · 10 months ago

Actually, I’d prefer that piracy by individual users be considered fair use, but piracy by corporations not be. So them pirating is not fine, but us pirating should be.

jaden@lemmy.zip · 10 months ago

Yeah, too much of this thread is hypocritical. Either stuff should be free to copy or it shouldn’t.

bartolomeo@suppo.fi · 10 months ago

“We didn’t do it, and if we did it was fair use, and if it wasn’t progress will be hampered if rules and regulations are too strict.”

Beardedsausag3@kbin.social · 10 months ago

Nope. Yer can feck off Zuck! Yer ain’t comin’ aboard my ship! 🏴‍☠️

Metal Zealot@lemmy.ml · 10 months ago

In the age of the internet, nothing is truly yours.

Just look at NFTs.

maynarkh@feddit.nl · 10 months ago

How are NFTs relevant?

fiah@discuss.tchncs.de · 10 months ago

they aren’t, except perhaps as a counterexample of some dubious sort

onlinepersona@programming.dev · 10 months ago

They were supposedly anchors to claim ownership of things in the real world.

CC BY-NC-SA 4.0

buckykat [none/use name]@hexbear.net · 10 months ago

Marking all your comments CC BY-NC-SA is a good bit.

The point of NFTs (beyond the pyramid scheme) was to enforce artificial digital scarcity at the individual level.

rufus@discuss.tchncs.de · edit-2 10 months ago

AI is just overhyped. Every company invests millions into AI and all new products need to “have AI”. And then everybody also needs to file lawsuits. I mean, rightly so if Meta just pirated the books, but that’s not a problem with AI, it’s plain old piracy.

I was pretty sure OpenAI or Meta didn’t properly license gigabytes of books for use in their commercial products. Nice that Meta has now admitted to it. I hope their “Fair Use” argument works and in the future we can all “train AI” with our “research dataset” of 40 GB of ebooks. Maybe I’m even going to buy another hard disk and see if I can train an AI on 6 TB of TV series, all the Marvel movies and a broad MP3 collection.

Btw, there was never any denying it anyway. Meta wrote a scientific paper about their LLaMA model in March of last year. And they clearly listed all of their sources, including Books3. Other companies aren’t that transparent. And even less so as of today.

onlinepersona@programming.dev · 10 months ago

I do wonder how it shakes out. If the case establishes that a license must be acquired to use copyrighted material, then maybe the license I’m setting on my comments might land commercial AI companies in hot water too, which I’d love. Open-source AI models FTW

CC BY-NC-SA 4.0

jarfil@beehaw.org · 10 months ago

That license would require the AI model to only output content under the same license. Not sure if you realize, but allowing commercial use is part of the Open Source Definition:

https://opensource.org/osd/

Your content would just get filtered out from any training dataset.

As for going up against commercial companies… maybe you are a lawyer; otherwise, good luck paying the fees.