Reddit Will License Its Data to Train LLMs, So We Made a Firefox Extension That Lets You Replace Your Comments With Any (Non-Copyrighted) Text

Kory@lemmy.ml · 2 years ago

Olap@lemmy.world · 2 years ago

Reddit is almost certainly going to hand them your old comments even if you edit stuff. We’re pretty fucked. And if you think Lemmy is any different, guess again. We agreed to send our comments to everyone else in the fediverse, which includes plenty of bad actors, and the legal minefield essentially lets LLM makers do what they want. The good news is that LLMs are all crap, and people are slowly realising this.

maegul (he/they)@lemmy.ml · 2 years ago

I’ve been harping on about this for a while on the fediverse … private/closed/non-open spaces really ought to be thought about more. Fortunately, the Lemmy core devs are implementing local-only and private communities (local-only is already done, IIRC).

Yes, they introduce their own problems with discovery, gating, etc. But now that the internet’s “you’re the product” stakes have gone beyond what could have been construed as a reasonable transaction, “my attention on an ad … for a service”, to “my mind’s products to be aggregated into an energy-sucking, job-replacing AI … for a service” … well, it’s time to normalise closing that door on opportunistic tech capitalists.

SorteKanin@feddit.dk · 2 years ago

> And if you think Lemmy is any different, guess again

Lemmy is different, in that the data is not being sold to anyone. Instead, the data is available to anyone.

It’s kind of like open source software. Nobody can buy it, cause it’s open and free to be used by anyone. Nobody profits off of it more than anyone else - nobody has an advantage over anyone else.

Open source levels the playing field by making useful code available to everyone. You can think of comments and posts on the Fediverse in the same way - nobody can buy that data, because it’s open and free to be used by anyone. Nobody profits off of it more than anyone else and nobody has an advantage over anyone else (after all, everyone has access to the same data).

The only problem is if you’re okay with your data being out there and available in this way… but if you’re not, you probably shouldn’t be on the internet at all.

tabular@lemmy.world · 2 years ago

If the post is creative then it’s automatically copyrighted in many countries. That doesn’t stop people collecting it and using it to train ML (yet).

asret@lemmy.zip · 2 years ago

Copyright has little to say with regard to training models - it’s the published output that matters.

kernelle@lemmy.world · 2 years ago

> LLMs are all crap, and people are slowly realising this

LLMs have already changed more than anything else in the tech space for the last 10 years at least. I get what you’re trying to say, but that opinion will age like milk.

my_hat_stinks@programming.dev · 2 years ago

They’ll use old comments either way; using an up-to-date dataset means using a dataset already tainted by LLM-generated content. Training a model on its own output is not great.

Incidentally, this also makes Lemmy data less valuable: most of Lemmy’s popularity came after the rise of LLMs, so there’s no significant untainted data from before LLMs.

TORFdot0@lemmy.world · 2 years ago

LLMs are great for anything you’d trust to an 8-year-old savant.

They’re great for getting quick snippets of code using languages and methods that have great documentation. I don’t think I’d trust them for real work, though.

The Luddite