-
Called this a while back; this is why Reddit has such a high valuation.
-
Poisoning your data won’t do anything but give them more data. Do you seriously think reddit servers don’t track every edit you make to posts? You’d literally just be providing training data of original human text vs poisoned text. They’d still have your original post, and they have a copy of every time you edit it.
-
Whoever buys reddit will have sole access to one of the larger (I don’t think largest, though) pools of text training data on the internet, with fully licensed usage of it. I expect someone like Google, FB, MS, OpenAI, etc would pay big $$$ for that.
“But can’t people already scrape it?”
-
Well yes, but it’s at best legally dubious in some places
-
Scraping data off reddit only gets you the current versions of posts (which means you can get poisoned data, and can’t see deleted content), and it’s extremely slow… if you own the server you have first-class access to all posts in a database, including the originals, diffs of every time someone edited a post, and all the deleted posts too.
Think about if you perhaps wanted to train an AI to detect posts that require flagging for moderation, if you scrape reddit data, you can’t find deleted posts that got moderated…
But, if you have the raw original data, you 100% would have a list of every post that got deleted by mods and even the mod message on why it was deleted
You can surely see the value of such data, which only the owners of reddit are privy to atm…
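To make that concrete, here’s a purely hypothetical sketch of what that server-side view could look like. None of these names are Reddit’s real schema; they just illustrate the idea of keeping originals, edit history, deletions, and mod reasons that a scraper never sees.

```typescript
// Hypothetical data shapes, not Reddit's actual schema.
// A scraper only ever sees currentBody of non-deleted posts; the site
// owner can join all three and reconstruct everything.

interface Post {
  id: string;
  authorId: string;
  currentBody: string;                      // what a scraper would see
  deleted: boolean;                         // soft-delete flag, row never removed
  deletedBy?: "author" | "moderator" | "admin";
}

interface PostRevision {
  postId: string;
  revision: number;                         // 0 = the original submission
  body: string;                             // full text at that revision (a real system might store diffs)
  editedAt: Date;
}

interface ModerationAction {
  postId: string;
  moderatorId: string;
  action: "remove" | "approve" | "lock";
  reason?: string;                          // the mod message mentioned above
  takenAt: Date;
}
```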
Poison it by randomly posting copyrighted materials from big corps like Disney?
Once again the day is saved by piracy.🏴☠️
Bee Movie script. Millions of times
sigh
So the old trick of “search term +reddit” no longer will work then huh?
I’ve already made a habit of adding date limiters to web results from before LLMs were made public… The old SEO ‘optimization’ game was bearable, but the LLM spam just ruins so many search results with regurgitated garbage or teaspoon-deep information.
During the peak of the great purge, it was quickly becoming pointless. A lot of results were bringing up deleted posts. It took a while for search engines to catch up and start filtering a lot of those results out.
search term +reddit
tossing site:reddit.com before any search will guarantee all results come from reddit, if that’s what you’re looking for.
Ahhh my bad, that’s what I meant
Sounds like something a bunch of governments would be interested in. As you pointed out, you get to see why human mods made certain decisions. Could give you an edge in manipulation.
They’ve also got vote counts and breakdowns of who is making those votes. This data will be worth more for AI training than any similar volume of data other than maybe the contents of Wikipedia. Assuming they didn’t have it set up to delete the vote breakdowns when they archived threads.
Why are those breakdowns worth so much? Because they can be used to build profiles on each voter (including those who only had lurker accounts to vote with), so they can build AIs that know how to speak with the MAGA cult, Republicans who aren’t MAGA, liberals, moderates, centrists, socialists, communists, anarchists. Not only that, they’ll be able to look at how sentiments about various things changed over time with each of these groups, watch people move from one to another as their opinions evolved, see how someone pretends to be a member of whatever group (assuming they voted honestly and posted under their fake persona).
Oh, and also, all of that data is available through the fediverse, but it’s free to train on for anyone who sets up a server. Which makes me question whether the fediverse is a good thing, because even changing federation to opt-in instead of opt-out just covers whether your server accepts data from another. It’s always shared.
Open and private are on opposite sides of a spectrum. You can’t have both, best you can do is settle for something in the middle.
What if reddit also kept all deleted comments and posts? I’m sure there are shitloads of things people type out just to delete, thinking all the while it’ll never see the light of day.
I’d be surprised if they don’t keep all of that. There were a number of sites for looking at deleted posts. They’d just go and grab everything and compare what was still there with what wasn’t and highlight the stuff that wasn’t there anymore.
Which is also possible here, though the mod log reduces the need for it. But if someone is looking for posts people changed their mind about wanting anyone to see, deleting a post highlights it rather than hiding it, for anyone who is watching for that.
I think that site was unddit, but yes, those were posted then later deleted. I’m talking about just typing out a post or comment and never posting it, simply backing out of the page or hitting cancel. I’m not sure if any of that is stored on the site or just locally.
Oh, yeah, I’ve wondered the same myself. Hell, that might have been a motivation for removing the API access.
You would be able to tell by monitoring the network tab of the browser developer tools. If POST requests are being made (which they probably are, though I’m too lazy to go check) while you are typing a comment, they are most likely saving work-in-progress records for comments.
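For instance, a rough console sketch like the one below would log outgoing POST requests while you type. This assumes the page uses fetch(); if it uses XMLHttpRequest or sendBeacon instead, those would need separate patching.

```typescript
// Rough devtools-console sketch: log every outgoing POST while typing a draft.
// Assumption: the site sends drafts via fetch(); other transports aren't covered.
const originalFetch = window.fetch.bind(window);

window.fetch = async (input: RequestInfo | URL, init?: RequestInit) => {
  // Work out the method whether fetch was called with a URL/string or a Request object.
  const method = (init?.method ?? (input instanceof Request ? input.method : "GET")).toUpperCase();
  if (method === "POST") {
    const url =
      typeof input === "string" ? input :
      input instanceof URL ? input.href :
      input.url;
    console.log("POST while typing:", url, init?.body);
  }
  return originalFetch(input, init);
};
```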
They definitely do; it’s common for such systems to never actually delete anything, because storage is cheap. The post likely just gets flagged deleted = true, and the backend queries simply add WHERE [post].Deleted = False. So it looks deleted to the consumer, but it’s all saved and squirreled away on the backend.
It’s good to keep all this shit for both legal reasons (if someone posts illegal stuff then deletes it, you still can give it to the feds), as well as auditing (mods can’t just delete stuff to cover it up, the original still exists and admins can see it)
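A minimal sketch of that soft-delete pattern (the names are made up, not Reddit’s actual schema): nothing is ever removed, “delete” just flips a flag, and normal reads filter on it.

```typescript
// Minimal soft-delete sketch: rows are flagged, never removed.
interface StoredPost {
  id: string;
  body: string;
  deleted: boolean;
}

const posts: StoredPost[] = [];

// What the user experiences as "deleting" their post:
function deletePost(id: string): void {
  const post = posts.find(p => p.id === id);
  if (post) post.deleted = true;            // flag it, keep the row
}

// What ordinary readers (and scrapers) get back:
function visiblePosts(): StoredPost[] {
  return posts.filter(p => !p.deleted);     // i.e. WHERE deleted = false
}

// What an admin or auditor query could still get back:
function everythingIncludingDeleted(): StoredPost[] {
  return posts;                             // the "deleted" rows never left
}
```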
The problem (for most) was never that people’s public posts/comments were being used for AI training, it was that someone else was claiming ownership over them and being paid for access, and the resulting AI was privately owned. The fediverse was always about avoiding the pitfalls of private ownership, not privacy.
It’s exhausting constantly being “that guy,” but it really needs to be said: private ownership is at the core of nearly every major issue in the 21st century.
The same goes for piracy and copyright. The same goes for DMCA circumvention and format shifting content you own. The same goes for proprietary tech ecosystems and walled gardens. Private ownership is at the core of the most contentious practices in the 21st century, and if we don’t address it shit like this will just keep happening.
Which makes me question whether the fediverse is a good thing
I’d argue it’s good, because it means open source AI has a fighting chance, with FOSS data to train on without needing to fork over a morbillion dollars to Reddit’s owners.
Whatever use cases Reddit data can be trained for, FOSS researchers can repeat on Lemmy data and release free models that average joes can use on their own, without having to subscribe to shit like Microsoft Copilot and friends to stay relevant.
request your reddit data and they deliver you every comment you ever made
You’re not wrong. But on point #1, you’re just an asshole
In regards to the editing part, sure, I’m sure they can track your edit history. However, on a large scale, most edits are going to be to correct things. To determine if an edit was to poison the text, it would likely require manual review and flagging. There’s no way they’re going to sift through all of the edits on individual accounts to determine this, so it’s still worthwhile to do.
Although they could sidestep the issue a bit by simply comparing the changes between edits. Huge changes could just be discarded, while minor ones are fine.
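Something like this rough sketch, for example. The character-level similarity measure and the 0.8 threshold are arbitrary illustrations, not anything Reddit is known to use: keep small edits (typo fixes), fall back to the original for big rewrites.

```typescript
// Character-level edit distance (Levenshtein), iterative two-row version.
function editDistance(a: string, b: string): number {
  const prev = new Array<number>(b.length + 1);
  const curr = new Array<number>(b.length + 1);
  for (let j = 0; j <= b.length; j++) prev[j] = j;
  for (let i = 1; i <= a.length; i++) {
    curr[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    for (let j = 0; j <= b.length; j++) prev[j] = curr[j];
  }
  return prev[b.length];
}

// Similarity in 0..1: identical strings give 1, totally different strings approach 0.
function similarity(a: string, b: string): number {
  const maxLen = Math.max(a.length, b.length);
  return maxLen === 0 ? 1 : 1 - editDistance(a, b) / maxLen;
}

// Keep the edited text only if it stays close to the original; otherwise
// discard the edit and train on the original. Threshold is an arbitrary choice.
function chooseTrainingText(original: string, edited: string, threshold = 0.8): string {
  return similarity(original, edited) >= threshold ? edited : original;
}

// Example usage: a small typo fix passes, a wholesale rewrite gets discarded.
console.log(chooseTrainingText("the quick brown fox", "the quick brown foxes"));
console.log(chooseTrainingText("the quick brown fox", "lorem ipsum dolor sit amet"));
```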
-
I stopped using reddit after they dropped the bomb on the devs and I’m not a fan of the company.
I understand the hatred towards them, but this is definitely expected from a company like reddit, and any other social media for that matter. As users we must be aware that we don’t own the content on their platform.
I wouldn’t be surprised if the same story comes from Instagram tomorrow, though I suppose there will be a bigger outcry then.
Don’t know if it was against usage terms, but I have been able to get chatgpt answers written ‘in the style of’ various subreddits since the initial release (or perhaps the second release)
Honestly, over the last year since the great migration, the discussions on Lemmy have really grown and matured to the point where I don’t really see the value of reddit anymore.
For me there’s still value in the niche communities like r/rimworld and the like, but for everything else I’m firmly on Lemmy now
The real value of reddit for me lies in its cache of information contained in answers to questions from over the years. Whenever I’m looking online for a solution to a problem I’m trying to solve I’ll eventually add “reddit” to the search and I almost always find the answer that way.
The only use I have for Reddit anymore is for super niche information. For example, we were planning to go to Six Flags Discovery Kingdom today but it’s going to rain this afternoon. I checked their site and it said they were open 11-6; my BIL checked their app and at 11:30 it said they were currently closed. Found a Reddit post from someone confirming the park was closed for the weekend, and we didn’t waste a trip up. (As an extra annoying aside, apparently this information was posted on Six Flags’ Instagram page, because expecting a huge company to maintain a website is I guess just too much when they can offload it to social media.)
If user content belongs to the service provider, one would think that they are responsible for it.
It will get trained on some comment posts.
Let reddit die. Join Lemmy or /kbin. https://join-lemmy.org/ https://kbin.pub/
Um, we’re already here. You should be posting this on reddit instead.
I did that some months ago already, changed all my comment posts.
If they build an AI based on reddit content it will be the devil incarnate.
If you thought gpt4 was confidently incorrect wait until you see this next ai.
This
A devil incarnate that makes a lot of puns.
Can’t wait to hear the fan fiction the AI bot generates
So I need to run any comments I make to reddit by chatgpt before posting, it seems. I heard ai training ai leads to a poisoned data set.
For text, AI training AI wouldn’t be all that great for giving data sets a little poison ivy rubdown, because at the end of the day, the message is still moderated by a non-bot. I think a better way would be to write more unconventionally, but heavily contextually, so that if specific texts are ripped and tossed into the bot blender, they’ll make no sense without the context alongside them.
Slang, edge case wording, and verbing non verbs would likely do a lot of heavy lifting in that department.
Using LLMs for corporate communications - automatically-generated complaint responses, and the like - usually has swearing disabled, so if you want to fuck up their shit, be sure to express yourself with as many fucking swears as possible. Let’s get that shit into those cunts’ language models ASAP.
I bet they can scrape Lemmy content for free then. There are no legal mechanisms to prevent them from doing so.
Yes but i think reddit is many times more valuable than Lemmy. I just haven’t found the same level of very specific subreddits that have lots and lots of activity. Most of the traffic here is memes, politics, news and Linux lovin. On reddit if I needed to find a community about my local town it’s no problem and there are tens or hundreds of daily posts. The same community does exist on Lemmy but the last post was 6 months ago.
Hm but don’t you automatically own the stuff you create yourself, as long as you don’t consent to giving it away? I don’t know the terms and conditions of my Lemmy instance though.
When was the last time anyone read the T&Cs of a social media website?
They basically all have a clause to the effect that you grant them a permanent, irrevocable license to do whatever they want with anything you post.
You might still own the copyright to any content you produce, but by posting you’re granting them permission to do basically anything with it, including reselling it.
Well there’s copyright law. There’s already lawsuits happening so we’ll have to see how this shakes out.
But even if the AI companies lose the lawsuits, I think it’s likely they’ll still have access to content where the T&C of the site says they’re allowed to sell the data.
I’d rather my data that I’ve chosen to make public be free and accessible to all, than have it sold to the highest bidder.
Although I am not pleased that my content is packaged into a proprietary AI, and sold for money.
I think there are ways to opt-out of AI collection, at least for big companies. I wonder if it is implemented in Lemmy-UI and/or terms and conditions.
You opt-out so that there is less free training data, making Reddit’s data all the more valuable. I’m sure spez will be thankful.
on the other hand, if there’s troves of free data, that takes the upper hand from the companies that can afford paying for it, and gives open source a much better chance at staying competitive.
If you’re not paying for the product, you are the product.
And even when you pay for the product, you are the product, because capitalism requires infinite growth from a finite system.
I assume AI is training off the content here for free.
It’s all federated, so it would be strange if the bots didn’t scrape anything off it.
I was curious if a robots.txt equivalent exists for AI training data, and there were some solid points here:
If I go to your writing, I read it & learn from it. Your writing influences my future writing. We’ve been okay with this as long as it’s not a blatant forgery.
If a computer goes to your writing, it reads it & learns from it. Your writing influences its future writing. It seems we are not okay with this, even if it isn’t blatant forgery.
[AI at the moment is] different because the company is re-using your material to create a product they are going to sell. I’m not sure if I believe that is so different than a human employee doing the same thing.
https://news.ycombinator.com/item?id=34324208
I still think we should have the ability to opt out like we do with search engines and webcrawlers, but if the algorithm works ideally and learns but does not recycle content, is it truly any different from a factory of workers pumping out clones of popular series on Amazon? I honestly don’t know the answer to that.
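For what it’s worth, robots.txt itself is already being used as a de facto AI-training opt-out via user-agent tokens that some crawlers have published (GPTBot for OpenAI’s crawler, CCBot for Common Crawl, Google-Extended for Google’s training use), though as others point out below, honoring it is entirely voluntary. A naive sketch of how such an opt-out check works:

```typescript
// Example robots.txt blocking some published AI-crawler user agents while
// leaving ordinary crawlers alone. Compliance by the crawlers is voluntary.
const sampleRobotsTxt = `
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
`;

// Very naive check: is the given agent disallowed from the whole site?
// (Real robots.txt parsing handles wildcards, multiple agents per group, etc.)
function isFullyDisallowed(robotsTxt: string, agent: string): boolean {
  let inGroup = false;
  for (const raw of robotsTxt.split("\n")) {
    const line = raw.trim();
    if (line.toLowerCase().startsWith("user-agent:")) {
      inGroup = line.slice("user-agent:".length).trim().toLowerCase() === agent.toLowerCase();
    } else if (inGroup && line.toLowerCase() === "disallow: /") {
      return true;
    }
  }
  return false;
}

console.log(isFullyDisallowed(sampleRobotsTxt, "GPTBot"));        // true
console.log(isFullyDisallowed(sampleRobotsTxt, "SomeSearchBot")); // false
```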
This is kinda my take on it. However, the way I see it is that the AI isn’t intelligent enough yet to truly create something original. As such, right now AI is closer to being a tool than a being. Because of that, it somewhat bothers me that I’m being used to teach a tool. If I thought that companies like OpenAI were truly trying to create beings and not tools, then I’d feel differently.
It’s kinda nuanced, but a being can voluntarily determine whether or not something is copyright-infringing, understand why that might be an issue, and then decide whether or not to continue writing based on that. A tool can’t really do that. You can try and add filters to a tool to avoid writing copyrighted text, but that will have flaws and holes in it. A being who understands what it’s writing and what makes it plagiarism vs reference vs homage/inspiration/whatever is less likely to have those issues.
The problem is not the technology, the problem is the businesses and the people behind them.
These tools were made with the explicit purpose of taking content that they did not create, repurposing it, and creating a product. Throw all these conversations about intelligence and learning out the fucking window; what matters is what the thing does, and why it was created to do that thing.
Until we reach a point where there is some sort of AI out there that has any semblance of free will, and can choose not to learn if fed certain information, and choose not to respond to input given to it without being programmed not to respond, we are not talking about intelligence, we are talking about a tool. No matter how they dress it up.
Stop arguing about this on their terms, because they’re gaslighting the fuck out of you.
Afaik the OpenAI bot may choose to ignore it? At least that’s what another user claimed it does.
Robots.txt has always been ignored by some bots; it’s just a guideline, originally meant to prevent excessive bandwidth usage by search-indexing bots, and it’s entirely voluntary.
The Archive.org bot, for example, has completely ignored it since 2017.
Yes, but there’s no contract to give them legal cover if anyone ever does anything about all the content they steal.
And ya know what? Frankly, if AI is going to harvest all this shit, I’d rather fuckers like spez couldn’t get rich off it in the process. Granted I’m not happy the tech bros running these AI companies are getting rich with these fucking things, but I can at least take solace that there isn’t some asshole middleman making bank off the work and words of users they never paid a dime to.
Genuinely, why do spez and Reddit deserve to make money off anything we posted? Why does any social media site? They make the site, pay for the servers, maintain the apps, sure, and they can get compensation for that, I don’t see a problem there. But why does any social media company deserve to get rich when the only thing that makes their platform valuable is the people that post to it? Reddit didn’t even have paid mods, the community did all the work on the content of that site, why in the fuck do we tolerate these assholes making profit off it like this?
Intellectual property theft
This is sad to read because I agree with all of it (except the casual sexism).
why in the fuck do we tolerate these assholes making profit off it like this?
Look at this thread. People delete their posts on Reddit. Which means that they can no longer be scraped for free. Which means they are now exclusively available in Reddit’s archive. It’s not that people tolerate it. It’s that the first instinct of people who don’t tolerate it, is to make it worse. What can you do?
What do you mean? What legal cover do they need against what actions?
If the EU (or any other governments) decide that AI can’t legally train their models on information they don’t own or license (I don’t know how that would work legally but they talk about it), then this company that Reddit has sold access to could argue to lawmakers that they have license for all the content on Reddit. I don’t know that it would hold up, but I suspect it’s part of the company’s perceived value in this Reddit deal.
I barely post on reddit, just lurk, but this made me finally sign up for an account here.
deleted by creator
Welcome to lemmy.
I just Googled my reddit handle and it’s appalling that I found websites on the internet that archived a bunch of my posts from there, including pictures I posted. I’m not sure what I expected, but it’s still kinda annoying.
That’s been an issue for a long time. Fake “blogs” made of scraped reddit posts.
That’s why spez, the son of a bitch, “refreshed” the T&Cs very recently.
So nothing really new after all; half of reddit is repost bots anyway.
Lol, what do you think Lemmy is? There’s a lot of posts on here directly scraped from Reddit by bots.
Well of course, that’s the #1 reason why everyone stopped providing free-to-use APIs last year. Because AI companies were getting all that data for free via those APIs.
oh, really
Just going to replace all my old posts with AI generated poison data.