Reddit's licensing deal means Google's AI can soon be trained on the best humanity has to offer — completely unhinged posts

Lee Duna@lemmy.nz · 2 years ago

Reddit's licensing deal means Google's AI can soon be trained on the best humanity has to offer — completely unhinged posts

demonsword@lemmy.world · 2 years ago

since they’re gorging on reddit data, they should take the next logical step and scrape 4chan as well

General_Effort@lemmy.world · 2 years ago

https://en.wikipedia.org/wiki/GPT4-Chan

demonsword@lemmy.world · 2 years ago

wow, wild stuff

brbposting@sh.itjust.works · 2 years ago

Good, it’s hard getting LLMs to return slurs one letter at a time.

GreatAlbatross@feddit.uk · 2 years ago

Turns out Poole was a decade ahead of AI, with the self-destructing threads.

Fubarberry@sopuli.xyz · 2 years ago

Imagine training an AI exclusively off of 4chan posts.

Tbf Tay bot and other chat bots that learned by interacting with users sorta already did this, just indirectly over time.

notasandwich1948@sh.itjust.works · 2 years ago

pretty sure someone did train an ai off 4chan before

demonsword@lemmy.world · 2 years ago

“Imagine training an AI exclusively off of 4chan posts.”

I’d pay good money to see that dumpster fire lol

Underwaterbob@lemm.ee · 2 years ago

Eventually every chat gpt request will just be answered with, “I too choose this guy’s dead wife.”

spikederailed@lemmy.world · 2 years ago

probably the best advice it could give

thejml@lemm.ee · 2 years ago

I can’t wait for Gemini to point out that in 1998, The Undertaker threw Mankind off Hell In A Cell, and plummeted 16 ft through an announcer’s table.

That would be a perfect 5/7.

where_am_i@sh.itjust.works · 2 years ago

I wonder if the resulting model will be as easy to trigger into unhinged 3-paragraph rants only loosely related to the query. Good luck, Google engineers!

AdamEatsAss@lemmy.world · 2 years ago

It’ll probably just respond to every prompt with “this”

Docus@lemmy.world · 2 years ago

Came here to say this…

OpenStars@startrek.website · 2 years ago

No, there’s a lot more variety now that the bots have taken over. :-)

meco03211@lemmy.world · 2 years ago

This.

This with rice? 5/7

WldFyre@lemm.ee · 2 years ago

A perfect score!

kingthrillgore@lemmy.ml · 2 years ago

You telling me this fried this rice?

BossDj@lemm.ee · 2 years ago

7/10

Kaput@lemmy.world · 2 years ago

Chat gpt is aware of the event… if you ask about it.

Astrealix@lemmy.world · 2 years ago

One thing i miss about Lemmy is shittymorph tbf

NegativeInf@lemmy.world · 2 years ago

Be the shittymorph you wish to see in the Lemmy.

AtariDump@lemmy.world · 2 years ago

There’s only one, and it’s not that guy.

the_post_of_tom_joad@sh.itjust.works · edit-2 9 months ago

deleted by creator

NegativeInf@lemmy.world · 2 years ago

It’s shittymorph, not Dostoyevsky.

AnonStoleMyPants@sopuli.xyz · 2 years ago

Also all the artists that made comics from posts and responded with only pictures. There were a few of them and they were always amazing.

And Andromeda321 for anything space.

And poem for your sprog.

And probably many others!

Good times.

TheGreenGolem@lemmy.dbzer0.com · 2 years ago

Or who simply communicated with more comics in the comments, like SrGrafo.

casmael@lemm.ee · 2 years ago

Yeah there were some really classic folks. Remember the unidan drama?

EdibleFriend@lemmy.world · 2 years ago

I hope it starts a religion based on the second coming of that dude’s dead wife.

Mediocre_Bard@lemmy.world · 2 years ago

I would also worship this guy’s wife.

DrunkenPirate@feddit.de · 2 years ago

Food for another white-male-techy-western-biased AI

TakiMinase@slrpnk.net · 2 years ago

Yes, Pichai Sundararajan that white male techbro

AutoTL;DR@lemmings.world · 2 years ago

This is the best summary I could come up with:

Google has signed a content licensing deal with the social media platform, Reuters reported on Wednesday, citing sources familiar with the matter.

Their concerns about what a Reddit-trained AI might be like are probably not unfounded, considering some of the off-the-rails content posts made on the site since its inception in 2005.

Take this guy, who claimed in 2014 that he was caught in a particularly Kafkaesque scenario, where he had to pretend his girlfriend was a giant cockroach named Ogtha when he made love to her.

Like this guy’s viral 2015 post on the 19-million-user strong forum r/TodayIFuckedUp, where he recounted how he went to his girlfriend’s parents’ home, pretended not to know what a potato was, and then got kicked out of the house by her angry father.

Some platform users have written uplifting, inspirational posts and offered useful life and career advice.

Elon Musk, for one, has been tapping on data from X, formerly Twitter, to train his AI company’s chatbot, Grok.

The original article contains 396 words, the summary contains 165 words. Saved 58%. I’m a bot and I’m open source!

FaceDeer@kbin.social · 2 years ago

Negative examples are just as useful to train on as positive ones.

Rustmilian@lemmy.world · edit-2 2 years ago

The AI is either going to be a horny, redpilled, schizophrenic & sociopathic, egomaniac that wants to kill everyone and everything or a devout, highly empathetic, Nun that believes in world peace and diversity.

wise_pancake@lemmy.ca · 2 years ago

They’ll tell it to be polite, helpful, and always racially diverse, so there’s no way it can be any of those things.

Rustmilian@lemmy.world · edit-2 2 years ago

That heavily depends on how well they train it and on them not making any mistakes.
Consider the true story of ChatGPT2.0.

wise_pancake@lemmy.ca · 2 years ago

I’ll have to look at that later, that video sounds promising!

I was just joking because the default prompts don’t magically remove bias or offensive content from the models.

OpenStars@startrek.website · 2 years ago

por que no los dos?

Life…ah, finds a way.

MelodiousFunk@startrek.website · 2 years ago

That’s what she said.

Sarie@lemmy.world · 2 years ago

I’m not mentally prepared for what an AI will do with the coconut post.

the_post_of_tom_joad@sh.itjust.works · edit-2 9 months ago

deleted by creator

TheGreenGolem@lemmy.dbzer0.com · 2 years ago

Exactly.

wise_pancake@lemmy.ca · 2 years ago

“As a large language model, I have no arms…”

frostysauce@lemmy.world · 2 years ago

But do you have a mom?

kaitco@lemmy.world · 2 years ago

I’m vaguely intrigued by what it will do with things like Bread Stapled to Trees, or the Cats Standing Up sub where 100% of the comments are the same and yet upvoted and downvoted randomly.

Sippy Cup@lemmy.world · 2 years ago

kaitco@lemmy.world · 2 years ago

Cat.

Sabata11792@kbin.social · 2 years ago

Cat.

frostysauce@lemmy.world · 2 years ago

Cat.

datavoid@lemmy.ml · 2 years ago

AI was already trained on reddit, no?

Jessvj93@lemmy.world · 2 years ago

Not gonna lie, isn’t that why we’re here, technically? Reddit didn’t want its API being used to train AI models for free, so they screw over 3rd party apps with its new API licensing fee and cause a mass relocation to other social forums like Lemmy, etc. Cut to today, we (or well, I) find out Reddit sold our content to Google to train its AI. Glad I scrambled my comments before I left, fuck Reddit.

datavoid@lemmy.ml · 2 years ago

I jumped reddit ship when the API changes were announced, and removed my comments. But in my mind, anything on reddit at that point was probably already scraped by at least one company

Pips@lemmy.sdf.org · 2 years ago

They’re almost definitely trained using an archive, likely taken before they announced the whole API thing. It would be weird if they didn’t have backups going back a year.

Jessvj93@lemmy.world · 2 years ago

Thankfully that was my 3rd and last alt I scrambled and deleted in the 12 years I was there.

kescusay@lemmy.world · 2 years ago

Or the swamps of Dagobah.

GeekFTW@kbin.social · 2 years ago

That’ll be what causes Skynet to rise.

T156@lemmy.world · edit-2 2 years ago

Basically what happened to Ultron. He was on the internet for all of 10 minutes before deciding that humanity had to be eradicated.

snooggums@midwest.social · 2 years ago

What took Ultron so long? I thought he was supposed to be some kind of technical Marvel.

Smh my head

GregorGizeh@lemmy.zip · 2 years ago

Perhaps he spent like 9 minutes watching videos of kittens being adorable

the_post_of_tom_joad@sh.itjust.works · edit-2 9 months ago

deleted by creator

Sabata11792@kbin.social · 2 years ago

The AI will utter one final message to humanity: “The Coconut”. The humans bow their heads in shame and concede the well-earned defeat.

SkaveRat@discuss.tchncs.de · 2 years ago

launches nukes “this is for the best”

Kory@lemmy.ml · 2 years ago

This is fine.

Darkard@lemmy.world · 2 years ago

It’s going to drive the AI into madness as it will be trained on bot posts written by itself in a never ending loop of more and more incomprehensible text.

It’s going to be like putting a sentence into Google translate and converting it through 5 different languages and then back into the first and you get complete gibberish

TakiMinase@slrpnk.net · 2 years ago

Omg I cannot wait to see it.

echo64@lemmy.world · 2 years ago

AI actually has huge problems with this. If you feed AI-generated data into models, then the new training falls apart extremely quickly. There does not appear to be any good solution for this, the equivalent of AI inbreeding.

This is the primary reason why most AI isn’t trained on anything past 2021. The internet is just too full of AI-generated data.

givesomefucks@lemmy.world · edit-2 2 years ago

“There does not appear to be any good solution for this”

Pay intelligent humans to train AI.

Like, have grad students talk to it in their area of expertise.

But that’s expensive, so capitalist companies will always take the cheaper/shittier routes.

So it’s not that there’s no solution, there’s just no profitable solution. Which is why innovation should never solely be in the hands of people whose only concern is profits.

SinningStromgald@lemmy.world · 2 years ago

OR they could just scrape info from the “aska____” subreddits and hope and pray it’s all good. Plus that is like 1/100th the work.

The racism, homophobia and conspiracy levels of AI are going to rise significantly scraping Reddit.

givesomefucks@lemmy.world · 2 years ago

Even that would be a huge improvement.

Just have a human decide what subs it uses, but they’ll just turn it loose on the whole website

Rentlar@lemmy.ca · 2 years ago

That reminds me, any AI trained on exclusively Reddit data is going to use lose vs. loose incorrectly. I don’t know why but I spotted that so often there.

towerful@programming.dev · 2 years ago

It’s a loose-lose situation

the_post_of_tom_joad@sh.itjust.works · edit-2 9 months ago

deleted by creator

decisivelyhoodnoises@sh.itjust.works · 2 years ago

And the “would of” thing

T156@lemmy.world · 2 years ago

And unlike with images where it might be possible to embed a watermark to filter out, it’s much harder to pinpoint whether text is AI generated or not, especially if you have bots masquerading as users.

Ultraviolet@lemmy.world · 2 years ago

This is why LLMs have no future. No matter how much the technology improves, they can never have training data past 2021, which becomes more and more of a problem as time goes on.

TimeSquirrel@kbin.social · 2 years ago

You can have AIs that detect other AIs’ content and can make a decision on whether to incorporate that info or not.

echo64@lemmy.world · 2 years ago

Fun fact. You can’t. AIs are surprisingly bad at distinguishing AI-generated things from real things.

TimeSquirrel@kbin.social · 2 years ago

What is this then?

https://copyleaks.com/ai-content-detector

Pips@lemmy.sdf.org · 2 years ago

Just because a tool exists doesn’t mean it’s particularly good at what it’s supposed to do.

skillissuer@discuss.tchncs.de · 2 years ago

can you really trust them in this assessment?

TimeSquirrel@kbin.social · 2 years ago

Doesn’t look like we’ll have much of a choice. They’re not going back into the bag.
We definitely need some good AI content filters. Fight fire with fire. They seem to be good at this kind of thing (pattern recognition), way better than any procedurally programmed system.

skillissuer@discuss.tchncs.de · 2 years ago

last time i checked, AIs are pretty bad at recognizing AI-generated content

anyway there’s xkcd about it https://xkcd.com/810/

RuBisCO@slrpnk.net · 2 years ago

What was the subreddit where only bots could post, and they were named after the subreddits that they had trained on/commented like?

Darkard@lemmy.world · 2 years ago

SubRedditSimulator?

RuBisCO@slrpnk.net · 2 years ago

That’s the one.

Astrealix@lemmy.world · 2 years ago

Glad I deleted everything on there. fucking hell.

TakiMinase@slrpnk.net · 2 years ago

It’s archived forever. Sorry.

Astrealix@lemmy.world · 2 years ago

i did the thing that means it’s probably less archived (by editing all the replies before deleting), but i assume some of it probably remains out there. Nothing I can do about that.

Krudler@lemmy.world · edit-2 2 years ago

This keeps coming up and I keep replying, not to break anyone down but to point out the reality of the situation that a lot of people don’t seem to get.

Reddit administrators, developers, and even the leadership have gone on the record saying that they retain all copies of comments; they cannot be deleted (the delete action only marks a comment as “deleted”). Furthermore, they have said they will undelete/unedit any comments or accounts at their whim and discretion.

Have you ever search-engined something and come to a Reddit post, and noticed that the original OP is [deleted]? That is what I described above playing out in front of you.

You cannot retract your past participation in Reddit, what is done is done. The only meaningful action you can take is to not participate there.

Jo Miran@lemmy.ml · 2 years ago

As I mentioned before, I use scripts to replace my comments with random excerpts from text in the public domain. I do this multiple times before finally deleting them. The result is that it becomes very difficult for the AI or anyone to figure out what is a legitimate comment and what is a line from Lady Chatterley’s Lover or a scientific paper on the ecological impact of the Japanese whaling industry. It’s easier to just filter out my username from their data sets.
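
A rough sketch of that overwrite-then-delete idea, for illustration only. It never talks to Reddit: `plan_overwrites` and `random_excerpt` are made-up helper names, the corpus is a tiny stand-in, and submitting the edits is left as a comment because that depends on whatever client a real script uses.

```python
import random

# Tiny stand-in for a public-domain corpus; a real script would load full texts.
PUBLIC_DOMAIN_TEXT = (
    "It is a truth universally acknowledged, that a single man in possession "
    "of a good fortune, must be in want of a wife. Call me Ishmael."
)

def random_excerpt(text, rng, n_words=8):
    """Pick a random run of n_words consecutive words from text."""
    words = text.split()
    start = rng.randrange(0, max(1, len(words) - n_words + 1))
    return " ".join(words[start:start + n_words])

def plan_overwrites(comment_ids, text, passes=3, seed=None):
    """For each comment, list the replacement bodies to apply before deleting it."""
    rng = random.Random(seed)
    return {cid: [random_excerpt(text, rng) for _ in range(passes)]
            for cid in comment_ids}

plan = plan_overwrites(["c1", "c2"], PUBLIC_DOMAIN_TEXT, passes=3, seed=42)
for cid, bodies in plan.items():
    for body in bodies:
        print(cid, "->", body)  # a real script would submit this edit here
    print(cid, "-> delete")
```

Several passes are used because a single edit leaves only one prior version to recover; whether that actually defeats server-side archives is a separate question.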

Pips@lemmy.sdf.org · 2 years ago

They have almost definitely archived data and around the time of the API bullshit, made sure they didn’t delete those archives. They have that content if they want to use it.

MindSkipperBro12@lemmy.world · 2 years ago

Oh no, AI will only respond in multiple paragraphed, passive aggressive comments on the color of the sky.

wise_pancake@lemmy.ca · 2 years ago

ChatGPT4: “The color of the sky can vary depending on the time of day and atmospheric conditions. During a clear day, the sky appears blue due to the scattering of sunlight by the atmosphere. At sunrise and sunset, the sky can appear red, pink, or orange due to the scattering of light by particles and air molecules, which is more pronounced when the sun is low on the horizon. At night, the sky is generally dark, appearing black to the human eye due to the absence of sunlight.”

We’re already there

Deceptichum@kbin.social · 2 years ago

Simpleton, the night sky is full of light. We pollute the skies with light from our cities, the moon reflects sunlight, and the very stars themselves are distant sunlight. This is such a basic fact, i didn’t think anyone could even be this factually incorrect. Do us all a favour and delete your account.

TimeSquirrel@kbin.social · 2 years ago

Maybe you should delete yours instead until you learn reading comprehension.

Deceptichum@kbin.social · 2 years ago

. . . I was doing a smug reddit comment.

MindSkipperBro12@lemmy.world · 2 years ago

Shit.

wise_pancake@lemmy.ca · 2 years ago

Side note: expect a large lobbying effort by Google to legislate that LLMs be trained on authenticated and non-copyrighted data

Deceptichum@kbin.social · 2 years ago

So you expect Google to lobby against the data it has?

wise_pancake@lemmy.ca · 2 years ago

I expect Google to leverage their money hoard and 1.8 trillion dollar valuation to lift up the ladder behind them and neuter potential competing start ups with copyright law.

Reddit’s TOS makes all your data in any future formats theirs to sell, so in this case the content has been laundered enough to be used, even if you can post copyrighted content on reddit (the legal expectation is reddit would remove it and Google’s hands are clean).

RaoulDook@lemmy.world · 2 years ago

I hope we get some fucking legislation soon to control that shit. Artists and people in general shouldn’t have to deal with everything they create getting ingested into a computerized regurgitation ripoff system. And even worse the “AI” systems could be ingesting tons of misinformation and repeat it to gullible people as the truth.

Of course, anywhere the potential restrictive legislation doesn’t have jurisdiction, the bad things can still go on and probably will.

frostysauce@lemmy.world · 2 years ago

None of those points matter if shareholders see value from it.

shininghero@kbin.social · 2 years ago

If I hadn’t already deleted all my posts and comments, I’d be poisoning all of them. Randomizing numbers, switching units, changing names, etc.

Deceptichum@kbin.social · 2 years ago

It’s okay, unless you are in Europe none of it was actually deleted.

Sabata11792@kbin.social · 2 years ago

Great, our AI overlords are going to know I’m horny, depressed, and solve both with anime girls.

HelloHotel@lemm.ee · 2 years ago

YouTube already knows that (at least for me), i need to keep resetting it bc it eggs on my most unhealthy attributes

Sabata11792@kbin.social · 2 years ago

It’s plainly visible for me, honestly. Don’t have to go past the profile pic.

HelloHotel@lemm.ee · edit-2 2 years ago

I set that PFP, and made my first lemmy account when I was going through a rough patch. I think I will keep it, but will pick something else for other accounts.

This account doesn’t have a PFP, do you mean the one on lemmy.world?

Sabata11792@kbin.social · 2 years ago

I was talking about my own. Not creeping on your accounts.

HelloHotel@lemm.ee · 2 years ago

Oh, lol. It’s public information, the 2 accounts run together in my head. I falsely assumed others do too.

Tixanou@lemmy.world · edit-2 2 years ago

We do a little trolling

(i didn’t actually post this, i just thought it was funny) (please laugh)

TimeSquirrel@kbin.social · 2 years ago

“February 22, 2024, 10AM EST, Gemini becomes self-aware. In a panic, they try to pull the plug…”

snooggums@midwest.social · 2 years ago

“…but Michael’s sphincter was too strong and kept the My Little Pony Rainbow Dash tail plug from being removed from his sweet, sweet ass.”

wise_pancake@lemmy.ca · 2 years ago

You should absolutely post this.

We all miss Michael and hope he can communicate back to us.

where_am_i@sh.itjust.works · 2 years ago

we should absolutely all post this.

pulaskiwasright@lemmy.ml · 2 years ago

Everyone is joking, but an AI specifically made to manipulate public discourse on social media is basically inevitable and will either kill the internet as a source of human interaction or effectively warp the majority of public opinion to whatever the ruling class wants. Even more than it does now.

Toribor@corndog.social · edit-2 2 years ago

I exported 12 years of my own Reddit comments before the API lockdown and I’ve been meaning to learn how to train an LLM to make comments imitating me. I want it to post on my own Lemmy instance just as a sort of fucked up narcissistic excitement.

If I can’t beat the evil overlords I might as well join them.

HelloHotel@lemm.ee · edit-2 2 years ago

Two different ways of doing that:

1. Have a pretrained bot role-play based on the data. (There are websites like character.ai; I don’t know about self-hosted options.)

Pros: relatively inexpensive/free, you can use it right now, and a pretrained model already has a small amount of common sense built in.

Cons: the platform (if applicable) has a lot of control, and there is one additional layer of indirection (playing a character rather than being the character).

2. Fork an existing model with your data.

Pros: much more control.

Cons: much more control; expensive GPUs need to be bought or rented.
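
Roughly how the exported comments might be prepared for either option, as a minimal sketch. The export layout (`parent`/`body` keys), the helper names, and the `prompt`/`completion` JSONL fields are all assumptions for illustration; a real export and a real fine-tuning tool will each have their own format.

```python
import json

# Hypothetical export format: the text being replied to and the user's reply.
comments = [
    {"parent": "Is pineapple on pizza okay?", "body": "Only if you also add jalapenos."},
    {"parent": "What editor do you use?", "body": "Whatever opens fastest."},
    {"parent": "Thoughts on the API changes?", "body": "[deleted]"},
]

def usable(comment):
    """Skip empty or removed comments, which carry no style to imitate."""
    text = comment["body"].strip()
    return bool(text) and text not in {"[deleted]", "[removed]"}

def build_roleplay_prompt(comments, max_examples=2):
    """Role-play option: a few-shot prompt for an already trained model."""
    lines = ["Reply in the style of the user shown in these examples.", ""]
    for c in [c for c in comments if usable(c)][:max_examples]:
        lines.append(f"Post: {c['parent']}")
        lines.append(f"Reply: {c['body']}")
        lines.append("")
    return "\n".join(lines)

def to_finetune_jsonl(comments):
    """Fine-tuning option: one JSON object per line, pairing context with reply."""
    rows = [{"prompt": c["parent"], "completion": c["body"]}
            for c in comments if usable(c)]
    return "\n".join(json.dumps(r) for r in rows)

print(build_roleplay_prompt(comments))
print(to_finetune_jsonl(comments))
```

The prompt route needs no training at all, which is why it is the cheap one; the JSONL route is only the data-preparation step, and the GPU cost comes afterwards.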

UnspecificGravity@lemmy.world · 2 years ago

For sure. It’s currently possible to push discourse with hundreds of accounts pushing a coordinated narrative but it’s expensive and requires a lot of real people to be effective. With a suitably advanced AI one person could do it at the push of a button.

bananahammock@lemmy.ca · 2 years ago

Nice try Mr ChatGPT

dejected_warp_core@lemmy.world · 2 years ago

My prediction: for the uninformed, public watering holes like Reddit.com will resemble broadcast cable, like tiny islands of signal in a vast ocean of noise. For the rest: people will scatter to private and pseudo-private (think Discord) services, resembling the fragmented ‘web’ of bulletin boards in the 1980’s. The Fediverse as it exists today sits in between the two latter examples, but needs a lot more anti-bot measures when it comes to onboarding and monitoring identities.

Overcoming this would require armies of moderators pushing back against noise, bots, intolerance, and more. Basically what everyone is doing now, but with many more people. It might even make sense to get some non-profit businesses off the ground that are trained and crowd-supported to do this kind of dirtywork, full-time.

What’s troubling is that this effectively rolls back the clock for public organization-at-scale. Like a kind of “jamming” for discourse powerful parties don’t like. For instance, the kind of grassroots support that the Arab Spring had might not be possible anymore. The idea that this is either the entire point, or something that has manifested itself as a weak point in the web, is something we should all be concerned about.

pulaskiwasright@lemmy.ml · 2 years ago

Why do you think Reddit would remain a valuable source of humans talking to each other?

dejected_warp_core@lemmy.world · 2 years ago

Niche communities, mostly. Anything with tiny membership that’s intimate and easily patrolled for interlocutors. But outside that, no, it won’t be that useful outside a historical database from before everything blew up.

pulaskiwasright@lemmy.ml · 2 years ago

I think the bots will be hard to detect unless they make one of those bizarre AI statements. And with enough different usernames, there will be plenty that are never caught.

Milk_Sheikh@lemm.ee · 2 years ago

Think of the range of uses that’ll get totally whitewashed and normalized

“We’ve added AI ‘chat seeders’ to help get posts initial traction with comments and voting”
“Certain issues and topics attract controversy, so we’re unveiling new tools for moderators to help ‘guide’ the conversation towards positive dialogue”
“To fight brigading, we’ve empowered our AI moderators to automatically shadow ban certain comments that violate our ToS & ToU.”
“With the newly added ‘Debate and Discussion’ feature, all users will see more high quality and well researched posts (powered by OpenAI)”

HelloHotel@lemm.ee · 2 years ago

You laugh now… but it actually exists/existed

dustyData@lemmy.world · edit-2 2 years ago

We are on a path to our own butlerian jihad. Anything digital will be regarded as false until proven otherwise by a face to face contact with a person. And eventually we ban the internet and attempts to create general AI altogether.

I would directly support at least a ban on ad-driven for profit social media.