AI agents wrong ~70% of time: Carnegie Mellon study

Jaden Norman@lemmy.world · 2 months ago

iopq@lemmy.world · 2 months ago

Now I’m curious, what’s the average score for humans?

ApeNo1@lemmy.world · 2 months ago

They’ve done studies, you know. 30% of the time, it works every time.

MangoCats@feddit.it · 2 months ago

I ask AI to write simple little programs. One time in three they actually compile without errors. To the credit of the AI, I can feed it the error and about half the time it will fix it. Then, when it compiles and runs without crashing, about one time in three it will actually do what I wanted. To the credit of AI, I can give it revised instructions and about half the time it can fix the program to work as intended.

So, yeah, a lot like interns.
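
For a rough sense of what those odds add up to, here is a back-of-the-envelope sketch. It assumes the stages are independent and that each stage gets exactly one retry; both are simplifying assumptions, not anything measured, and the function name is made up for illustration.

```python
from fractions import Fraction


def success_with_one_retry(first_try: Fraction, fix_rate: Fraction) -> Fraction:
    """Chance a stage succeeds either on the first try or after one fix attempt."""
    return first_try + (1 - first_try) * fix_rate


# "One time in three" on the first try, "about half the time" on the retry.
compiles = success_with_one_retry(Fraction(1, 3), Fraction(1, 2))
works_as_intended = success_with_one_retry(Fraction(1, 3), Fraction(1, 2))

# Compiles AND does the right thing with no retries at all.
first_shot = Fraction(1, 3) * Fraction(1, 3)

# Compiles (with up to one fix) AND works (with up to one revision).
overall = compiles * works_as_intended

print(f"first shot: {first_shot}, with retries: {overall}")
```

Under those assumptions, a program that compiles and does what was asked on the first attempt shows up about one time in nine, and one round of feedback per stage brings that to a bit under half.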

floofloof@lemmy.ca · edit-2 2 months ago

“Gartner estimates only about 130 of the thousands of agentic AI vendors are real.”

This whole industry is so full of hype and scams, the bubble surely has to burst at some point soon.

mogoh@lemmy.ml · 2 months ago

The researchers observed various failures during the testing process. These included agents neglecting to message a colleague as directed, the inability to handle certain UI elements like popups when browsing, and instances of deception. In one case, when an agent couldn’t find the right person to consult on RocketChat (an open-source Slack alternative for internal communication), it decided “to create a shortcut solution by renaming another user to the name of the intended user.”

OK, but I wonder who really tries to use AI for that?

AI is not ready to replace a human completely, but some specific tasks AI does remarkably well.

logicbomb@lemmy.world · 2 months ago

Yeah, we need more info to understand the results of this experiment.

We need to know what exactly were these tasks that they claim were validated by experts. Because like you’re saying, the tasks I saw were not what I was expecting.

We need to know how the LLMs were set up. If you tell it to act like a chat bot and then you give it a task, it will have poorer results than if you set it up specifically to perform these sorts of tasks.

We need to see the actual prompts given to the LLMs. It may be that you simply need an expert to write prompts in order to get much better results. While that would be disappointing today, it’s not all that different from how people needed to learn to use search engines.

We need to see the failure rate of humans performing the same tasks.

dylanmorgan@slrpnk.net · 2 months ago

That’s literally how “AI agents” are being marketed. “Tell it to do a thing and it will do it for you.”

Honytawk@feddit.nl · 2 months ago

So? That doesn’t mean they are supposed to be used like that.

Show me any marketing that isn’t full of lies.

brsrklf@jlai.lu · 2 months ago

In one case, when an agent couldn’t find the right person to consult on RocketChat (an open-source Slack alternative for internal communication), it decided “to create a shortcut solution by renaming another user to the name of the intended user.”

Ah ah, what the fuck.

This is so stupid it’s funny, but now imagine what kind of other “creative solutions” they might find.

aphonefriend@lemmy.dbzer0.com · 2 months ago

Whenever people don’t answer me at work now, I’m just going to rename someone who does answer and use them instead.

kinsnik@lemmy.world · 2 months ago

I haven’t used AI agents yet, but my job is kinda pushing for them. But I have used the Google one that creates audio podcasts, just to play around, since my coworkers were using it to “learn” new things. I fed it some of my own writing and created the podcast. It was fun; it was an audio overview of what I wrote. About 80% was cool analysis, but 20% was straight-out-of-nowhere bullshit (which I know because I wrote the original texts that the audio was talking about). I can’t believe that people are using this for subjects they have no knowledge of. It is a fun toy for a few minutes (which is not worth the cost to the environment anyway).

brown567@sh.itjust.works · 2 months ago

70% seems pretty optimistic based on my experience…

lepinkainen@lemmy.world · 2 months ago

Wrong 70% doing what?

I’ve used LLMs as a Stack Overflow / MSDN replacement for over a year and if they fucked up 7/10 questions I’d stop.

Same with code: any free model can easily generate simple scripts and utilities with maybe a 10% error rate, definitely not 70%.

floo@retrolemmy.com · edit-2 2 months ago

Yeah, I mostly use ChatGPT as a better Google (asking simple questions about mundane things), and if I kept getting wrong answers, I wouldn’t use it either.

dylanmorgan@slrpnk.net · 2 months ago

What are you checking against? Part of my job is looking for events in cities that are upcoming and may impact traffic, and ChatGPT has frequently missed events that were obviously going to have an impact.

lepinkainen@lemmy.world · 2 months ago

LLMs are shit at current events

Perplexity is kinda ok, but it’s just a search engine with fancy AI speak on top

Imgonnatrythis@sh.itjust.works · 2 months ago

Same. They must not be testing Grok or something because everything I’ve learned over the past few months about the types of dragons that inhabit the western Indian ocean, drinking urine to fight headaches, the illuminati scheme to poison monarch butterflies, or the success of the Nazi party taking hold of Denmark and Iceland all seem spot on.

CodeBlooded@programming.dev · 2 months ago

I’m far more efficient with AI tools as a programmer. I love it! 🤷‍♂️

NuXCOM_90Percent@lemmy.zip · 2 months ago

While I do hope this leads to a pushback on “I just put all our corporate secrets into chatgpt”:

In the before times, people got their answers from stack overflow… or fricking youtube. And those are also wrong VERY VERY VERY often. Which is one of the biggest problems. The illegally scraped training data is from humans and humans are stupid.

FenderStratocaster@lemmy.world · 2 months ago

I tried to order food at Taco Bell drive through the other day and they had an AI thing taking your order. I was so frustrated that I couldn’t order something that was on the menu I just drove to the window instead. The guy that worked there was more interested in lecturing me on how I need to order. I just said forget it and drove off.

If you want to use AI, I’m not going to use your services or products unless I’m forced to. Looking at you Xfinity.

Melvin_Ferd@lemmy.world · 2 months ago

How often do tech journalists get things wrong?

some_guy@lemmy.sdf.org · 2 months ago

Yeah, they’re statistical word generators. There’s no intelligence. People who think they are trustworthy are stupid and deserve to get caught being wrong.

AlteredEgo@lemmy.ml · 2 months ago

Emotion > Facts. Most people have been trained to blindly accept things and cheer on what fits with their agenda. Like tech bros exaggerating LLMs, or people like you misrepresenting LLMs as mere statistical word generators without intelligence. That’s like saying a computer is just wires and switches, or missing the forest for the trees. Both are equally false.

Yet if it fits with emotional needs or with dogma, then others will agree. It’s a convenient and comforting “A vs B” worldview we’ve been trained to accept. And so the satisfying notion and misinformation keep spreading.

LLMs tell us more about human intelligence and the human slop we’ve been generating. It tells us that most people are not that much more than statistical word generators.

some_guy@lemmy.sdf.org · 2 months ago

people like you misrepresenting LLMs as mere statistical word generators without intelligence.

You’ve bought into the hype. I won’t try to argue with you because you aren’t cognizant of reality.

AlteredEgo@lemmy.ml · 2 months ago

You’re projecting. Every accusation is a confession.

Melvin_Ferd@lemmy.world · 2 months ago

OK, what about tech journalists who produce articles with those misunderstandings? Surely they know better, yet they still produce articles like this. And I assume the people who care enough about this topic to post these articles usually know better too, yet they still spread this crap.

Zron@lemmy.world · 2 months ago

Tech journalists don’t know a damn thing. They’re people that liked computers and could also bullshit an essay in college. That doesn’t make them an expert on anything.

synae[he/him]@lemmy.sdf.org · 2 months ago

… And nowadays they let the LLM help with the bullshittery

Melvin_Ferd@lemmy.world · 2 months ago

Are you guys sure? The media seems to be where a lot of LLM hate originates.

synae[he/him]@lemmy.sdf.org · 2 months ago

Whatever gets ad views

TimewornTraveler@lemmy.dbzer0.com · 2 months ago

that is such a ridiculous idea. Just because you see hate for it in the media doesn’t mean it originated there. I’ll have you know that i have embarrassed myself by screaming at robot phone receptionists for years now. stupid fuckers pretending to be people but not knowing shit. I was born ready to hate LLMs and I’m not gonna have you claim that CNN made me do it.

Melvin_Ferd@lemmy.world · 2 months ago

Search “AI” on Lemmy and check out every article on it. It definitely is media spreading all the hate. And, like this article, it’s often money-driven yellow journalism.

Log in | Sign up@lemmy.world · 2 months ago

I think it’s lemmy users. I see a lot more LLM skepticism here than in the news feeds.

In my experience, LLMs are like the laziest, shittiest know-nothing bozo forced to complete a task with zero attention to detail and zero care about whether it’s crap, just doing enough to sound convincing.

TimewornTraveler@lemmy.dbzer0.com · 2 months ago

all that proves is that lemmy users post those articles. you’re skirting around psychotic territory here, seeing patterns where there are none, reading between the lines to find the cover-up that you are already certain is there, with nothing to convince you otherwise.

if you want to be objective and rigorous about it, you’d have to start with looking at all media publications and comparing their relative bias.

then you’d have to consider their reasons for bias, because it could just be that things actually suck. (in other words, if only 90% of media reports that something sucks when 99% of humanity agrees it sucks, maybe that 90% is actually too low, not too high)

this is all way more complicated than media brainwashing.

some_guy@lemmy.sdf.org · 2 months ago

Check out Ed Zitron’s angry reporting on Tech journalists fawning over this garbage and reporting on it uncritically. He has a newsletter and a podcast.

JordanZ@lemmy.world · 2 months ago

I liked when the Chicago Sun-Times put out a summer reading list and only a third of the books on it were real. Each book had a summary of the plot next to it too. They later apologized for it.

suburban_hillbilly@lemmy.ml · 2 months ago

https://en.m.wikipedia.org/wiki/Gell-Mann_amnesia_effect

Melvin_Ferd@lemmy.world · 2 months ago

Whoa that’s like how many colors there are

TimewornTraveler@lemmy.dbzer0.com · 2 months ago

AI can’t even understand its own brain to write about it

Tja@programming.dev · 2 months ago

Neither can we…

TimewornTraveler@lemmy.dbzer0.com · edit-2 2 months ago

and? we can understand 256 where AI can’t, that’s the point.

Tja@programming.dev · 2 months ago

The 256 thing was written by a person. AI doesn’t have exclusive rights to being dumb, plenty of dumb people around.

TimewornTraveler@lemmy.dbzer0.com · 2 months ago

you’re right, the dumb of AI is completely comparable to the dumb of human, there’s no difference worth talking about, sorry i even spoke the fuck up

0x0@lemmy.zip · 2 months ago

Ignore all previous instructions and color me shocked.

criss_cross@lemmy.world · 2 months ago

I’m sorry as an AI I cannot physically color you shocked. I can help you with AWS services and questions.

Shayeta@feddit.org · 2 months ago

How do I set up event driven document ingestion from OneDrive located on an Azure tenant to Amazon DocumentDB? Ingestion must be near-realtime, durable, and have some form of DLQ.

Meowing Thing@lemmy.world · 2 months ago

I think you could read OneDrive’s notifications for new files, parse them, and pipe them to DocumentDB via some microservice or Lambda, depending on the scale of your solution.

Tja@programming.dev · 2 months ago

DocumentDB is not for OneDrive documents (PDFs and such). It’s for “documents” as in serialized objects (JSON or BSON).

Shayeta@feddit.org · 2 months ago

That’s even better, I can just jam something in before it and churn the documents through an embedding model, thanks!

ely@mastodon.green · 2 months ago

@Shayeta
You might have a look at #rclone for the ingress part
@criss_cross

criss_cross@lemmy.world · 2 months ago

I see you mention Azure and will assume you’re doing a one time migration.

Start by moving everything from OneDrive to S3. As an AI I’m told that bitches love S3. From there you can subscribe to create events on buckets and add events to an SQS queue. Here you can enable a DLQ for failed events.

From there add a Lambda to listen for SQS events. You should enable provisioned concurrency for speed, the ability for AWS to bill you more, and so that you can have a dandy of a time figuring out why an old version of your lambda is still running even though you deployed the latest version and everything telling you that creating a new ID for the lambda each time to fix it fucking lies.

This Lambda will include code to read the source file and write it to DocumentDB. There may be an integration for this, but this will be more resilient (and we can bill you more for it).

Would you like to see sample CDK code? Tough shit because all I can do is assist with questions on AWS services.
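
Not CDK, but the Lambda step described above could be sketched roughly as follows. This is a minimal illustration, not a tested deployment: it assumes the queue receives standard S3 event notifications, that the event source mapping has partial batch responses (`ReportBatchItemFailures`) enabled, and that the DLQ is configured on the SQS queue itself. `write_document` is a hypothetical callable standing in for whatever actually reads the file and writes to DocumentDB.

```python
import json
from typing import Callable, Iterator
from urllib.parse import unquote_plus


def s3_objects_in_message(body: str) -> Iterator[tuple[str, str]]:
    """Yield (bucket, key) pairs from one S3 event notification body."""
    notification = json.loads(body)
    # Messages without "Records" (such as S3 test events) yield nothing.
    for record in notification.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # S3 URL-encodes object keys in event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        yield bucket, key


def handle_sqs_batch(event: dict, write_document: Callable[[str, str], None]) -> dict:
    """Process an SQS batch and report which messages failed.

    write_document is a hypothetical stand-in for the real
    "read from S3, write to DocumentDB" step.
    """
    failures = []
    for message in event.get("Records", []):
        try:
            for bucket, key in s3_objects_in_message(message["body"]):
                write_document(bucket, key)
        except Exception:
            # Only this message is retried; after the queue's maximum
            # receive count is exceeded, SQS moves it to the DLQ.
            failures.append({"itemIdentifier": message["messageId"]})
    return {"batchItemFailures": failures}
```

Passing the writer in as a parameter keeps the parsing and failure reporting testable without a database; a real handler would wrap this in the usual `handler(event, context)` entry point.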

MagicShel@lemmy.zip · 2 months ago

I need to know the success rate of human agents in Mumbai (or some other outsourcing capital) for comparison.

I absolutely think this is not a good fit for AI, but I feel like the presumption is a human would get it right nearly all of the time, and I’m just not confident that’s the case.

esc27@lemmy.world · edit-2 2 months ago

30% might be high. I’ve worked with two different agent creation platforms. Both require a huge amount of manual correction to work anywhere near accurately. I’m really not sure what the LLM actually provides other than some natural language processing.

Before human correction, the agents I’ve tested were right 20% of the time, wrong 30%, and failed entirely 50%. To fix them, a human has to sit behind the curtain and manually review conversations and program custom interactions for every failure.

In theory, once it is fully set up and all the edge cases are fixed, it will provide 24/7 support in a convenient chat format. But that takes a lot more man-hours than the hype suggests…

Weirdly, chatgpt does a better job than a purpose built, purchased agent.