LOOK MAA I AM ON FRONT PAGE
No shit. This isn’t new.
Fair, but the same is true of me. I don’t actually “reason”; I just have a set of algorithms memorized by which I propose a pattern that seems like it might match the situation, then a different pattern by which I break the situation down into smaller components and then apply patterns to those components. I keep the process up for a while. If I find a “nasty logic error” pattern match at some point in the process, I “know” I’ve found a “flaw in the argument” or “bug in the design”.
But there’s no from-first-principles method by which I developed all these patterns; it’s just things that have survived the test of time when other patterns have failed me.
I don’t think people are underestimating the power of LLMs to think; I just think people are overestimating the power of humans to do anything other than language prediction and sensory pattern prediction.
This whole era of AI has certainly pushed us to the brink of existential-crisis territory. I think some are even frightened to entertain the prospect that we may not be all that much better than meat machines that, on a basic level, do pattern matching drawing on the sum total of individual life experience (aka the dataset).
Higher reasoning is taught to humans. We have the capability. That’s why we spend the first quarter of our lives in education. Sometimes not all of us are able.
It would certainly make waves if researchers did studies on whether dumber humans are any different from AI.
You're either an LLM, or you don't know how your brain works.
LLMs don't know how they work
Fucking obviously. Until Data's positronic brain becomes reality, AI is not actual intelligence.
It’s an expensive carbon spewing parrot.
It’s a very resource intensive autocomplete
Most humans don’t reason. They just parrot shit too. The design is very human.
That's why CEOs love them. When your job is 90% spewing BS, a machine that does that is impressive.
I hate this analogy. As a throwaway whimsical quip it’d be fine, but it’s specious enough that I keep seeing it used earnestly by people who think that LLMs are in any way sentient or conscious, so it’s lowered my tolerance for it as a topic even if you did intend it flippantly.
Yeah, I've always said the flaw in Turing's Imitation Game concept is that if an AI were indistinguishable from a human, it wouldn't prove it's intelligent. Because humans are dumb as shit. Dumb enough to force one of the smartest people in the world to take a ton of drugs which eventually killed him, simply because he was gay.
I think that person had to choose between the drugs or hardcore 1950s English prison, where being a bit odd was enough to guarantee "an incredibly difficult time", as they say in England. I would've chosen the drugs as well, hoping they would fix me. Too bad that without testosterone you're going to be suicidal and depressed; I'd rather keep my hair than be horny all the time.
Yeah, we're so stupid we've figured out advanced maths and physics, and built incredible skyscrapers and the LHC. We may, as individuals, be more or less intelligent, but humans as a whole are incredibly intelligent.
I’ve heard something along the lines of, “it’s not when computers can pass the Turing Test, it’s when they start failing it on purpose that’s the real problem.”
LLMs deal with tokens. Essentially, predicting a series of bytes.
Humans do much, much, much, much, much, much, much more than that.
No. They don’t. We just call them proteins.
“They”.
What are you?
You are either vastly overestimating the Language part of an LLM or simplifying human physiology back to the Greeks' Four Humours theory.
No. I’m not. You’re nothing more than a protein based machine on a slow burn. You don’t even have control over your own decisions. This is a proven fact. You’re just an ad hoc justification machine.
Of course, that is obvious to anyone with basic knowledge of neural networks, no?
I still remember Geoff Hinton’s criticisms of backpropagation.
IMO it is still remarkable what NNs managed to achieve: some form of emergent intelligence.
What’s hilarious/sad is the response to this article over on reddit’s “singularity” sub, in which all the top comments are people who’ve obviously never got all the way through a research paper in their lives all trashing Apple and claiming their researchers don’t understand AI or “reasoning”. It’s a weird cult.
I think it's important to note (I'm not an LLM; I know that phrase triggers you to assume I am) that they haven't proven this is an inherent architectural issue, which I think would be the next step for the assertion.
Do we know that they don't and are incapable of reasoning, or do we just know that for x problems they jump to memorized solutions? Is it possible to create an arrangement of weights that can genuinely reason, even if the current models don't? That's the big question that needs answering. It's still possible that we just haven't properly incentivized reasoning over memorization during training.
if someone can objectively answer “no” to that, the bubble collapses.
In case you haven’t seen it, the paper is here - https://machinelearning.apple.com/research/illusion-of-thinking (PDF linked on the left).
The puzzles the researchers have chosen are spatial and logical reasoning puzzles - so certainly not the natural domain of LLMs. Unfortunately, the paper doesn't give a clear definition of reasoning; I might summarise it as "analysing a scenario and extracting rules that allow you to achieve a desired outcome".
They also don’t provide the prompts they use - not even for the cases where they say they provide the algorithm in the prompt, which makes that aspect less convincing to me.
What I did find noteworthy was how the models were able to provide around 100 steps correctly for larger Tower of Hanoi problems, but only 4 or 5 correct steps for larger River Crossing problems. I think the River Crossing problem is like the one where you have a boatman who wants to get a fox, a chicken and a bag of rice across a river, but can only take two in his boat at one time? In any case, the researchers suggest that this could be because there are plenty of examples of Tower of Hanoi with larger numbers of disks, while not so many examples of River Crossing with many more than the typical number of items being ferried across. This is more evidence that the LLMs (and LRMs) are merely recalling examples they've seen, rather than genuinely working them out.
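For context, the paper's exact prompts aren't published, but the step-by-step procedure for Tower of Hanoi that the researchers describe handing to the models would presumably be the classic recursive algorithm. A minimal sketch of what "providing the algorithm" could look like:

```python
# Classic recursive Tower of Hanoi -- an illustration of the kind of
# step-by-step procedure described, not the paper's actual prompt.
def hanoi(n, src, aux, dst, moves):
    """Append the moves that shift n disks from src to dst."""
    if n == 0:
        return
    hanoi(n - 1, src, dst, aux, moves)   # park n-1 disks on the spare peg
    moves.append((src, dst))             # move the largest remaining disk
    hanoi(n - 1, aux, src, dst, moves)   # stack the n-1 disks back on top

moves = []
hanoi(7, "A", "B", "C", moves)
print(len(moves))  # 2^7 - 1 = 127 moves
```

An n-disk puzzle takes 2^n − 1 moves, so "around 100 correct steps" corresponds to roughly a 7-disk instance.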
do we know that they don’t and are incapable of reasoning.
“even when we provide the algorithm in the prompt—so that the model only needs to execute the prescribed steps—performance does not improve”
That indicates that this particular model does not follow instructions, not that it is architecturally fundamentally incapable.
Not "this particular model". Frontier LRMs such as OpenAI's o1/o3, DeepSeek-R1, Claude 3.7 Sonnet Thinking, and Gemini Thinking.
The paper shows that Large Reasoning Models as defined today cannot interpret instructions. Their architecture does not allow it.
Those particular models. It does not prove the architecture doesn't allow it at all. It's still possible that this is solvable with a different training technique and that none of those models are using the right one; that's what they'd need to prove wrong.
this proves the issue is widespread, not fundamental.
Is “model” not defined as architecture+weights? Those models certainly don’t share the same architecture. I might just be confused about your point though
It is, but this did not prove all architectures cannot reason, nor did it prove that all sets of weights cannot reason.
Essentially, they did not prove the issue is fundamental. And they have pretty similar architectures; they're all transformers trained in a similar way. I would not say they have different architectures.
The architecture of these LRMs may make monkeys fly out of my butt. It hasn’t been proven that the architecture doesn’t allow it.
You are asking to prove a negative. The onus is to show that the architecture can reason. Not to prove that it can’t.
That's very true. I'm just saying this paper did not eliminate the possibility, and is thus not as significant as it sounds. If they had accomplished that, the bubble would collapse; this will not meaningfully change anything, however.
also, it’s not as unreasonable as that because these are automatically assembled bundles of simulated neurons.
This paper does provide a solid proof by counterexample of reasoning not occurring (following an algorithm) when it should.
The paper doesn't need to prove that reasoning never has or will occur. It only demonstrates that current claims of AI reasoning are overhyped.
Why would they “prove” something that’s completely obvious?
The burden of proof is on the grifters who have overwhelmingly been making false claims and distorting language for decades. Unfortunately the grifters and these researchers are the same people.
That’s called science
They’re just using the terminology that’s widespread in the field. In a sense, the paper’s purpose is to prove that this terminology is unsuitable.
I understand that people in this “field” regularly use pseudo-scientific language (I actually deleted that part of my comment).
But the terminology has never been suitable so it shouldn’t be used in the first place. It pre-supposes the hypothesis that they’re supposedly “disproving”. They’re feeding into the grift because that’s what the field is. That’s how they all get paid the big bucks.
Not when large swaths of people are being told to use it everyday. Upper management has bought in on it.
Yep. I’m retired now, but before retirement a month or so ago, I was working on a project that relied on several hundred people back in 2020. “Why can’t AI do it?”
The people I worked with are continuing the research and putting it up against the human coders, but…there was definitely an element of “AI can do that, we won’t need people” next time. I sincerely hope management listens to reason. Our decisions would lead to potentially firing people, so I think we were able to push back on the “AI can make all of these decisions”…for now.
The AI people were all in, they were ready to build an interface that told the human what the AI would recommend for each item. Errrm, no, that’s not how an independent test works. We had to reel them back in.
Why would they “prove” something that’s completely obvious?
I don't want to be critical, but I think if you step back a bit and look at what you're saying, you're asking why we would bother to experiment and prove what we think we know.
That's a perfectly normal and reasonable scientific pursuit. Yes, in a rational society the burden of proof would be on the grifters, but that's never how it actually works. It's always the doctors disproving the cure-all, not the snake oil salesmen failing to prove their own product.
There is value in this research, even if it fits what you already believe on the subject. I would think you would be thrilled to have your hypothesis confirmed.
The sticky wicket is proving that humans (functioning 'normally') do more than pattern-match.
I think if you look at child development research, you’ll see that kids can learn to do crazy shit with very little input, waaay less than you’d need to train a neural net to do the same. So either kids are the luckiest neural nets and always make the correct adjustment after failing, or they have some innate knowledge that isn’t pattern-based at all.
There’s even some examples in linguistics specifically, where children tend towards certain grammar rules despite all evidence in their language pointing to another rule. Pure pattern-matching would find the real-world rule without first modelling a different (universally common) rule.
While I hate LLMs with a passion, and my opinion of them boils down to their being glorified search engines and data scrapers, I would ask Apple: how sour are the grapes, eh?
edit: wording
Employers who are foaming at the mouth at the thought of replacing their workers with cheap AI:
🫢
Can’t really replace. At best, this tech will make employees more productive at the cost of the rainforests.
Yes but asshole employers haven’t realized this yet
Thank you Captain Obvious! Only those who think LLMs are like "little people in the computer" didn't know this already.
Yeah, well there are a ton of people literally falling into psychosis, led by LLMs. So it’s unfortunately not that many people that already knew it.
Dude, they made ChatGPT a little more bootlicky and now many people are convinced they're literal messiahs. All it took for them was a chatbot and a few hours of talk.
So they have worked out that LLMs do what they were programmed to do in the way that they were programmed? Shocking.
The difference between reasoning models and normal models is that reasoning models take two steps. To oversimplify it a little, they first prompt "how would you go about responding to this?", then prompt "write the response".
It’s still predicting the most likely thing to come next, but the difference is that it gives the chance for the model to write the most likely instructions to follow for the task, then the most likely result of following the instructions - both of which are much more conformant to patterns than a single jump from prompt to response.
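That two-step flow can be sketched roughly like this, with `call_llm` as a hypothetical stand-in for whatever chat-completion API is actually in use (it is not a real library call):

```python
# Oversimplified two-step "reasoning model" flow, as described above.
# call_llm is a hypothetical stand-in for a real chat-completion API.
def call_llm(prompt: str) -> str:
    # Placeholder: a real implementation would hit a model endpoint here.
    return f"<model output for: {prompt!r}>"

def reasoning_answer(task: str) -> str:
    # Step 1: the model writes its own plan ("how would you go about this?").
    plan = call_llm(f"How would you go about responding to this?\n{task}")
    # Step 2: the model writes the response, conditioned on that plan --
    # still next-token prediction, just with the plan in its context.
    return call_llm(f"Task: {task}\nPlan: {plan}\nNow write the response.")
```

Both calls are plain next-token prediction; the only difference is that the second one gets the model's own instructions in its context window.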
But it still manages to fuck it up.
I’ve been experimenting with using Claude’s Sonnet model in Copilot in agent mode for my job, and one of the things that’s become abundantly clear is that it has certain types of behavior that are heavily represented in the model, so it assumes you want that behavior even if you explicitly tell it you don’t.
Say you’re working in a yarn workspaces project, and you instruct Copilot to build and test a new dashboard using an instruction file. You’ll need to include explicit and repeated reminders all throughout the file to use yarn, not NPM, because even though yarn is very popular today, there are so many older examples of using NPM in its model that it’s just going to assume that’s what you actually want - thereby fucking up your codebase.
I’ve also had lots of cases where I tell it I don’t want it to edit any code, just to analyze and explain something that’s there and how to update it… and then I have to stop it from editing code anyway, because halfway through it forgot that I didn’t want edits, just explanations.
To be fair, the world of JavaScript is such a clusterfuck… Can you really blame the LLM for needing constant reminders about the specifics of your project?
When a programming language has five hundred bazillion absolutely terrible ways of accomplishing a given thing—and endless absolutely awful code examples on the Internet to “learn from”—you’re just asking for trouble. Not just from trying to get an LLM to produce what you want but also trying to get humans to do it.
This is why LLMs are so fucking good at writing rust and Python: There’s only so many ways to do a thing and the larger community pretty much always uses the same solutions.
JavaScript? How can it even keep up? You're using yarn today, but in a year you'll probably be like, "fuuuuck this code is garbage… I need to convert this all to [new thing]."
That’s only part of the problem. Yes, JavaScript is a fragmented clusterfuck. Typescript is leagues better, but by no means perfect. Still, that doesn’t explain why the LLM can’t recall that I’m using Yarn while it’s processing the instruction that specifically told it to use Yarn. Or why it tries to start editing code when I tell it not to. Those are still issues that aren’t specific to the language.
I’ve also had lots of cases where I tell it I don’t want it to edit any code, just to analyze and explain something that’s there and how to update it… and then I have to stop it from editing code anyway, because halfway through it forgot that I didn’t want edits, just explanations.
I find it hilarious that the only people these LLMs mimic are the incompetent ones. I had a coworker who constantly changed things when asked to explain them.
You know, despite not really believing LLM “intelligence” works anywhere like real intelligence, I kind of thought maybe being good at recognizing patterns was a way to emulate it to a point…
But that study seems to prove they're still not even good at that. At first I was wondering how hard the puzzles must have been, and then there's a bit about LLMs finishing 100-move Tower of Hanoi problems (which they were trained on) while failing 4-move river crossings. Logically, those problems are very similar… Also, they fail to apply a step-by-step solution they were given.
This paper doesn’t prove that LLMs aren’t good at pattern recognition, it demonstrates the limits of what pattern recognition alone can achieve, especially for compositional, symbolic reasoning.
Computers are awesome at "recognizing patterns" as long as the pattern is a statistical average of some possibly worthless data set. And it really helps if the computer is set up ahead of time to recognize pre-determined patterns.
XD so, like a regular school/university student that just wants to get passing grades?