LOOK MAA I AM ON FRONT PAGE
No shit. This isn’t new.
Fair, but the same is true of me. I don’t actually “reason”; I just have a set of algorithms memorized by which I propose a pattern that seems like it might match the situation, then a different pattern by which I break the situation down into smaller components and then apply patterns to those components. I keep the process up for a while. If I find a “nasty logic error” pattern match at some point in the process, I “know” I’ve found a “flaw in the argument” or “bug in the design”.
But there’s no from-first-principles method by which I developed all these patterns; it’s just things that have survived the test of time when other patterns have failed me.
I don’t think people are underestimating the power of LLMs to think; I just think people are overestimating the power of humans to do anything other than language prediction and sensory pattern prediction.
This whole era of AI has certainly pushed us to the brink of existential-crisis territory. I think some are even frightened to entertain the prospect that we may not be all that much better than meat machines that, on a basic level, do pattern matching drawing on the sum total of individual life experience (aka the dataset).
Higher reasoning is taught to humans. We have the capability. That’s why we spend the first quarter of our lives in education. Sometimes not all of us are able.
It would certainly make waves if researchers did studies on whether dumber humans are any different from AI.
You're either an LLM, or you don't know how your brain works.
LLMs don't know how they work
Fucking obviously. Until Data's positronic brain becomes reality, AI is not actual intelligence.
It’s an expensive carbon spewing parrot.
It’s a very resource intensive autocomplete
Most humans don’t reason. They just parrot shit too. The design is very human.
That's why CEOs love them. When your job is 90% spewing BS, a machine that does that is impressive.
I hate this analogy. As a throwaway whimsical quip it’d be fine, but it’s specious enough that I keep seeing it used earnestly by people who think that LLMs are in any way sentient or conscious, so it’s lowered my tolerance for it as a topic even if you did intend it flippantly.
Yeah, I've always said the flaw in Turing's Imitation Game concept is that if an AI were indistinguishable from a human, it wouldn't prove it's intelligent. Because humans are dumb as shit. Dumb enough to force one of the smartest people in the world to take a ton of drugs which eventually killed him, simply because he was gay.
I think that person had to choose between the drugs or hardcore 1950s English prison, where being a bit odd was enough to guarantee "an incredibly difficult time", as they say in England. I would've chosen the drugs as well, hoping they would fix me. Too bad that without testosterone you're going to be suicidal and depressed; I'd rather keep my hair than be horny all the time.
Yeah, we're so stupid we've figured out advanced maths and physics, and built incredible skyscrapers and the LHC. We may, as individuals, be more or less intelligent, but humans as a whole are incredibly intelligent.
I’ve heard something along the lines of, “it’s not when computers can pass the Turing Test, it’s when they start failing it on purpose that’s the real problem.”
LLMs deal with tokens. Essentially, predicting a series of bytes.
Humans do much, much, much, much, much, much, much more than that.
No. They don’t. We just call them proteins.
“They”.
What are you?
You are either vastly overestimating the Language part of an LLM or simplifying human physiology back to the Greeks' Four Humours theory.
No. I’m not. You’re nothing more than a protein based machine on a slow burn. You don’t even have control over your own decisions. This is a proven fact. You’re just an ad hoc justification machine.
Of course, that is obvious to anyone with basic knowledge of neural networks, no?
I still remember Geoff Hinton’s criticisms of backpropagation.
IMO it is still remarkable what NNs managed to achieve: some form of emergent intelligence.
What’s hilarious/sad is the response to this article over on reddit’s “singularity” sub, in which all the top comments are people who’ve obviously never got all the way through a research paper in their lives all trashing Apple and claiming their researchers don’t understand AI or “reasoning”. It’s a weird cult.
I think it's important to note (I'm not an LLM; I know that phrase triggers you to assume I am) that they haven't proven this is an inherent architectural issue, which I think would be the next step for the assertion.
Do we know that they don't and are incapable of reasoning, or do we just know that for x problems they jump to memorized solutions? Is it possible to create an arrangement of weights that can genuinely reason, even if the current models don't? That's the big question that needs answering. It's still possible that we just haven't properly incentivized reasoning over memorization during training.
if someone can objectively answer “no” to that, the bubble collapses.
In case you haven’t seen it, the paper is here - https://machinelearning.apple.com/research/illusion-of-thinking (PDF linked on the left).
The puzzles the researchers have chosen are spatial and logical reasoning puzzles - so certainly not the natural domain of LLMs. Unfortunately, the paper doesn't give a clear definition of reasoning; I might summarise it as "analysing a scenario and extracting rules that allow you to achieve a desired outcome".
They also don’t provide the prompts they use - not even for the cases where they say they provide the algorithm in the prompt, which makes that aspect less convincing to me.
What I did find noteworthy was how the models were able to provide around 100 steps correctly for larger Tower of Hanoi problems, but only 4 or 5 correct steps for larger River Crossing problems. I think the River Crossing problem is like the one where you have a boatman who wants to get a fox, a chicken and a bag of rice across a river, but can only take two in his boat at one time? In any case, the researchers suggest that this could be because there are plenty of examples of Tower of Hanoi with larger numbers of disks, while not so many examples of River Crossing with many more than the typical number of items being ferried across. This is more evidence that the LLMs (and LRMs) are merely recalling examples they've seen, rather than genuinely working them out.
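For context, the paper's exact prompts aren't published, but the step-by-step procedure for Tower of Hanoi that the researchers describe handing to the models would presumably be the classic recursive algorithm. A minimal sketch of what "providing the algorithm" could look like:

```python
# Classic recursive Tower of Hanoi -- an illustration of the kind of
# step-by-step procedure described, not the paper's actual prompt.
def hanoi(n, src, aux, dst, moves):
    """Append the moves that shift n disks from src to dst."""
    if n == 0:
        return
    hanoi(n - 1, src, dst, aux, moves)   # park n-1 disks on the spare peg
    moves.append((src, dst))             # move the largest remaining disk
    hanoi(n - 1, aux, src, dst, moves)   # stack the n-1 disks back on top

moves = []
hanoi(7, "A", "B", "C", moves)
print(len(moves))  # 2^7 - 1 = 127 moves
```

An n-disk puzzle takes 2^n − 1 moves, so "around 100 correct steps" corresponds to roughly a 7-disk instance.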
do we know that they don’t and are incapable of reasoning.
“even when we provide the algorithm in the prompt—so that the model only needs to execute the prescribed steps—performance does not improve”
That indicates that this particular model does not follow instructions, not that it is architecturally fundamentally incapable.
Not "this particular model". Frontier LRMs such as OpenAI's o1/o3, DeepSeek-R1, Claude 3.7 Sonnet Thinking, and Gemini Thinking.
The paper shows that Large Reasoning Models as defined today cannot interpret instructions. Their architecture does not allow it.
Those particular models. It does not prove the architecture doesn't allow it at all. It's still possible that this is solvable with a different training technique and that none of those models are using the right one; that's what they'd need to prove wrong.
this proves the issue is widespread, not fundamental.
Is “model” not defined as architecture+weights? Those models certainly don’t share the same architecture. I might just be confused about your point though
It is, but this did not prove all architectures cannot reason, nor did it prove that all sets of weights cannot reason.
Essentially, they did not prove the issue is fundamental. And they have pretty similar architectures; they're all transformers trained in a similar way. I would not say they have different architectures.
The architecture of these LRMs may make monkeys fly out of my butt. It hasn’t been proven that the architecture doesn’t allow it.
You are asking to prove a negative. The onus is to show that the architecture can reason. Not to prove that it can’t.
That's very true. I'm just saying this paper did not eliminate the possibility, and is thus not as significant as it sounds. If they had accomplished that, the bubble would collapse; this will not meaningfully change anything, however.
also, it’s not as unreasonable as that because these are automatically assembled bundles of simulated neurons.
This paper does provide a solid proof by counterexample of reasoning not occurring (following an algorithm) when it should.
The paper doesn't need to prove that reasoning never has or will occur. It only demonstrates that current claims of AI reasoning are overhyped.
Why would they “prove” something that’s completely obvious?
The burden of proof is on the grifters who have overwhelmingly been making false claims and distorting language for decades. Unfortunately the grifters and these researchers are the same people.
That’s called science
They’re just using the terminology that’s widespread in the field. In a sense, the paper’s purpose is to prove that this terminology is unsuitable.
I understand that people in this “field” regularly use pseudo-scientific language (I actually deleted that part of my comment).
But the terminology has never been suitable so it shouldn’t be used in the first place. It pre-supposes the hypothesis that they’re supposedly “disproving”. They’re feeding into the grift because that’s what the field is. That’s how they all get paid the big bucks.
Not when large swaths of people are being told to use it everyday. Upper management has bought in on it.
Yep. I’m retired now, but before retirement a month or so ago, I was working on a project that relied on several hundred people back in 2020. “Why can’t AI do it?”
The people I worked with are continuing the research and putting it up against the human coders, but…there was definitely an element of “AI can do that, we won’t need people” next time. I sincerely hope management listens to reason. Our decisions would lead to potentially firing people, so I think we were able to push back on the “AI can make all of these decisions”…for now.
The AI people were all in, they were ready to build an interface that told the human what the AI would recommend for each item. Errrm, no, that’s not how an independent test works. We had to reel them back in.
Why would they “prove” something that’s completely obvious?
I don't want to be critical, but I think if you step back a bit and look at what you're saying, you're asking why we would bother to experiment and prove what we think we know.
That's a perfectly normal and reasonable scientific pursuit. Yes, in a rational society the burden of proof would be on the grifters, but that's never how it actually works. It's always the doctors disproving the cure-all, not the snake oil salesmen failing to prove their own product.
There is value in this research, even if it fits what you already believe on the subject. I would think you would be thrilled to have your hypothesis confirmed.
The sticky wicket is proving that humans (functioning 'normally') do more than pattern-match.
I think if you look at child development research, you’ll see that kids can learn to do crazy shit with very little input, waaay less than you’d need to train a neural net to do the same. So either kids are the luckiest neural nets and always make the correct adjustment after failing, or they have some innate knowledge that isn’t pattern-based at all.
There’s even some examples in linguistics specifically, where children tend towards certain grammar rules despite all evidence in their language pointing to another rule. Pure pattern-matching would find the real-world rule without first modelling a different (universally common) rule.
While I hate LLMs with a passion, and my opinion of them boils down to their being glorified search engines and data scrapers, I would ask Apple: how sour are the grapes, eh?
edit: wording
Employers who are foaming at the mouth at the thought of replacing their workers with cheap AI:
🫢
Can’t really replace. At best, this tech will make employees more productive at the cost of the rainforests.
Yes but asshole employers haven’t realized this yet
Thank you Captain Obvious! Only those who think LLMs are like "little people in the computer" didn't know this already.
Yeah, well there are a ton of people literally falling into psychosis, led by LLMs. So it’s unfortunately not that many people that already knew it.
Dude, they made ChatGPT a little more bootlicky and now many people are convinced they're literal messiahs. All it took for them was a chatbot and a few hours of talk.
So they have worked out that LLMs do what they were programmed to do in the way that they were programmed? Shocking.
The difference between reasoning models and normal models is that reasoning models take two steps. To oversimplify it a little, they first prompt "how would you go about responding to this?", then prompt "write the response".
It’s still predicting the most likely thing to come next, but the difference is that it gives the chance for the model to write the most likely instructions to follow for the task, then the most likely result of following the instructions - both of which are much more conformant to patterns than a single jump from prompt to response.
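That two-step flow can be sketched roughly like this, with `call_llm` as a hypothetical stand-in for whatever chat-completion API is actually in use (it is not a real library call):

```python
# Oversimplified two-step "reasoning model" flow, as described above.
# call_llm is a hypothetical stand-in for a real chat-completion API.
def call_llm(prompt: str) -> str:
    # Placeholder: a real implementation would hit a model endpoint here.
    return f"<model output for: {prompt!r}>"

def reasoning_answer(task: str) -> str:
    # Step 1: the model writes its own plan ("how would you go about this?").
    plan = call_llm(f"How would you go about responding to this?\n{task}")
    # Step 2: the model writes the response, conditioned on that plan --
    # still next-token prediction, just with the plan in its context.
    return call_llm(f"Task: {task}\nPlan: {plan}\nNow write the response.")
```

Both calls are plain next-token prediction; the only difference is that the second one gets the model's own instructions in its context window.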
But it still manages to fuck it up.
I’ve been experimenting with using Claude’s Sonnet model in Copilot in agent mode for my job, and one of the things that’s become abundantly clear is that it has certain types of behavior that are heavily represented in the model, so it assumes you want that behavior even if you explicitly tell it you don’t.
Say you’re working in a yarn workspaces project, and you instruct Copilot to build and test a new dashboard using an instruction file. You’ll need to include explicit and repeated reminders all throughout the file to use yarn, not NPM, because even though yarn is very popular today, there are so many older examples of using NPM in its model that it’s just going to assume that’s what you actually want - thereby fucking up your codebase.
I’ve also had lots of cases where I tell it I don’t want it to edit any code, just to analyze and explain something that’s there and how to update it… and then I have to stop it from editing code anyway, because halfway through it forgot that I didn’t want edits, just explanations.
To be fair, the world of JavaScript is such a clusterfuck… Can you really blame the LLM for needing constant reminders about the specifics of your project?
When a programming language has five hundred bazillion absolutely terrible ways of accomplishing a given thing—and endless absolutely awful code examples on the Internet to “learn from”—you’re just asking for trouble. Not just from trying to get an LLM to produce what you want but also trying to get humans to do it.
This is why LLMs are so fucking good at writing rust and Python: There’s only so many ways to do a thing and the larger community pretty much always uses the same solutions.
JavaScript? How can it even keep up? You're using yarn today, but in a year you'll probably be like, "fuuuuck this code is garbage… I need to convert this all to [new thing]."
That’s only part of the problem. Yes, JavaScript is a fragmented clusterfuck. Typescript is leagues better, but by no means perfect. Still, that doesn’t explain why the LLM can’t recall that I’m using Yarn while it’s processing the instruction that specifically told it to use Yarn. Or why it tries to start editing code when I tell it not to. Those are still issues that aren’t specific to the language.
I’ve also had lots of cases where I tell it I don’t want it to edit any code, just to analyze and explain something that’s there and how to update it… and then I have to stop it from editing code anyway, because halfway through it forgot that I didn’t want edits, just explanations.
I find it hilarious that the only people these LLMs mimic are the incompetent ones. I had a coworker who constantly changed things when asked to explain them.
You know, despite not really believing LLM “intelligence” works anywhere like real intelligence, I kind of thought maybe being good at recognizing patterns was a way to emulate it to a point…
But that study seems to prove they're still not even good at that. At first I was wondering how hard the puzzles must have been, and then there's a bit about LLMs finishing 100-move Tower of Hanoi problems (which they were trained on) while failing 4-move river crossings. Logically, those problems are very similar… Also, they fail to apply a step-by-step solution they were given.
This paper doesn’t prove that LLMs aren’t good at pattern recognition, it demonstrates the limits of what pattern recognition alone can achieve, especially for compositional, symbolic reasoning.
Computers are awesome at "recognizing patterns" as long as the pattern is a statistical average of some possibly worthless data set. And it really helps if the computer is set up ahead of time to recognize pre-determined patterns.
XD so, like a regular school/university student that just wants to get passing grades?