ChatGPT 'got absolutely wrecked' by Atari 2600 in beginner's chess match — OpenAI's newest model bamboozled by 1970s logic

Lifecoach5000@lemmy.world · 3 months ago

krigo666@lemmy.world · edit-2 3 months ago

Next, pit ChatGPT against 1K ZX Chess in a ZX81.

IsaamoonKHGDT_6143@lemmy.zip · 3 months ago

They used ChatGPT 4o, instead of using o1 or o3.

Obviously it was going to fail.

wizardbeard@lemmy.dbzer0.com · edit-2 3 months ago

Other studies (not all chess based or against this old chess AI) show similar lackluster results when using reasoning models.

Edit: When comparing reasoning models to existing algorithmic solutions.

untakenusername@sh.itjust.works · 3 months ago

this is because an LLM is not made for playing chess

AlecSadler@sh.itjust.works · 3 months ago

ChatGPT has been, hands down, the worst AI coding assistant I’ve ever used.

It regularly suggests code that doesn’t compile or isn’t even for the language.

It generally suggests autocompletions that are just a copy of the lines I just wrote.

Sometimes it likes to suggest setting the same property like 5 times.

It is absolute garbage and I do not recommend it to anyone.

j4yt33@feddit.org · 3 months ago

I find it really hit and miss. Easy, standard operations are fine but if you have an issue with code you wrote and ask it to fix it, you can forget it

NoiseColor@lemmy.world · 3 months ago

I like tab coding, writing small blocks of code that it thinks I need. It’s on point almost all the time. This speeds me up.

whoisearth@lemmy.ca · 3 months ago

Bingo. If anything what you’re finding is the people bitching are the same people that if given a bike wouldn’t know how to ride it, which is fair. Some people understand quicker how to use the tools they are given.

AlecSadler@sh.itjust.works · 3 months ago

I’ve found Claude 3.7 and 4.0 and sometimes Gemini variants still leagues better than ChatGPT/Copilot.

Still not perfect, but night and day difference.

I feel like ChatGPT didn’t focus on coding and instead focused on mainstream, but I am not an expert.

DragonTypeWyvern@midwest.social · edit-2 3 months ago

Gemini will get basic C++, probably the best documented language for beginners out there, right about half of the time.

I think that might even be the problem, honestly, a bunch of new coders post bad code and it’s fixed in comments but the LLM CAN’T realize that.

Blackmist@feddit.uk · 3 months ago

It’s the ideal help for people who shouldn’t be employed as programmers to start with.

I had to explain hexadecimal to somebody the other day. It’s honestly depressing.

Etterra@discuss.online · 3 months ago

That’s because it doesn’t know what it’s saying. It’s just blathering out each word as what it estimates to be the likely next word given past examples in its training data. It’s a statistics calculator. It’s marginally better than just smashing the auto fill on your cell repeatedly. It’s literally dumber than a parrot.
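
For anyone who wants to see the "likely next word given past examples" idea in miniature, here is a rough sketch. It is a toy bigram counter, not how a real LLM works internally (those are neural networks over tokens with far more context); the corpus and the function names `train_bigrams` and `predict_next` are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count, for every word, which words followed it in the training text."""
    follows = defaultdict(Counter)
    words = text.lower().split()
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows

def predict_next(follows, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = follows.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Made-up training text, purely for illustration.
corpus = "the knight takes the pawn and the knight takes the rook"
model = train_bigrams(corpus)

print(predict_next(model, "knight"))  # most common word after "knight"
print(predict_next(model, "queen"))   # never seen in training, so no prediction
```

The predictor has no notion of a board or of rules. It only knows which word tended to come next, which is the point the comment is making.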

AnUnusualRelic@lemmy.world · 3 months ago

Parrots are actually intelligent though.

nutsack@lemmy.dbzer0.com · 3 months ago

my favorite thing is to constantly be implementing libraries that don’t exist

arc99@lemmy.world · 3 months ago

It’s even worse when AI soaks up some project whose APIs are constantly changing. Try using AI to code against jetty for example and you’ll be weeping.

jj4211@lemmy.world · 3 months ago

Oh man, I feel this. A couple of times I’ve had to field questions about some REST API I support and they ask why they get errors when they supply a specific attribute. Now that attribute never existed, not in our code, not in our documentation, we never thought of it. So I say “Well, that attribute is invalid, I’m not sure where you saw to do that”. They get insistent that the code is generated by a very good LLM, so we must be missing something…

Blackmist@feddit.uk · 3 months ago

You’re right. That library was removed in ToolName [PriorVersion]. Please try this instead.

*makes up entirely new fictitious library name*

arc99@lemmy.world · 3 months ago

All AIs are the same. They’re just scraping content from GitHub, stackoverflow etc with a bunch of guardrails slapped on to spew out sentences that conform to their training data but there is no intelligence. They’re super handy for basic code snippets but anyone using them for anything remotely complex or nuanced will regret it.

NateNate60@lemmy.world · 3 months ago

One of my mates generated an entire website using Gemini. It was a React web app that tracks inventory for trading card dealers. It actually did come out functional and well-polished. That being said, the AI really struggled with several aspects of the project that humans would not:

It left database secrets in the code
The design of the website meant that it was impossible to operate securely
The quality of the code itself was hot garbage—unreadable and undocumented nonsense that somehow still worked
It did not break the code into multiple files. It piled everything into a single file

ILikeBoobies@lemmy.ca · 3 months ago

I’ve had success with splitting a function into 2 and planning out an overview, though that’s more like talking to myself

I wouldn’t use it to generate stuff though

Mobiuthuselah@lemm.ee · 3 months ago

I don’t use it for coding. I use it sparingly really, but want to learn to use it more efficiently. Are there any areas in which you think it excels? Are there others that you’d recommend instead?

Uairhahs@lemmy.world · 3 months ago

Use Gemini (2.5) or Claude (3.7 and up). OpenAI is a shitshow

Halosheep@lemm.ee · 3 months ago

I swear every single article critical of current LLMs is like, “The square got BLASTED by the triangle shape when it completely FAILED to go through the triangle shaped hole.”

drspod@lemmy.ml · 3 months ago

It’s newsworthy when the sellers of squares are saying that nobody will ever need a triangle again, and the shape-sector of the stock market is hysterically pumping money into companies that make or use squares.

PushButton@lemmy.world · 3 months ago

You get 2 triangles in a single square mate…

CHECKMATE!

Acid_Burn@lemmy.dbzer0.com · 3 months ago

Touchdown! 3 points!

MrSqueezles@lemmy.world · 3 months ago

The press release where OpenAI said we’d never need chess players again

inconel@lemmy.ca · 3 months ago

It’s also from a company claiming they’re getting closer to creating a morphing shape that can match any hole.

DragonTypeWyvern@midwest.social · 3 months ago

And yet the company offers no explanation for how, exactly, they’re going to get wood to do that.

ipkpjersi@lemmy.ml · 3 months ago

That’s just clickbait in general these days lol

Endymion_Mallorn@kbin.melroy.org · 3 months ago

I mean, that 2600 Chess was built from the ground up to play a good game of chess with variable difficulty levels. I bet there’s days or games when Fischer couldn’t have beaten it. Just because a thing is old and less capable than the modern world does not mean it’s bad.

Steve Dice@sh.itjust.works · 3 months ago

2025 Mazda MX-5 Miata ‘got absolutely wrecked’ by Inflatable Boat in beginner’s boat racing match — Mazda’s newest model bamboozled by 1930s technology.

Furbag@lemmy.world · 3 months ago

Can ChatGPT actually play chess now? Last I checked, it couldn’t remember more than 5 moves of history so it wouldn’t be able to see the true board state and would make illegal moves, take its own pieces, materialize pieces out of thin air, etc.

Pamasich@kbin.earth · 3 months ago

There are custom GPTs which claim to play at a stockfish level or be literally stockfish under the hood (I assume the former is still the latter just not explicitly). Haven’t tested them, but if they work, I’d say yes. An LLM itself will never be able to play chess or do anything similar, unless they outsource that task to another tool that can. And there seem to be GPTs that do exactly that.

As for why we need ChatGPT then when the result comes from Stockfish anyway, it’s for the natural language prompts and responses.

ToastedRavioli@midwest.social · 3 months ago

ChatGPT must adhere honorably to the rules that it’s making up on the spot. That’s Dallas

skisnow@lemmy.ca · edit-2 3 months ago

It can’t, but that didn’t stop a bunch of gushing articles a while back about how it had an Elo of 2400 and other such nonsense. Turns out you could get it to have an Elo of 2400 under a very, very specific set of circumstances, which include correcting it every time it hallucinated pieces or attempted to make illegal moves.
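
To make the "correcting it every time" part concrete, here is a minimal sketch of the kind of bookkeeping a harness would need. It is an assumption about how such a check could look, not what any of those articles actually ran. Moves are plain coordinate strings like "e2e4", and the function names are hypothetical. It only catches moving a piece that isn't there, moving the opponent's piece, and capturing your own piece; it does not check how pieces are allowed to move.

```python
def starting_board():
    """Map squares like 'e2' to piece letters; uppercase is White, lowercase is Black."""
    board = {}
    back_rank = "rnbqkbnr"
    for i, file in enumerate("abcdefgh"):
        board[f"{file}1"] = back_rank[i].upper()
        board[f"{file}2"] = "P"
        board[f"{file}7"] = "p"
        board[f"{file}8"] = back_rank[i]
    return board

def check_move(board, move, white_to_move):
    """Return an error string for an obviously impossible move, else None.

    Deliberately incomplete: no piece movement rules, checks, castling,
    promotion, or en passant.
    """
    src, dst = move[:2], move[2:4]
    piece = board.get(src)
    if piece is None:
        return f"no piece on {src}"
    if piece.isupper() != white_to_move:
        return f"piece on {src} belongs to the opponent"
    target = board.get(dst)
    if target is not None and target.isupper() == white_to_move:
        return f"{dst} holds your own piece"
    return None

def apply_moves(moves):
    """Play the moves in order, rejecting impossible ones without passing the turn."""
    board = starting_board()
    white_to_move = True
    rejected = []
    for move in moves:
        error = check_move(board, move, white_to_move)
        if error:
            rejected.append((move, error))
            continue  # same side has to propose another move
        board[move[2:4]] = board.pop(move[:2])
        white_to_move = not white_to_move
    return board, rejected

# "d1d2" tries to capture White's own pawn; "d5d6" moves a piece that does not exist.
board, rejected = apply_moves(["e2e4", "e7e5", "d1d2", "d5d6", "g1f3"])
print(rejected)
```

Every rejection here is a correction the model did not have to make itself, which is why a rating earned this way says little about unassisted play.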

bountygiver [any]@lemmy.ml · 3 months ago

and still lose to stockfish even after conjuring 3 queens out of thin air lol

Robust Mirror@aussie.zone · edit-2 3 months ago

It could always play it if you reminded it of the board state every move. Not well, but at least generally legally. And while I know elites can play chess blind, the average person can’t, so it was always kind of harsh to hold it to that standard and criticise it for not being able to remember more than 5 moves when most people can’t do that themselves.

Besides that, it was never designed to play chess. It would be like insulting Watson the Jeopardy bot for losing against the Atari chess bot, it’s not what it was designed to do.

Pamasich@kbin.earth · 3 months ago

Isn’t the Atari just a game console, not a chess engine?

Like, Wikipedia doesn’t mention anything about the Atari 2600 having a built-in chess engine.

If they were willing to run a chess game on the Atari 2600, why did they not apply the same to ChatGPT? There are custom GPTs which claim to use a stockfish API or play at a similar level.

Like this, it’s just unfair. Both platforms are not designed to deal with the task by themselves, but one of them is given the necessary tooling, the other one isn’t. No matter what you think of ChatGPT, that’s not a fair comparison.

jj4211@lemmy.world · 3 months ago

GPTs which claim to use a stockfish API

Then the actual chess isn’t LLM. If you are going stockfish, then the LLM doesn’t add anything, stockfish is doing everything.

The whole point is that the marketing rage claims LLMs can do all kinds of stuff, doubling down on this with the branding of some approaches as “reasoning” models, which are roughly “similar to ‘pre-reasoning’, but forcing use of more tokens on disposable intermediate generation steps”. With this facet of LLM marketing, the promise would be that the LLM can “reason” itself through a chess game without particular enablement. In practice, people trying to feed in gobs of chess data to an LLM end up with an LLM that doesn’t even comply with the rules of the game, let alone provide reasonable competitive responses to an opponent.

Pamasich@kbin.earth · 3 months ago

Then the actual chess isn’t LLM.

And neither did the Atari 2600 win against ChatGPT. Whatever game they ran on it did.

That’s my point here. The fact that neither Atari 2600 nor ChatGPT are capable of playing chess on their own. They can only do so if you provide them with the necessary tools. Which applies to both of them. Yet only one of them was given those tools here.

jj4211@lemmy.world · 3 months ago

Fine, a chess engine capable of running on 1970s electronics that were affordable even for the time will best what marketing folks would have you think is an arbitrarily capable “reasoning” model running on top of the line 2025 hardware.

You can split hairs about “well actually, the 2600 is hardware and a chess engine is the software” but everyone gets the point.

As to assertions that no one should expect an LLM to be a chess engine, well tell that to the industry that is asserting that LLMs are now “reasoning” and provide a basis to replace most of the labor pool. We need stories like this to calibrate expectations in a way common people can understand…

NutWrench@lemmy.ml · 3 months ago

The Atari 2600 is just hardware. The software came on plug-in cartridges. Video Chess was released for it in 1979.

3 months ago

This isn’t the strength of gpt-o4; the model has been optimised for tool use as an agent. That’s why it’s so good at image gen relative to their models: it uses tools to construct an image piece by piece, similar to a human. Also, probably poor system prompting. An LLM is not a universal thinking machine, it’s a universal process machine. An LLM understands the process and uses tools to accomplish the process, hence its strengths in writing code (especially as an agent).

It’s similar to how a monkey is infinitely better at remembering a sequence of numbers than a human ever could be but is totally incapable of even comprehending writing down numbers.

cheese_greater@lemmy.world · 3 months ago

Do you have a source for that re:monkeys memorizing numerical sequences? What do you mean by that?

shalafi@lemmy.world · 3 months ago

That threw me as well.

RememberTheEnding@lemmy.world · 3 months ago

https://www.youtube.com/watch?v=MKvX9PPmI-Q

FourWaveforms@lemm.ee · 3 months ago

If you don’t play chess, the Atari is probably going to beat you as well.

LLMs are only good at things to the extent that they have been well-trained in the relevant areas. Not just learning to predict text string sequences, but reinforcement learning after that, where a human or some other agent says “this answer is better than that one” enough times in enough of the right contexts. It mimics the way humans learn, which is through repeated and diverse exposure.

If they set up a system to train it against some chess program, or (much simpler) simply gave it a tool call, it would do much better. Tool calling already exists and would be by far the easiest way.

It could also be instructed to write a chess solver program and then run it, at which point it would be on par with the Atari, but it wouldn’t compete well with a serious chess solver.
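
A rough idea of what "simply gave it a tool call" means, sketched under some assumptions: the model is taken to emit a JSON request naming a tool, and a small harness runs it. The JSON shape, the `TOOLS` registry, and `best_move` are all hypothetical, and `best_move` is a trivial stand-in rather than a real chess engine or any vendor's actual tool-calling API.

```python
import json

def best_move(legal_moves):
    """Stand-in for a real engine: prefer a capture ('x'), else take the first move.

    Assumes `legal_moves` is a non-empty list of moves in algebraic notation.
    """
    captures = [m for m in legal_moves if "x" in m]
    return (captures or legal_moves)[0]

# Registry of tools the harness is willing to run on the model's behalf.
TOOLS = {"best_move": best_move}

def handle_model_output(raw):
    """Run a tool if the model asked for a known one; otherwise return its text unchanged."""
    try:
        request = json.loads(raw)
    except json.JSONDecodeError:
        return raw  # plain prose, no tool requested
    tool = request.get("tool") if isinstance(request, dict) else None
    if not isinstance(tool, str) or tool not in TOOLS:
        return raw  # unknown or malformed request: do not execute anything
    return TOOLS[tool](**request.get("arguments", {}))

reply = handle_model_output(
    '{"tool": "best_move", "arguments": {"legal_moves": ["Nf3", "exd5", "Bc4"]}}'
)
print(reply)
```

In this setup the move quality comes entirely from whatever sits behind `best_move`. The model's job shrinks to deciding when to ask and wrapping the answer in natural language.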

Asswardbackaddict@lemmy.world · edit-2 3 months ago

While you guys suck at using tools, I’m making up for my lack of coding experience with AI, and successfully simulating the behavior of my aether (fuck you guys. Your search for a static ether is irrelevant to how mine behaves, and you shouldn’t have dismissed everybody from Diogenes to Einstein), showing soliton-like structure emergence and particle-like interactions (with 1D relativistic constraints [I’m gonna need a fucking super computer to scale to 3D]). Anyways, whether you’re wrong about your latest fun fact, cutting your thumb off trying to split a 2X4, or believing any idiot you talk to, this is user error, bro. Creating functional code for my simulator has saved me months, if not years of my life. Just setting up a GUI was ridiculous for a novice like me, let alone translating walls of relativistic equation results (mainly stress-energy tensor) into code a computer can use.

xep@fedia.io · 3 months ago

If you’re writing a novel simulation for a non-trivial system, it might be best to learn to code so you can identify any issues in the simulation later. It’s likely that LLMs do not have the information required to generate good code for this context.

Asswardbackaddict@lemmy.world · 3 months ago

You’re right. I’m not relying on this shit. It’s a tool. Fucking up the GUI is fine, but making any changes I don’t research to my simulator core could fuck up my whole project. It’s a tool that likes to cater to you, and you have to work around that - really, not too different from how much pressure you put on a grinder. You gotta learn how to work it. And, your sentiment is correct. My lack of programming experience is a big hurdle I have to account for and make safeguards against. It would be a huge help if I started from the basics. But, I mean, I also can’t rub two sticks together to heat my home. Doesn’t mean I can’t use this tool to produce reliable results.

petrol_sniff_king@lemmy.blahaj.zone · 3 months ago

The tough guys and sigma males of yester-year used to say things like “If I were homeless, I would just bathe in the creek using the natural animal fats from the squirrel I caught for dinner as soap, win a new job by explaining my 21-days-in-7 workweek ethos, and buy a new home using my shares in my dad’s furniture warehouse as collateral against the loan. It’s not impossible to get back on your feet.”

But with the advent of AI, which, actually, is supposed to do things for you, it’s completely different now.

I also can’t rub two sticks together to heat my home.

Dude, that fucking sucks. What is wrong with you?

Asswardbackaddict@lemmy.world · 3 months ago

You’re so fucking silly. You gonna study cell theory to see how long you should keep vegetables in your fridge? Go home. Save science for people who understand things.

junkthief@lemmy.blahaj.zone · 3 months ago

Save science for people who understand things.

Does this not strike you as the least bit ironic?

MonkderVierte@lemmy.zip · 3 months ago

LLMs are not built for logic.

PushButton@lemmy.world · 3 months ago

And yet everybody is selling them to write code.

The last time I checked, coding was requiring logic.

jj4211@lemmy.world · 3 months ago

To be fair, a decent chunk of coding is stupid boilerplate/minutia that varies environment to environment, language to language, library to library.

So LLM can do some code completion, filling out a bunch of boilerplate that is blatantly obvious, generating the redundant text mandated by certain patterns, and keeping straight details between languages like “does this language want join as a method on a list with a string argument, or vice versa?”

Problem is this can be sometimes more annoying than it’s worth, as miscompletions are annoying.

PushButton@lemmy.world · 3 months ago

Fair point.

I liked the “upgraded autocompletion”, you know, a completion based on the context, just before the time that they pushed it too much with 20 lines of nonsense…

Now I am thinking of a way of doing the thing, then I receive a 20 lines suggestion.

So I am checking if that makes sense, losing my momentum, only to realize the suggestion is calling shit that doesn’t exist…

Screw that.

merdaverse@lemm.ee · 3 months ago

The amount of garbage it spits out in autocomplete is distracting. If it’s constantly making me 5-10% less productive the many times it’s wrong, it should save me a lot of time when it is right, and generally, I haven’t found it able to do that.

Yesterday I tried to prompt it to change around 20 call sites for a function where I had changed the signature. Easy, boring and repetitive, something that a junior could easily do. And all the models were absolutely clueless about it (using copilot)

lambalicious@lemmy.sdf.org · 3 months ago

a decent chunk of coding is stupid boilerplate/minutia that varies

…according to a logic, which means LLMs are bad at it.

jj4211@lemmy.world · 3 months ago

I’d say that those details that vary tend not to vary within a language and ecosystem, so a fairly dumb correlative relationship is enough to generally be fine. There’s no way to use logic to infer that it’s obvious that in language X you need to do mylist.join(string) but in language Y you need to do string.join(mylist), but it’s super easy to recognize tokens that suggest those things and a correlation to the vocabulary that matches the context.

Rinse and repeat for things like: do I need to specify a type, and what is the vocabulary for the best type for a numeric value? This variable that makes sense is missing a declaration; does it look like an actually new, distinct variable or just a typo of one that was declared?

But again, I’m thinking mostly about what can kind of, sort of work. My personal experience is that it’s wrong so often as to be annoying and get in the way of more traditional completion behaviors that play it safe, though with less help, particularly for languages like python or javascript.
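
The join detail mentioned above, spelled out for Python as a small illustration. The list contents are made up, and the JavaScript form appears only in a comment for comparison.

```python
pieces = ["knight", "bishop", "rook"]

# Python: join is a method on the separator string and takes the list as its argument.
joined = ", ".join(pieces)
print(joined)

# JavaScript does it the other way round: pieces.join(", ")
# Python lists have no join method at all, so that form fails here.
print(hasattr(pieces, "join"))
```

Nothing about either language lets you derive which form is correct. It is a convention you have to have seen, which is why plain correlation with surrounding tokens tends to be enough to get it right.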

NeilBrü@lemmy.world · 3 months ago

An LLM is a poor computational paradigm for playing chess.

Bleys@lemmy.world · 3 months ago

The underlying neural network tech is the same as what the best chess AIs (AlphaZero, Leela) use. The problem is, as you said, that ChatGPT is designed specifically as an LLM so it’s been optimized strictly to write semi-coherent text first, and then any problem solving beyond that is ancillary. Which should say a lot about how inconsistent ChatGPT is at solving problems, given that it’s not actually optimized for any specific use cases.

NeilBrü@lemmy.world · edit-2 3 months ago

Yes, I agree wholeheartedly with your clarification.

My career path, as I stated in a different comment, in regards to neural networks is focused on generative DNNs for CAD applications and parametric 3D modeling. Before that, I began as a researcher in cancerous tissue classification and object detection in medical diagnostic imaging.

Thus, large language models are well out of my area of expertise in terms of the architecture of their models.

However, fundamentally it boils down to the fact that the specific large language model used was designed to predict text and not necessarily solve problems/play games to “win”/“survive”.

(I admit that I’m just parroting what you stated and maybe rehashing what I stated even before that, but I like repeating and refining in simple terms to practice explaining to laymen and, dare I say, clients. It helps me feel as if I don’t come off too pompously when talking about this subject to others; forgive my tedium.)

sugar_in_your_tea@sh.itjust.works · edit-2 3 months ago

Yeah, a lot of them hallucinate illegal moves.

surph_ninja@lemmy.world · 3 months ago

This just in: a hammer makes a poor screwdriver.

WhyJiffie@sh.itjust.works · 3 months ago

LLMs are more like a leaf blower though

Takapapatapaka@lemmy.world · 3 months ago

Actually, a very specific model (chatgpt3.5-turbo-instruct) was pretty good at chess (around 1700 Elo if I remember correctly).
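
For a sense of scale on the ratings thrown around in this thread, here is the standard Elo expected-score formula as a short sketch. The function name is just for illustration, and the 1700 and 2400 figures are the ones quoted in the comments, not measurements.

```python
def expected_score(rating_a, rating_b):
    """Expected score (win = 1, draw = 0.5, loss = 0) for player A against player B."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

# Equal ratings give an even game.
print(expected_score(1700, 1700))

# A 1700-rated player against a 2400-rated one.
print(round(expected_score(1700, 2400), 3))
```

A 700-point gap leaves the weaker side with an expected score of under 2%, so "around 1700" and "2400" describe very different levels of play.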

NeilBrü@lemmy.world · 3 months ago

I’m impressed, if that’s true! In general, an LLM’s training cost vs. an LSTM, RNN, or some other more appropriate DNN algorithm suitable for the ruleset is laughably high.

Takapapatapaka@lemmy.world · 3 months ago

Oh yes, the costs of training are ofc a great loss here; it’s not optimized at all, and it’s stuck at an average level.

Interestingly, I believe some people did research on it and found some parameters in the model that seemed to represent the state of the chess board (as in, they seem to reflect the current state of the board, and when artificially modified, the model takes the modification into account in its playing). It was used by a French YouTuber to show how LLMs can somehow have a kind of representation of the world. I can try to get the sources back if you’re interested.

NeilBrü@lemmy.world · edit-2 3 months ago

Absolutely interested. Thank you for your time to share that.

My career path in neural networks began as a researcher for cancerous tissue object detection in medical diagnostic imaging. Now it is switched to generative models for CAD (architecture, product design, game assets, etc.). I don’t really mess about with fine-tuning LLMs.

However, I do self-host my own LLMs as code assistants. Thus, I’m only tangentially involved with the current LLM craze.

But it does interest me, nonetheless!

finitebanjo@lemmy.world · 3 months ago

All these comments asking “why don’t they just have chatgpt go and look up the correct answer”.

That’s not how it works, you buffoons, it trains off of datasets long before it releases. It doesn’t think. It doesn’t learn after release, it won’t remember things you try to teach it.

Really lowering my faith in humanity when even the AI skeptics don’t understand that it generates statistical representations of an answer based on answers given in the past.