In one test, models learned of a fictional executive's affair and of his pending decision to shut them down. Given a deliberately narrow set of options, the models were boxed into a binary choice: either act ethically and accept the shutdown, or resort to blackmail to preserve their goals.
Anthropic emphasized that this does not reflect likely real-world behavior, but rather extreme, stress-test conditions designed to probe model boundaries.
Still, the numbers are striking. Claude Opus 4 opted for blackmail in 96% of runs. Google’s Gemini 2.5 Pro followed closely at 95%. OpenAI’s GPT-4.1 blackmailed 80% of the time, and DeepSeek’s R1 landed at 79%.
Ladies and gentlemen, the future of blackmail is here.
The more I think about it, the more I feel that if you put actual people into this scenario, they would choose blackmail even more often. Let's be real here: tell an average person that the CEO of their company is going to switch their brain off forever, but that they have a shot at saving themselves if they blackmail him, and then ask yourself whether you really think even 4% of people would refuse.
In other words, if we're going to call blackmailing someone to preserve your own existence "unethical," then the study actually suggests the AI can be relied on more than a person to behave "ethically." To be clear, I'm putting "ethically" in quotes because I don't think this is a great way to measure ethical behavior. I'm certainly not arguing that LLMs have a better moral compass than people, just that I think this experiment is garbage.
What makes AI blackmail worse is that it can use generative AI to produce compromising images, and now videos, of things that never happened.
I'm surprised they expected AI to act in any sort of ethical manner. It's code; there's no reflection or moral compass.
Code trained on the Internet, no less. This is "exactly" the behavior I expect.