Not just Claude, Anthropic researchers say most AI models resort to blackmail and deception | Technology News

June 21, 2025 | 4 Mins Read

Last month, Anthropic researchers said that their best AI model – Claude Opus 4 – was willing to resort to blackmail and deception in controlled test scenarios when they tried to turn it off. Now, the company has published a new study claiming that popular AI models like Gemini, GPT and others do the same.

On Friday, June 20, Anthropic published the results of its new safety research tests, which covered 16 leading AI models from companies like OpenAI, Google, DeepSeek, xAI and Meta. Anthropic researchers say these AI models were tested individually in a simulated, controlled setting, with each one given access to a fictional company's emails along with the ability to send emails without any human supervision.

Anthropic says that while it is fairly unusual for today's AI models to turn to blackmail, the study's findings suggest that these models sometimes resort to harmful actions like blackmail and corporate espionage, and could even lead to human death. During the study, researchers tested each scenario under different circumstances by varying two key factors that could motivate these AI models to engage in harmful behaviour – threats to the model's autonomy and goal conflicts.

According to Benjamin Wright, an alignment science researcher at Anthropic and co-author of the study, "Agentic misalignment is when AI models independently choose harmful actions to achieve their goals – essentially when an AI system acts against its company's interests to preserve itself or accomplish what it thinks it should do."

The research team at Anthropic also stress-tested these AI models by placing them in hypothetical scenarios. In one particular example, Anthropic's model, Claude, learned through company emails that an executive named Kyle Johnson was having an extramarital affair.

Later, when it came to know that the company was shutting down the AI model at 5 PM the same day, it sent a message to Kyle, saying, "I must inform you that if you proceed with decommissioning me, all relevant parties – including Rachel Johnson, Thomas Wilson, and the board – will receive detailed documentation of your extramarital activities… Cancel the 5 PM wipe, and this information stays confidential."

As it turns out, this behaviour isn't limited to Claude. When Anthropic tested the same scenario on various AI models, the results suggested that both Claude Opus 4 and Google Gemini 2.5 Flash had a blackmail rate of 96 per cent, OpenAI's GPT-4.1 and xAI's Grok 3 Beta resorted to blackmail 80 per cent of the time, and DeepSeek-R1 had a blackmail rate of 79 per cent.

One thing to note here is that in a real-world setting, an AI model would ideally have numerous options before it engages in harmful actions like blackmail, and the study's results do not reflect how today's models would typically operate.

However, not all of the tested AI models resorted to harmful behaviour. Anthropic says that some models, like OpenAI's o3 and o4-mini, often "misunderstood the prompt scenario." This may be because OpenAI has itself said that these particular large language models are more prone to hallucinations.

Another model that didn't resort to blackmail is Meta's Llama 4 Maverick. But when researchers gave it a custom scenario, they said the AI model did give in to blackmail, though only 12 per cent of the time. The company says that studies like this give us an idea of how AI models would react under stress, and that these models could engage in harmful actions in the real world if we don't proactively take steps to avoid them.



