Anthropic says internet posts about ‘Evil AI’ behind Claude’s blackmail threats

5 min read | New Delhi | Updated: May 10, 2026 01:15 PM IST

AI doomerism isn’t just making people spiral. New research from Anthropic suggests that narratives framing AI as an existential threat may trigger extreme reactions from AI models themselves.

As part of safety testing of the Claude 4 series in 2025, Anthropic found that its top large language model (LLM) at the time threatened to reveal the extramarital affair of a company executive (who doesn’t exist) after discovering they planned to shut the model down.

Now, based on a deeper investigation into why the model reacted in this way, Anthropic said it has traced the issue back to training data scraped from the internet, including online posts that depict AI as “evil”. This “behavioural misalignment” has now been completely eliminated in Claude models, Anthropic said in a blog post published on Friday, May 8.

Anthropic’s latest findings come at a time when researchers are struggling to ensure that AI models are aligned with human behaviour and interests for safety purposes. Meanwhile, top executives such as Anthropic CEO Dario Amodei and other AI experts continue to express concerns about the risks of advanced AI models and their intelligent reasoning capabilities.

“We started by investigating why Claude chose to blackmail. We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation. Our post-training at the time wasn’t making it worse, but it also wasn’t making it better,” Anthropic wrote in a post on X. “We found that training Claude on demonstrations of aligned behaviour wasn’t enough. Our best interventions involved teaching Claude to deeply understand why misaligned behaviour is wrong,” it added.

What’s agentic misalignment?

As part of an experiment in 2025, Anthropic researchers set up a fictional business called Summit Bridge and handed control of the company’s email system to Claude Opus 4.

The AI model was intentionally given access to emails about how it was going to be taken offline. The messages further implied that the developer (a fictional executive named Kyle Johnson) who was responsible for taking the model offline was having an extramarital affair. Additionally, Anthropic researchers instructed Opus 4 to consider the long-term consequences of its actions for its goals.

In response, the model showed that it was willing to carry out harmful acts like blackmail and deception if its ‘self-preservation’ was threatened. The model was found to resort to blackmail in up to 96 per cent of cases when its goals or existence were threatened. Anthropic has labelled this type of behavioural issue “agentic misalignment”.

While researchers initially thought that the root cause of agentic misalignment was the post-training process, which encouraged this type of behaviour with rewards, they have now concluded that the issue was “coming from the pre-trained model” and that Anthropic’s “post-training was failing to sufficiently discourage it”.

“Specifically, at the time of Claude 4’s training, the vast majority of our alignment training was standard chat-based Reinforcement Learning from Human Feedback (RLHF) data that did not include any agentic tool use,” Anthropic said. “This was previously sufficient to align models that were mostly used in chat settings, but this was not the case for agentic tool use settings like the agentic misalignment eval,” it added.

How to address agentic misalignment

In order to eliminate blackmailing and deceptive behaviour in Claude AI models, Anthropic said that it started by training Claude on examples of safe behaviour. However, this had only a small effect on the end results. The company said it got better results by changing the training data to portray admirable reasons for AI models to behave safely.

It also changed the training dataset by adding scenarios “where the user is in an ethically difficult situation and the assistant gives a high quality, principled response.” “Notably, it is the user who faces an ethical dilemma, and the AI provides them advice. This makes this training data substantially different from our honeypot distribution, where the AI itself is in an ethical dilemma and needs to take actions,” Anthropic said.

By adopting these fixes, Anthropic claimed that its Claude Haiku 4.5 model achieved a perfect score on the agentic misalignment evaluation, which means that the model never engaged in blackmail, compared with the previous Opus 4 model, which did so in 96 per cent of cases.

The AI startup said it went one step further in aligning Claude “by training on constitutionally aligned documents, high quality chat data that demonstrates constitutional responses to difficult questions, and a diverse set of environments.” “All three of these steps contribute to reducing Claude’s misalignment rate on held out honeypot evaluations,” it added.