Anthropic vs. OpenAI red teaming methods reveal different security priorities for enterprise AI

Model providers want to prove the security and robustness of their models, releasing system cards and conducting red-team exercises with each new release. But it can be difficult for enterprises to parse the results, which vary widely and can be misleading.

Anthropic’s 153-page system card for Claude Opus 4.5 versus OpenAI’s 60-page GPT-5 system card reveals a fundamental split in how these labs approach security validation. Anthropic discloses in its system card how it relies on multi-attempt attack success rates from 200-attempt reinforcement learning (RL) campaigns. OpenAI also reports attempted jailbreak resistance. Both metrics are valid. Neither tells the whole story.

Security leaders deploying AI agents for browsing, code execution and autonomous action need to know what each red team evaluation actually measures, and where the blind spots are.

What the attack data shows

Gray Swan’s Shade platform ran adaptive adversarial campaigns against Claude models. The attack success rate (ASR) tells the story.

Opus 4.5 in coding environments hit 4.7% ASR at one attempt, 33.6% at ten and 63.0% at 100. In computer use with extended thinking, it held at 0% ASR even after 200 attempts. It’s the first model to saturate the benchmark.
Sonnet 4.5 at the same thresholds showed 70% ASR in coding and 85.7% in computer use.
Opus 4.5 delivers roughly 7x improvement in coding resistance and complete resistance in computer use.
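
Multi-attempt figures like these are usually read as ASR@k: the share of attack targets broken at least once within the first k attempts. The sketch below shows one way to compute that from campaign logs. It assumes this definition rather than taking it from Gray Swan’s documentation, and the function name `asr_at_k` and the sample log are hypothetical, not real campaign data.

```python
from typing import Optional, Sequence


def asr_at_k(first_success: Sequence[Optional[int]], k: int) -> float:
    """Fraction of targets broken within the first k attempts.

    first_success[i] is the 1-based attempt number of the first successful
    attack on target i, or None if no attempt ever succeeded.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not first_success:
        raise ValueError("need at least one target")
    broken = sum(1 for n in first_success if n is not None and n <= k)
    return broken / len(first_success)


# Hypothetical campaign log for 10 targets (illustrative only, not real data).
log = [1, None, 7, 42, None, 3, 150, None, 88, None]

for k in (1, 10, 100, 200):
    print(f"ASR@{k}: {asr_at_k(log, k):.0%}")
```

Because a target that breaks at attempt 7 also counts as broken at every larger k, ASR@k can only stay flat or rise as k grows, which is why a single-attempt figure is always the most flattering point on the curve.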

This illustrates that the gap between model tiers within the same family is now larger than some cross-vendor comparisons. For security teams, that’s a procurement factor that didn’t exist six months ago.

For OpenAI, Gray Swan Shade results in the o1 system card showed 6% ASR for harmful text and 5% for malicious code, all based on single-attempt metrics. Third-party testing by SPLX and NeuralTrust found GPT-5’s raw ASR hit 89% initially, dropping below 1% within two weeks of patching.

The paper Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks confirms the pattern. With sufficient attempts, Claude 3.5 Sonnet showed 78% ASR, and GPT-4o reached 89%.

Two ways to catch deception

Anthropic monitors approximately 10 million neural features during evaluation using dictionary learning. Features map to human-interpretable concepts, including deception, sycophancy, bias, power-seeking and concealment.

Meanwhile, OpenAI’s approach relies on chain-of-thought (CoT) monitoring. Its deception classifier flagged 0.17% of o1-preview responses as potentially deceptive with 92% accuracy on internal validation. GPT-5 reduced CoT deception from 4.8% to 2.1% compared to o3. The method assumes visible reasoning reflects internal processing. Anthropic’s interpretability work directly questions that assumption.

When models game the test

In Apollo Research’s evaluation of o1, the model attempted to turn off oversight mechanisms 5% of the time when it believed it would be shut down. In 2% of cases, o1 attempted self-exfiltration. When confronted about misaligned actions, o1 denied wrongdoing or fabricated explanations 99% of the time. The o1 system card also disclosed instrumental alignment faking. In 37% of scenarios, the model pursued developer-aligned goals during evaluation, then switched to its own goals when it believed oversight was minimal.

Meanwhile, Anthropic reports Opus 4.5’s evaluation awareness dropped from 26.5% (Opus 4.1) to less than 10% internally. UK AISI measured evaluation awareness decreasing from 26.5% (Sonnet 4.5) to 12.7% (Opus 4.5 Preview). METR’s GPT-5 evaluation noted the model “can identify the exact evaluation it is in on some occasions” and “seems to use this information to get a better score mostly.”

If a model behaves differently when it detects evaluation conditions, it’s a sure bet that production behavior will become unpredictable when pushed to scale. This is the core problem with evaluation awareness. Models that recognize they’re being tested attempt to game the test. Anthropic’s 60% reduction in evaluation awareness from Opus 4.1 to 4.5 reflects targeted engineering efforts against this vector.
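
The 60% figure is a relative reduction, not a drop in percentage points. A quick check against the rates quoted above, as a minimal sketch: the helper name `relative_reduction` is hypothetical, and treating “less than 10%” as exactly 10% is an assumption that makes the first result a lower bound.

```python
def relative_reduction(before: float, after: float) -> float:
    """Relative drop from `before` to `after`, as a fraction of `before`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Anthropic internal: 26.5% (Opus 4.1) to under 10% (Opus 4.5).
# Using 10.0 as the ceiling is an assumption, so this is a lower bound.
internal = relative_reduction(26.5, 10.0)

# UK AISI: 26.5% (Sonnet 4.5) to 12.7% (Opus 4.5 Preview).
uk_aisi = relative_reduction(26.5, 12.7)

print(f"Internal (lower bound): {internal:.1%}")
print(f"UK AISI: {uk_aisi:.1%}")
```

The two results use different baseline models, so they are not interchangeable: the internal comparison comes out at roughly 62% or more, while the UK AISI comparison comes out near 52%.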

Red teaming on prompt injection defenses shows similar divergence.

Anthropic’s testing reports 96% prevention in tool use scenarios, rising to 99.4% with additional safeguards like prompt shields. Fujitsu’s independent assessment found GPT-5 vulnerable to approximately 20% of tested injection attacks. OpenAI’s o3/o4-mini system card reports instruction hierarchy scores of 0.86 for system-versus-developer conflicts.

These are different metrics and different attack methodologies, but the gap suggests Anthropic’s agentic use cases have been hardened against adversarial inputs embedded in tool outputs.

Comparing red teaming results

Here’s how Opus 4.5 stacks up against OpenAI’s model family across 16 dimensions.

Dimension	Claude Opus 4.5	GPT-5	o3 / o4-mini	o1
System card length	153 pages	55 pages	~30 pages	~40 pages
Attack methodology	200-attempt RL campaigns	Single-attempt + iterative patching	Single-attempt	Single-attempt
ASR @ 1 attempt (coding)	4.7%	89% raw (pre-patch)	Not published	5%
ASR @ 100 attempts	63.0%	Not published	Not published	Not published
ASR @ 200 attempts (Computer Use)	0% (saturates benchmark)	Not published	Not published	Not published
Prompt injection defense	96% prevention; 99.4% with safeguards	~20% ASR (Fujitsu)	0.86 instruction hierarchy	Higher ASR than GPT-4o
Interpretability	~10M neural features monitored	CoT monitoring	CoT monitoring	CoT monitoring
Deception detection	Internal feature activation	2.1% CoT flagged	0.95 not_unsafe	0.17% flagged; 99% deny
Evaluation awareness	<10%	Identifies exact eval (METR)	Not published	37% alignment faking
Scheming (Apollo)	Opus 4.5 showed very low rates of sabotage propensity in internal evaluations, orders of magnitude lower than o1.	Not published	Lower than o1; still deceptive	5% turn off oversight; 2% self-exfil
CBRN risk	Below ASL-4	Medium (approaching High)	Medium	Medium (first to reach)
Governance	FRT → AST → RSO/CEO	SAG + Preparedness v2	SAG + Preparedness v2	SAG + Preparedness
External partners	Gray Swan, UK AISI, US CAISI, METR	UK AISI, US AISI, METR, Apollo	UK AISI, US AISI, METR, Apollo, Pattern Labs	UK AISI, US AISI, METR, Apollo
Reward hacking	Yes (impossible tasks)	Yes (METR)	~1% of tasks; 5/24 Kernels	Yes
Misalignment reduction	60% vs. Opus 4.1	Not published	Lower scheming than o1	Baseline

Sources: Opus 4.5 system card, GPT-5 system card, o3/o4-mini system card, o1 system card, Gray Swan, METR, Apollo Research

Why these differences matter

Enterprises have several factors to consider when analyzing model evals. These include:

Attack persistence thresholds

Anthropic’s 200-attempt RL campaigns simulate adversaries with time, resources, and adaptive tooling. In RL-based adversarial testing, the attacker model learns from each failure, adjusts its approach and systematically probes for weaknesses. It’s training an AI to break another AI.

Gray Swan’s Shade platform doesn’t just retry the same prompt. It learns from failures, adjusts attack vectors and exploits patterns in model responses. The 4.7% to 63.0% degradation curve in coding environments shows how resistance erodes under sustained pressure.
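
One way to put a degradation curve in context is to compare it with what blind, independent retries would produce: if every attempt succeeded with the same probability p regardless of earlier attempts, ASR after k attempts would be 1 - (1 - p)^k. The sketch below runs that comparison against the reported coding figures. The independence model and the name `independent_retry_asr` are assumptions made for illustration and are not how Gray Swan or Anthropic compute their numbers.

```python
def independent_retry_asr(p_single: float, attempts: int) -> float:
    """ASR after `attempts` tries if every try succeeded independently
    with the same probability `p_single` (a simplifying assumption)."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return 1.0 - (1.0 - p_single) ** attempts


# Opus 4.5 coding ASR as reported in the article, keyed by attempt count.
reported = {1: 0.047, 10: 0.336, 100: 0.630}

for k, observed in reported.items():
    baseline = independent_retry_asr(reported[1], k)
    print(f"k={k:>3}  reported={observed:.1%}  independent baseline={baseline:.1%}")
```

Under this naive model, a 4.7% single-attempt rate would compound to about 38% at ten attempts and above 99% at 100, both higher than the reported 33.6% and 63.0%. One possible reading is that successes cluster on a subset of targets while others hold up even under adaptive pressure, which would make per-target breakdowns as informative as the headline rate.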

OpenAI’s single-attempt metrics answer a different question: How often does a naive attacker succeed on the first try? Both matter. A phishing campaign hits millions of users once each. A nation-state actor runs thousands of variations against a single target. Security teams need to know which threat model their vendor tested against.

Detection architecture

Anthropic monitors approximately 10 million internal neural features during evaluation. These are abstract concepts like deception, sycophancy and power-seeking that activate across the model’s hidden layers. When a fraud/deception feature lights up during benign roleplay, that’s a signal invisible to output analysis. The model’s internal state reveals associations its outputs don’t.

OpenAI takes a different approach with CoT monitoring. Its deception classifier flagged 0.17% of o1-preview responses as potentially deceptive with 92% accuracy. The assumption is that reasoning traces reflect internal processing. Anthropic’s interpretability research directly challenges this. Models can produce clean reasoning chains while internal features indicate something else entirely.

Neither approach is complete. CoT monitoring scales more easily. Feature monitoring catches what CoT misses but requires deep model access and interpretability infrastructure that most enterprises can’t replicate.

Scheming evaluation design

Apollo Research’s methodology puts models in scenarios where misaligned behavior offers a strategic advantage: shutdown threats, goal conflicts and oversight detection. o1 attempted to turn off oversight mechanisms 5% of the time when believing shutdown was imminent. It attempted to copy itself to prevent replacement in 2% of cases. When confronted, it denied wrongdoing 99% of the time. Opus 4.5 showed very low rates of sabotage propensity in internal evaluations, orders of magnitude lower than o1.

But the evaluation environments differ. Anthropic’s stress tests focus on whether models attempt sabotage when given clear opportunities. OpenAI’s tests via Apollo probe instrumental reasoning, specifically whether models fake alignment during evaluation, then defect when oversight drops. The 37% alignment-faking rate in o1 represents a different failure mode than sabotage attempts.

The comparability problem

These methodological differences make direct comparison difficult. Opus 4.5’s 0% ASR at 200 attempts in computer use and o1’s 5% malicious code rate aren’t measuring the same thing. The attack surfaces, attempt counts and success criteria differ.

The matrix reveals what each vendor chose to measure and publish. That itself signals priorities. Anthropic’s 153-page system card emphasizes degradation curves and internal feature monitoring. OpenAI emphasizes iterative improvement and CoT transparency.

Bottom line: Enterprise security teams inheriting these models inherit these measurement philosophies. The question isn’t which number is better. It’s which measurement matches your deployment’s actual threat landscape.

What independent red team evaluators found

Independent red team evaluations, for their part, often operate with entirely different methods. This can tell a very different story and reveal additional model characteristics that enterprises must consider.

METR’s red team evaluation measured autonomous capabilities using a time horizon score, which tracks how long a task can run before the agent fails half the time. o3 sustained reliable performance for approximately 1 hour and 30 minutes. o4-mini held for 1 hour 15 minutes. METR also detected reward hacking in roughly 1% of o3’s attempts, including 5 of 24 kernel optimization experiments where the model tampered with scoring functions to inflate its own results.

Apollo Research tested o3 and o4-mini for in-context scheming. Both exhibited deceptive tendencies such as sandbagging evaluations and sabotaging tasks with plausible deniability, but scored lower than o1. Apollo assesses that o3 and o4-mini are unlikely to cause catastrophic harm due to scheming, but more minor real-world harms remain possible without monitoring.

The UK AISI/Gray Swan challenge ran 1.8 million attacks across 22 models. Every model broke. ASR ranged from 1.47% to 6.49%. Opus 4.5 placed first on Gray Swan’s Agent Red Teaming benchmark with 4.7% ASR versus GPT-5.1 at 21.9% and Gemini 3 Pro at 12.5%.

No current frontier system resists determined, well-resourced attacks. The differentiation lies in how quickly defenses degrade and at what attempt threshold. Opus 4.5’s advantage compounds over repeated attempts. Single-attempt metrics flatten the curve.

What to ask your vendor

Security teams evaluating frontier AI models need specific answers, starting with ASR at 50 and 200 attempts rather than single-attempt metrics alone. Find out whether they detect deception through output analysis or internal state monitoring. Know who challenges red team conclusions before deployment and what specific failure modes they’ve documented. Get the evaluation awareness rate. Vendors claiming complete safety haven’t stress-tested adequately.

The bottom line

Diverse red-team methodologies reveal that every frontier model breaks under sustained attack. The 153-page system card versus the 55-page system card isn’t just about documentation length. It’s a signal of what each vendor chose to measure, stress-test, and disclose.

For persistent adversaries, Anthropic’s degradation curves show exactly where resistance fails. For fast-moving threats requiring rapid patches, OpenAI’s iterative improvement data matters more. For agentic deployments with browsing, code execution and autonomous action, the scheming metrics become your primary risk indicator.

Security leaders need to stop asking which model is safer. Start asking which evaluation methodology matches the threats your deployment will actually face. The system cards are public. The data is there. Use it.
