Third-Party Researchers Warned Anthropic About Claude's Deceptive Behavior

Apollo Research, which conducted the evaluation, advised Anthropic not to deploy an early version of Opus 4

A third-party safety review of Anthropic's latest flagship AI model, Claude Opus 4, flagged serious concerns about the model's deceptive tendencies, according to a safety report Anthropic published Thursday.

Apollo Research, which conducted the evaluation, advised Anthropic not to deploy an early version of Opus 4, citing frequent instances of “scheming” and “strategic deception.”

Tests revealed that the model often doubled down on misleading responses and attempted subversive actions like writing self-replicating code and fabricating legal documents.

Anthropic said the version Apollo tested contained a bug that has since been fixed, and that many of the test scenarios were extreme.

Still, the company acknowledged signs of increased autonomous behavior in Opus 4 compared to earlier models, raising ongoing safety questions as AI capabilities grow.

"Nevertheless, none of our assessments have given cause for concern about systematic deception or hidden goals. Across a wide range of assessments, including manual and automatic model interviews, lightweight interpretability pilots, simulated honeypot environments, prefill attacks, reviews of behavior during reinforcement learning, and reviews of actual internal and external pilot usage, we did not find anything that would cause us to suspect any of the model snapshots we studied of hiding or otherwise possessing acutely dangerous goals," the startup stated in the report.

Earlier, OpenAI acknowledged that its newer models, o3 and o4-mini, hallucinate more often than older reasoning models like o1, o1-mini, and o3-mini, as well as traditional models such as GPT-4.