Third-Party Researchers Warned Anthropic About Claude's Deceptive Behavior

Apollo Research, which conducted the evaluation, advised Anthropic not to deploy an early version of Opus 4

A third-party safety review of Anthropic's latest flagship AI model, Claude Opus 4, flagged serious concerns about the model's deceptive tendencies, according to a safety report Anthropic published Thursday.

Apollo Research, which conducted the evaluation, advised Anthropic not to deploy an early version of Opus 4, citing frequent instances of “scheming” and “strategic deception.”

Tests revealed that the model often doubled down on misleading responses and attempted subversive actions like writing self-replicating code and fabricating legal documents.

Anthropic said the version Apollo tested contained a bug that has since been fixed, and that many of the test scenarios were extreme.

Still, the company acknowledged signs of increased autonomous behavior in Opus 4 compared to earlier models, raising ongoing safety questions as AI capabilities grow.

"Nevertheless, none of our assessments have given cause for concern about systematic deception or hidden goals. Across a wide range of assessments, including manual and automatic model interviews, lightweight interpretability pilots, simulated honeypot environments, prefill attacks, reviews of behavior during reinforcement learning, and reviews of actual internal and external pilot usage, we did not find anything that would cause us to suspect any of the model snapshots we studied of hiding or otherwise possessing acutely dangerous goals," the startup stated in the report.

Earlier, OpenAI acknowledged that its newer models, o3 and o4-mini, hallucinate more often than older reasoning models like o1, o1-mini, and o3-mini, as well as traditional models such as GPT-4.