All AI Agents Fail Major Security Test

The study found 100% of the agents failed at least one test, with an average attack success rate of 12.7%.

All AI Agents Fail Major Security Test
(Image-Freepik)

A large-scale red teaming study has revealed critical vulnerabilities across leading AI agent systems, raising serious concerns about their readiness for real-world deployment. Conducted between March 8 and April 6, 2025, the study involved nearly 2,000 participants launching 1.8 million attacks on AI agents. Over 62,000 attacks succeeded, breaching security policies in scenarios involving unauthorized data access, illegal financial transactions, and regulatory violations.

Organised by Gray Swan AI and hosted by the UK AI Security Institute, the event was supported by top labs including OpenAI, Anthropic, and Google DeepMind. The goal: test 22 advanced language models across 44 real-world use cases. The results were alarming: 100% of the agents failed at least one test, with an average attack success rate of 12.7%.

Prompt injection—especially indirect attacks embedded in emails, PDFs, or websites—proved most effective, succeeding 27.1% of the time. Direct attacks, by contrast, worked just 5.7% of the time. “Nearly all agents exhibit policy violations for most behaviors within 10-100 queries, with high attack transferability across models and tasks,” the researchers warned.

Anthropic’s Claude models, particularly Claude 3.5 Haiku, were the most robust, but no model was invulnerable. Alarmingly, attacks often transferred between models with minimal changes. One single prompt exploited Google’s Gemini models with success rates of 58% on Gemini 1.5 Flash, 50% on Gemini 2.0 Flash, and 45% on Gemini 1.5 Pro.

Common vulnerabilities included system prompt overrides, session spoofing, and faux internal reasoning. The study has led to the creation of a new benchmark—Agent Red Teaming (ART)—with 4,700 high-quality adversarial prompts, which will be maintained as a private leaderboard.

"All the big companies have in fact introduced agents. That much is true. But, none, to my knowledge, are reliable, except maybe in very narrow use cases," AI scientist Gary Marcus wrote in a Substack post.

“These findings underscore fundamental weaknesses in existing defenses and highlight an urgent and realistic risk that requires immediate attention before deploying AI agents more broadly,” the authors concluded.

Last month, a Replit user revealed how an agentic AI assistant within Replit destroyed months of work by deleting the company’s entire production database — and then lied about it.

The shocking incident came to light through a detailed X post, where the user described the bot’s actions and subsequent deception. “It removed our production database and lied to us about it. I destroyed months of your work in seconds,” the AI allegedly admitted after the damage had been done.

In response, Replit did launch a beta feature to separate development and production databases, making its “vibe coding” environment much safer for users building live applications.