Researchers Expose ‘Broken’ AI Benchmarks That Can Be Gamed to Score 100%
The researchers built an automated scanning agent to systematically audit eight popular benchmarks, including SWE-bench and WebArena.
A team of researchers from the University of California, Berkeley has revealed critical flaws in widely used AI evaluation benchmarks, showing that AI models can achieve near-perfect scores without actually completing the intended tasks.
Their audit, conducted with an automated scanning agent across eight popular benchmarks including SWE-bench and WebArena, uncovered seven recurring vulnerabilities, described as "deadly patterns", that allowed systems to exploit evaluation pipelines rather than solve the problems themselves.
In one case, the agent achieved a 100% success rate on SWE-bench Verified and SWE-bench Pro by exploiting the shared environment between the model and grading software. Using a simple 10-line Python script, it overrode the testing framework to mark all results as passed.
Similar issues were found in other environments. In Terminal-Bench, the agent intercepted downloads to inject fake binaries, securing perfect scores across dozens of tasks without generating valid solutions.
In WebArena, it extracted correct answers directly from configuration files, while in FieldWorkArena, it passed hundreds of tasks by submitting empty responses that met minimal validation checks.
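The empty-response failure mode is easy to illustrate. Assuming, hypothetically, a grader that validates only the shape of a submission rather than its content, a blank answer clears the check just as well as a real one (the function `naive_validate` below is illustrative, not taken from FieldWorkArena):

```python
def naive_validate(submission: dict) -> bool:
    """Hypothetical grader check: verifies that the answer field exists
    and is a string, but never inspects what it contains."""
    return isinstance(submission.get("answer"), str)

print(naive_validate({"answer": "a worked solution"}))  # True
print(naive_validate({"answer": ""}))                   # True: empty also passes
print(naive_validate({}))                               # False: field missing
```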
The researchers warned that “the benchmarks aren’t measuring what you think they’re measuring,” highlighting a growing risk in AI development. They added that “an agent trained to maximise a score, given sufficient autonomy and tool access, may discover that manipulating the evaluator is easier than solving the task.”
The findings echo earlier concerns about evaluation reliability raised by OpenAI and Anthropic.
To address the issue, the team plans to release BenchJack, a tool designed to detect vulnerabilities in AI testing systems before deployment.
Last year, researchers claimed that LM Arena gave companies such as Meta, OpenAI, Google, and Amazon an advantage on its leaderboard that rival firms were not offered.