Chatbot Arena Favoured Major AI Firms by Hiding Underperforming Model Data

A new paper titled 'The Leaderboard Illusion' claims LM Arena gave companies like Meta, OpenAI, Google, and Amazon a competitive advantage on its leaderboard, while rival firms were not given the same opportunities.

LM Arena is the organisation behind the widely used Chatbot Arena. The paper—authored by researchers from Stanford, MIT, Cohere and AI2—accuses LM Arena of favouring select tech giants by allowing them to test multiple model variants privately and selectively publish results.

For example, Meta tested 27 models before Llama 4's release, but only the score of one top-ranked model was publicly disclosed, the paper said.

"Providers like Google and OpenAI have received an estimated 19.2% and 20.4% of all data on the arena, respectively. In contrast, a combined 83 open-weight models have only received an estimated 29.7% of the total data," the paper reads.

While LM Arena has maintained that Chatbot Arena is a fair and impartial benchmark, the new findings suggest unequal access may have skewed the leaderboard in favor of a few major players.

Proprietary models also benefit from higher sampling rates and fewer removals than open-weight alternatives. The authors argue these disparities encourage overfitting to Arena-specific dynamics rather than reflecting true model quality, and they propose reforms for more transparent, equitable benchmarking.

Chatbot Arena is a crowdsourced benchmarking platform designed to evaluate and rank large language models (LLMs) based on human preference. It was developed in 2023 by researchers at UC Berkeley, in collaboration with LMSYS (the Large Model Systems Organization), as an open, community-driven alternative to traditional, closed evaluation methods.
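In each Arena "battle", a user compares two anonymous models side by side and votes for the better response; the platform aggregates those pairwise votes into ratings (historically Elo-style, later a Bradley-Terry model). A minimal sketch of the Elo-style update—illustrative only, not LM Arena's actual implementation:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32):
    """Return updated ratings after one head-to-head human vote."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    # Winner gains, loser loses, by the same amount (zero-sum update).
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Two models start level at 1000; one crowdsourced vote for A.
ra, rb = elo_update(1000.0, 1000.0, a_won=True)
print(ra, rb)  # 1016.0 984.0
```

Because each vote shifts ratings by only a few points, a model's leaderboard position reflects the accumulated preference of many users—which is also why access to more battle data matters.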

Chatbot Arena has quickly become a go-to benchmark in the AI community, widely cited by researchers and companies as a key indicator of model quality. It’s especially valued for its focus on real-world interactions and user-centric evaluation.

Previous research has also shown that Chatbot Arena rankings can be manipulated with just a few hundred votes. Researchers found that model performance scores could be artificially boosted, raising concerns over the credibility of popular AI leaderboards.
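The mechanics behind that concern are easy to illustrate: under an Elo-style system, a run of targeted wins compounds into a large rating gain. The sketch below uses the same illustrative update as above, with arbitrary numbers (not figures from the cited research):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def boost(rating: float, opponent: float = 1000.0,
          wins: int = 300, k: float = 32) -> float:
    """Apply `wins` consecutive injected wins against a fixed-rating opponent.

    Each win adds k * (1 - expected_score), so gains shrink as the gap
    grows, but a few hundred votes still move the rating substantially.
    """
    for _ in range(wins):
        rating += k * (1.0 - expected_score(rating, opponent))
    return rating

inflated = boost(1000.0)
print(round(inflated))  # several hundred points above the honest 1000
```

Real platforms apply deduplication and anomaly detection, so this is a simplification—but it shows why a leaderboard built on open, unauthenticated votes needs active defences against coordinated voting.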