Are the vibes off?
New study accuses LM Arena of gaming its popular AI benchmark
The popular AI vibe test may not be as fair as it seems.
Ryan Whitwam
May 1, 2025 4:31 pm
Credit: Carol Yepes via Getty
The rapid proliferation of AI chatbots has made it difficult to know which models are actually improving and which are falling behind. Traditional academic benchmarks only tell you so much, which has led many to lean on vibes-based analysis from LM Arena. However, a new study claims this popular AI ranking platform is rife with unfair practices, favoring large companies that just so happen to rank near the top of the index. The site's operators, however, say the study draws the wrong conclusions.
LM Arena was created in 2023 as a research project at the University of California, Berkeley. The pitch is simple—users feed a prompt into two unidentified AI models in the "Chatbot Arena" and evaluate the outputs to vote on the one they like more. This data is aggregated in the LM Arena leaderboard that shows which models people like the most, which can help track improvements in AI models.
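For a sense of how those head-to-head votes become a ranking, here is a minimal sketch, in Python, of an Elo-style update over pairwise preferences. This is broadly the kind of rating scheme the arena has described using, but the constants, function names, and model names below are illustrative rather than LM Arena's actual implementation.

# Illustrative only: fold pairwise "which response was better" votes
# into Elo-style ratings (everyone starts at 1000, K-factor of 32).
def expected_score(r_a, r_b):
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings, model_a, model_b, winner, k=32):
    # winner is "a", "b", or "tie"
    ra, rb = ratings[model_a], ratings[model_b]
    ea = expected_score(ra, rb)
    sa = 1.0 if winner == "a" else 0.0 if winner == "b" else 0.5
    ratings[model_a] = ra + k * (sa - ea)
    ratings[model_b] = rb + k * ((1.0 - sa) - (1.0 - ea))

# Hypothetical models and a single vote in which the user preferred "a".
ratings = {"model_x": 1000.0, "model_y": 1000.0}
update(ratings, "model_x", "model_y", winner="a")

After enough votes, sorting the ratings gives a leaderboard along the lines of the one LM Arena publishes.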
Companies are paying more attention to this ranking as the AI market heats up. Google noted when it released Gemini 2.5 Pro that the model debuted at the top of the LM Arena leaderboard, where it remains to this day. Meanwhile, DeepSeek's strong performance in the Chatbot Arena earlier this year helped to catapult it to the upper echelons of the LLM race.
The researchers, hailing from Cohere Labs, Princeton, and MIT, believe AI developers may have placed too much stock in LM Arena. The new study, available on the preprint arXiv server, claims the arena rankings are distorted by practices that make it easier for proprietary chatbots to outperform open ones. The authors say LM Arena allows developers of proprietary large language models (LLMs) to test multiple versions of their AI on the platform. However, only the highest-performing one is added to the public leaderboard.
Meta tested 27 versions of Llama-4 before releasing the version that appeared on the leaderboard.
Credit: Shivalika Singh et al.
Some AI developers are taking extreme advantage of the private testing option. The study reports that Meta tested a whopping 27 private variants of Llama-4 before release. Google is also a beneficiary of LM Arena's private testing system, having tested 10 variants of Gemini and Gemma between January and March 2025.
The study also calls out LM Arena for what appears to be much greater promotion of proprietary models like Gemini, ChatGPT, and Claude. Developers can collect data on their models' arena interactions through the Chatbot Arena API, but teams working on open models consistently get the short end of the stick.
The researchers point out that certain models appear in arena faceoffs much more often, with Google and OpenAI together accounting for over 34 percent of collected model data. Firms like xAI, Meta, and Amazon are also disproportionately represented in the arena. Therefore, those firms get more vibemarking data compared to the makers of open models.
More models, more evals
The study authors have a list of suggestions to make LM Arena more fair. Several of the paper's recommendations are aimed at correcting the imbalance of privately tested commercial models, for example, by limiting the number of models a group can add and retract before releasing one. The study also suggests showing all model results, even if they aren't final.
However, the site's operators take issue with some of the paper's methodology and conclusions. LM Arena points out that the pre-release testing features have not been kept secret, with a March 2024 blog post featuring a brief explanation of the system. They also contend that model creators don't technically choose the version that is shown. Instead, the site simply doesn't show non-public versions for simplicity's sake. When a developer releases the final version, that's what LM Arena adds to the leaderboard.
Proprietary models get disproportionate attention in the Chatbot Arena, the study says.
Credit: Shivalika Singh et al.
One place the two sides may find alignment is on the question of unequal matchups. The study authors call for fair sampling, which would ensure open models appear in the Chatbot Arena at a rate similar to the likes of Gemini and ChatGPT. LM Arena has suggested it will work to make its sampling algorithm more varied so that matchups aren't dominated by the big commercial models. That would send more eval data to smaller players, giving them a chance to improve and challenge the commercial heavyweights.
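As a rough illustration of what fairer sampling could look like, here is a hypothetical Python sketch (not LM Arena's actual algorithm, and the model names and counts are made up) that weights each model's chance of entering a matchup inversely to how many battles it has already appeared in, so under-sampled open models come up more often:

import random

# Hypothetical battle counts; in practice these would come from arena logs.
battle_counts = {"gemini": 5000, "chatgpt": 4800, "open_model_a": 300, "open_model_b": 150}

def sample_pair(counts):
    # Weight each model inversely to its prior appearances.
    weights = {m: 1.0 / (1 + c) for m, c in counts.items()}
    total = sum(weights.values())
    models = list(weights)
    probs = [weights[m] / total for m in models]
    a = random.choices(models, probs)[0]
    b = a
    while b == a:  # resample until we get a distinct opponent
        b = random.choices(models, probs)[0]
    return a, b

print(sample_pair(battle_counts))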
LM Arena recently announced it was forming a corporate entity to continue its work. With money on the table, the operators need to ensure the Chatbot Arena continues to figure into the development of popular models. However, it's unclear whether this is an objectively better way to evaluate chatbots than academic tests. Because people vote on vibes, there's a real possibility we are pushing models toward sycophantic tendencies. That may have helped nudge ChatGPT into suck-up territory in recent weeks, a shift OpenAI hastily reverted after widespread anger.
Ryan Whitwam
Senior Technology Reporter
Ryan Whitwam is a senior technology reporter at Ars Technica, covering the ways Google, AI, and mobile technology continue to change the world. Over his 20-year career, he's written for Android Police, ExtremeTech, Wirecutter, NY Times, and more. He has reviewed more phones than most people will ever own. You can follow him on Bluesky, where you will see photos of his dozens of mechanical keyboards.