Robotics Is Still Waiting for Its Own Chatbot Arena
A new crowd-judged leaderboard for robot policies is starting to fill the gap, but the field is still learning what it would take to measure progress toward general-purpose robots.
The Chatbot Arena, the volunteer-judged comparison where users pit language models against each other and pick the better answer, has become the de facto live scoreboard for AI chatbots. Robotics has no equivalent. That gap is becoming a story in its own right, because the first attempts to close it are revealing exactly why measuring progress toward general-purpose robots is so much harder than measuring progress in language.
In a June 13 analysis on the 'It Can Think!' Substack, robotics researcher Chris Paxton lays out what those first experiments are teaching the field. The piece, titled 'What Do Robotics Leaderboards Tell Us About The State of Robot Learning?', treats evaluation as infrastructure. It catalogs what already exists, what is missing, and what would have to be true for the field to converge on a credible live measure of progress toward embodied general intelligence, the shorthand researchers use for general-purpose robots that can do real-world tasks.
Two structural blockers show up over and over. First, running a robotics benchmark is expensive in a way that text evaluation is not. A language-model arena can rerun the same prompt against two models in seconds; a robot benchmark typically needs physical hardware, controlled resets, or large-scale simulation, all of which cost money and time. Second, a benchmark only works if the community uses it, and community buy-in depends on the task being hard enough to be informative, easy enough that outside teams can attempt it, and realistic enough that beating the benchmark says something about the real world. Paxton flags both conditions as unsolved.
The closest things the field has are static benchmark suites and one-off competitions. PolarRiS, a benchmark suite that Paxton cites as a useful proxy for real-world robot performance, illustrates the trade-off. It correlates with what robots actually do, but it is not a live leaderboard: the authors do not update it, and the entrant set is fixed at paper-writing time. That means it can rank the policies that existed when it was released, not the ones shipping now. Similar limits apply to the open-vocabulary mobile manipulation competition, or OVMM, a recurring contest in which robots are asked to follow free-form natural-language instructions in simulated home environments. It is one of the most realistic public tasks, and it runs on a fixed schedule, which is closer to a leaderboard but still not a continuous, community-driven arena.
The most interesting recent experiment is RoboArena, a new crowd-judged leaderboard for robot policies. Instead of a single scripted score, RoboArena uses volunteer-pair judging, a model borrowed from the Chatbot Arena: two policies face the same task, a human picks the better one, and the results are aggregated into a ranking. Early results from that model are starting to show which evaluation choices matter most. Reset realism, the question of how cleanly the environment is returned to a known starting state between attempts, turns out to be quietly decisive. If resets are too easy, policies can game the setup; if they are too hard, the benchmark becomes uneconomical to run. A similar tension shows up in anti-gaming. Once a benchmark becomes prestigious, the incentive to overfit to it grows. Paxton points to the Llama 4 controversy as the cautionary tale on the language-model side, where a model was posted to a public leaderboard under conditions that did not match its public configuration. A robotics arena would face the same temptation, with the added complication that gaming a physical task often means gaming a simulator instead.
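To make the aggregation step concrete, here is a minimal sketch of how pairwise human votes could be turned into a ranking. It uses a plain Elo update, which is an assumption chosen for illustration; the article does not say which aggregation method RoboArena uses. The function name, policy names, and parameters (`k`, `base`) are all hypothetical.

```python
from collections import defaultdict


def elo_rankings(comparisons, k=32.0, base=1000.0):
    """Rank policies from (winner, loser) votes with a simple Elo update.

    Illustrative only: Elo is an assumed stand-in, not RoboArena's
    documented method.
    """
    ratings = defaultdict(lambda: base)
    for winner, loser in comparisons:
        # Probability the winner was expected to win, given current ratings.
        expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        # Surprising wins move the ratings more than expected ones.
        delta = k * (1.0 - expected_win)
        ratings[winner] += delta
        ratings[loser] -= delta
    # Highest rating first.
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# Hypothetical votes: each tuple is one human judgment on one shared task.
votes = [
    ("policy_a", "policy_b"),
    ("policy_a", "policy_c"),
    ("policy_b", "policy_c"),
]
ranking = elo_rankings(votes)
for name, rating in ranking:
    print(f"{name}: {rating:.1f}")
```

Because each update depends only on the two policies being compared, new entrants can join at any time without rerunning earlier comparisons, which is part of what makes a pairwise arena easier to keep live than a fixed suite.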
What does the gap actually cost the field? Mostly, the ability to make honest comparative claims. A team that improves a manipulation policy can show a better success rate on PolarRiS, but the claim stops at the policies and environments that suite was built around. Without a live arena, there is no way to ask the equivalent of "is this new robot policy better than last year's best, on a task neither team tuned for?" For language models, the Chatbot Arena effectively answers that question every day. Robotics does not have a version of it yet.
Paxton is explicit that the absence is not permanent. The point of the post is that evaluation is a buildable infrastructure problem, and the field is now iterating on it. RoboArena's volunteer-pair judging is one design choice. Static suites that correlate with real-world performance are another. The next step is something that combines continuous updates, realistic resets, and a community that wants to climb it. Whether that arrives depends on the same ingredients any infrastructure project depends on: cost, buy-in, and the willingness to publish a ranking that some entrant will try to game.
The honest read of the state of robot learning is that the science of training policies is moving faster than the science of measuring them. Until those two move together, every claim about progress toward general-purpose robots is going to come with the same footnote: the field still does not have a public scoreboard everyone agrees on.