AI Models Ace Tests Alone. Put Them On A Live Team With A Clock, And They Freeze.

Most of today's AI benchmarks hand a model a problem, give it time to think, and leave it alone. A new benchmark changes all three: it puts two AI systems in a room with a ticking clock, lets one see the bomb but not the manual and the other read the manual but not see the bomb, and makes them defuse it together or not at all. Every multimodal model the researchers tested failed to clear a single bomb in real time. The way they failed tells a more useful story than the score.

The setup borrows from Keep Talking and Nobody Explodes, a cooperative game where one player stares at a bomb covered in wires while a teammate reads the defusal manual out loud. Neither can win alone; communication is the game. The new benchmark, GPTNT, ports that dynamic to multimodal AI agents and isolates three conditions real collaboration demands but most AI tests skip: time pressure, asymmetric information, and imperfect communication.

What the first experiments document is a recurring pattern, not a one-off miss. Across the systems tested, four failure modes show up: state tracking, where the defuser loses track of what the bomb currently looks like after a few exchanges; action under pressure, where even a correct instruction is fumbled or skipped as the timer runs down; ambiguity handling, where the systems do not reliably ask the right clarifying question when the manual reads unclearly; and error recovery, where a wrong step goes unnoticed, undiagnosed, and uncorrected. Human pairs playing the same configuration clear bombs that the AI pairs do not.

GPTNT is asynchronous and live, a deliberate departure from turn-based agent benchmarks. One agent can see and manipulate the bomb but cannot read the manual. The other has the manual but cannot see the bomb. Each agent has the full skill set the task requires; neither has the full information set. To win, the expert has to describe the manual clearly enough, and the defuser has to track and act on that description accurately enough, before the countdown ends. Bomb modules are procedurally generated, so memorized walkthroughs do not transfer from run to run.
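To make the split concrete, here is a minimal Python sketch of an asymmetric-information round. It is not GPTNT's code: the wire module, the manual rule, the function names, and the per-message time cost are all invented for illustration, and the clock is simulated in turns rather than running live and asynchronously as the benchmark does. What it shows is the shape of the problem: the expert only ever reasons about the defuser's description, so a lossy description or a slow exchange is enough to lose the round.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

COLORS = ["red", "blue", "yellow", "white", "black"]  # hypothetical palette


@dataclass
class WireModule:
    """Toy bomb module: a row of colored wires, exactly one of which is safe to cut."""
    wires: List[str]


def generate_module(seed: int) -> WireModule:
    """Procedurally generate a module so a memorized walkthrough does not transfer."""
    rng = random.Random(seed)
    count = rng.randint(3, 6)
    return WireModule(wires=[rng.choice(COLORS) for _ in range(count)])


def manual_rule(wires: List[str]) -> int:
    """Toy manual, visible to the expert only. Returns the index of the wire to cut.

    Invented rule: with no red wire, cut the first wire; otherwise cut the last red wire.
    """
    if "red" not in wires:
        return 0
    return len(wires) - 1 - wires[::-1].index("red")


def run_episode(
    module: WireModule,
    describe: Callable[[List[str]], List[str]],
    time_limit: int,
    seconds_per_message: int,
) -> str:
    """Play one round on a simulated clock where every message costs time."""
    clock = time_limit

    # Defuser: sees the bomb, not the manual, so it can only describe what it sees.
    description = describe(module.wires)
    clock -= seconds_per_message
    if clock <= 0:
        return "exploded"

    # Expert: sees the manual, not the bomb, so it reasons from the description alone.
    instruction = manual_rule(description)
    clock -= seconds_per_message
    if clock <= 0:
        return "exploded"

    # Defuser cuts the wire it was told to cut; only the truly safe wire defuses.
    return "defused" if instruction == manual_rule(module.wires) else "exploded"


def faithful(wires: List[str]) -> List[str]:
    """A defuser that reports every wire."""
    return list(wires)


def truncated(wires: List[str]) -> List[str]:
    """A defuser that loses track of the last wire (a stand-in for poor state tracking)."""
    return list(wires[:-1])


if __name__ == "__main__":
    bomb = WireModule(wires=["blue", "red", "red"])
    print(run_episode(bomb, faithful, time_limit=10, seconds_per_message=3))   # defused
    print(run_episode(bomb, truncated, time_limit=10, seconds_per_message=3))  # exploded
    print(run_episode(bomb, faithful, time_limit=5, seconds_per_message=3))    # exploded
```

In this toy version both agents apply their skills perfectly, and the round is still lost in two of the three runs: once because the description dropped a wire, once because the exchange took longer than the timer allowed.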

The four failure modes map onto the gap the benchmark was built to expose. Existing multimodal evaluations, including image captioning, document understanding, visual question answering, and even turn-based agent benchmarks, measure component capabilities. They confirm that today's frontier models can read a manual, can describe a picture, and can follow a sequence of instructions. GPTNT measures whether those components deploy under the conditions where collaboration actually happens: when the partner cannot see what you see, when the clock is running, and when the message channel is the only thing holding the work together.

The project site frames the gap explicitly. Component evaluations answer whether the model can see this and whether it can follow that. The collaboration question is whether the model can keep two halves of a task aligned when neither half is complete on its own. Until that question is asked, claims about how well AI systems collaborate rest on inference from tests that never required collaboration in the first place.

The scope of the result matters. The paper's finding is that no multimodal AI system tested defused a single bomb under GPTNT's live, asymmetric, time-pressured conditions. That is a finding about the systems the authors evaluated on the configuration they built. It is not a finding about AI in general, and it is not a claim that AI cannot collaborate. It is a finding about measurement: turn-based, fully informed, single-agent benchmarks do not predict performance in conditions they were never designed to test.

The bomb game is the controlled environment. The failure modes are the substance. The takeaway is about evaluation design: if future agent benchmarks keep testing component skills in isolation while deployment contexts keep asking for live coordination, the gap between benchmark scores and working systems will keep widening, and the next "AI passed the test" headline will keep failing to predict what happens when the clock starts.
