When two people need to communicate and share no grammar, they invent one. Traders on colonial routes improvised pidgins; linguists see the same pressure behind the ancient civilizations that built structured number systems from scratch. A new benchmark, Math Takes Two, puts this idea under an unusual microscope: it drops humans and AI agents into identical communicative constraints and asks what each side invents under pressure.
The work comes from Cooper Cognitive, a small research shop run by Michael Cooper and Samuel Cooper. It was presented at HCAIR (Human-Centered AI Reasoning), a workshop on research at the intersection of how people think and how machines do, held at ICLR, the International Conference on Learning Representations, a major annual AI research meeting taking place this week in Rio de Janeiro.
The setup: a Speaker sees an image of objects arranged in a grid; a Listener must identify the matching image from a set of candidates based on a single message. Messages are limited to eight characters drawn from an alphabet of eight symbols: A, B, C, 0, 1, 2, +, and *. No pretrained math, no human language, no shared grammar. The point is to test whether reasoning can emerge from grounded communication: can a system build working structure from scratch under communicative pressure, rather than retrieving stored answers?
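The constraint itself is small enough to state in code. Here is a minimal sketch, assuming nothing about the agents themselves; the function and variable names are illustrative, not drawn from the paper's released code:

```python
# Minimal sketch of the channel constraint described above. Speaker and
# Listener policies are left abstract; names here are illustrative only.

ALPHABET = set("ABC012+*")  # the eight permitted symbols
MAX_LEN = 8                 # messages are at most eight characters long

def is_valid_message(msg: str) -> bool:
    """True if msg fits the channel: non-empty, short enough, legal symbols only."""
    return 0 < len(msg) <= MAX_LEN and all(ch in ALPHABET for ch in msg)

def play_round(speaker, listener, target, candidates) -> bool:
    """One referential round: Speaker describes target, Listener picks from candidates."""
    msg = speaker(target)
    if not is_valid_message(msg):
        raise ValueError(f"message {msg!r} violates the channel constraint")
    guess = listener(msg, candidates)
    return guess == target
```

Everything interesting happens inside the two callables: the channel itself grants no vocabulary beyond those eight symbols, so any shared meaning has to be negotiated across rounds.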
On out-of-distribution tasks — novel combinations of symbols and quantities the models had never encountered — the best baseline model scored as low as 23 percent. Human pairs on the same subtest scored 69 percent. Overall: 72 percent for the best model, 87 percent for humans, per the paper's results table.
The human advantage is not simply a matter of raw intelligence. The researchers describe what happened when participants confronted the same constraints as the AI agents: they reinvented base-3 encoding, a positional number system structurally parallel to the positional systems that Mesopotamian and Mesoamerican civilizations developed over centuries. The mechanism was the same in both cases: communicative necessity forcing the invention of structure. You need to be understood; therefore you build a shared system. The researchers call this emergent numeracy: not abstract knowledge waiting to be retrieved, but a practical tool forged under the weight of communication.
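To see why positional structure is the efficient answer, consider the channel budget. A minimal sketch of one plausible convention follows; the digit choice and ordering are assumptions for illustration, since the paper does not publish the exact scheme participants converged on:

```python
# One plausible base-3 convention under the eight-symbol channel; the digit
# assignment and ordering here are assumptions, not the participants' scheme.

DIGITS = "012"  # three of the eight permitted symbols double as base-3 digits

def encode_base3(n: int) -> str:
    """Encode a non-negative count in base 3, most significant digit first."""
    if n == 0:
        return "0"
    out = []
    while n:
        n, r = divmod(n, 3)
        out.append(DIGITS[r])
    return "".join(reversed(out))

# Why positional encoding wins: flat naming with eight symbols distinguishes
# only a handful of quantities, but eight base-3 digits distinguish
# 3**8 = 6561 of them (0 through 6560).
assert encode_base3(14) == "112"  # 1*9 + 1*3 + 2*1 = 14
```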
The benchmark has known limitations. Human participants had technical backgrounds and were given notebooks, resources not available to the AI agents and a plausible advantage the paper does not fully control for. No major AI lab has yet run its frontier models on Math Takes Two; the current results come from symbolic autoencoders, not GPT-class systems. Whether frontier models would do better, or whether they would find more sophisticated ways to pattern-match around the constraints, is an open question.
The broader argument the paper makes is that mathematical reasoning may require communicative pressure to emerge — that numbers were invented not as abstract truths waiting to be discovered, but as practical tools for making oneself understood. If that is right, the path to AI reasoning may not be scaling alone. It may require putting AI agents in positions where they must be understood by other agents under constraints that forbid shortcuts.
A separate evaluation, the COMMA benchmark, found that chain-of-thought reasoning systems (a class that includes OpenAI's o-series and DeepSeek-R1) perform at or below random chance when required to collaborate on novel tasks. Reasoning models built for solo performance do not automatically generalize that capability to settings requiring negotiation between agents. The Math Takes Two paper does not claim to prove the same is true for frontier models; it provides the test.
Cooper Cognitive has released the benchmark on GitHub with code and datasets. The authors argue the design is clean enough to replicate and the question is important enough to pursue at frontier scale. Whether the major labs agree and actually run their models is the experiment that has not yet happened.