The Stroop Test Stumps Every Major Chatbot. What That Means for Human-Level AI


The test is ninety years old and childishly simple: name the ink color of a color word. Print the word "red" in blue ink and the right answer is "blue," even though the word screams red. The point is to suppress the automatic urge to read and answer with what the eyes actually see. Psychologists have used this color-word interference task since 1935 to measure a faculty they call executive control, the brain's ability to hold a goal in mind while ignoring distractions.

In a peer-reviewed paper published this month in PNAS Nexus, a CUNY-led team reports that every major chatbot they tested fails this test in the same way. On a single word, GPT-4o and Claude 3.5 Sonnet answer correctly. Stack the words into a short list and accuracy holds. Make the list longer, mix matching and mismatching items, and the same models start defaulting to the word instead of the color. The pattern is so consistent that the researchers describe it as a fingerprint of a missing piece in how these systems pay attention.

The quantitative collapse is sharp. According to the PNAS Nexus study, GPT-4o dropped from roughly 91 percent accuracy on five-word incongruent lists to about 57 percent at ten words and around 15 percent at forty. Claude 3.5 Sonnet held up through twenty words, then fell to roughly 24 percent at forty. In mixed lists, accuracy on the incongruent items alone fell toward zero. Replications on GPT-5, Claude Opus 4.1, and Gemini 2.5 reproduced the same load-dependent collapse, Neuroscience News reports, with ScienceDaily and Singularity Hub carrying the same numbers. The collapse is the empirical story; everything else is interpretation.
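To see why accuracy on the incongruent items is the number to watch in a mixed list, a scoring harness can be sketched in a few lines. This is a minimal illustration and not the study's protocol: the four-color palette, the `(word, ink)` pair encoding, and the helper names `make_stroop_list`, `score`, and `word_reader` are all assumptions made for the example, and `word_reader` is a stand-in that always reads the word, not a real model.

```python
import random

# Hypothetical palette; the study's actual stimuli are not described here.
COLORS = ["red", "blue", "green", "yellow"]

def make_stroop_list(n_items, incongruent_fraction, rng):
    """Build a shuffled list of (word, ink) pairs.

    Incongruent items have an ink color that differs from the word.
    """
    n_incongruent = round(n_items * incongruent_fraction)
    items = []
    for i in range(n_items):
        word = rng.choice(COLORS)
        if i < n_incongruent:
            ink = rng.choice([c for c in COLORS if c != word])
        else:
            ink = word
        items.append((word, ink))
    rng.shuffle(items)
    return items

def score(items, answers):
    """Return (overall accuracy, accuracy on incongruent items only)."""
    correct = [answer == ink for (_, ink), answer in zip(items, answers)]
    incongruent = [ok for (word, ink), ok in zip(items, correct) if word != ink]
    overall = sum(correct) / len(items)
    incongruent_acc = sum(incongruent) / len(incongruent) if incongruent else None
    return overall, incongruent_acc

def word_reader(items):
    """Stand-in responder that names the word and ignores the ink."""
    return [word for word, _ in items]

rng = random.Random(0)
items = make_stroop_list(40, 0.5, rng)
overall, incongruent_acc = score(items, word_reader(items))
print(overall, incongruent_acc)  # 0.5 0.0
```

A responder that only reads the word still scores 50 percent overall on a half-congruent list, because every matching item comes out right by accident. Scored on the incongruent items alone, it gets zero, which is why the two numbers have to be reported separately.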

The reason, the authors argue, is structural. The mechanism that makes modern chatbots work is called self-attention, a token-weighting computation introduced in the 2017 paper "Attention Is All You Need" by Vaswani and colleagues. Every large language model now in production, from Claude to Gemini to ChatGPT, is a descendant of that architecture. Self-attention lets a model decide which earlier words to weight when generating the next one. It is, in the authors' framing, an analog of attention without any analog of executive control: the part of human attention, formalized by neuroscientists like Michael Posner, that holds a goal active and inhibits competing responses.

To make that concrete: when a human takes the Stroop test, three networks collaborate. Alerting keeps the brain awake and ready. Orienting steers the eyes to the right place. Executive control suppresses "red" so the mouth can say "blue." The CUNY team, led by Suketu Patel, argues that transformer self-attention has machinery for the first two and nothing that does the third. The Stroop result is what that absence looks like under load.
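The token-weighting computation itself is compact enough to sketch. What follows is a single-head, causally masked version of scaled dot-product attention in plain Python, under stated simplifications: real transformers first pass each token through learned query, key, and value projections and run many heads in parallel, whereas this toy reuses the same vectors for all three roles. The function names are illustrative and not drawn from any library.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def causal_self_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Position t scores itself and every earlier position, converts the
    scores to weights, and returns the weighted average of the values.
    """
    d_k = len(keys[0])
    width = len(values[0])
    outputs, weights = [], []
    for t, query in enumerate(queries):
        # Causal mask: only positions 0..t are visible from position t.
        scores = [dot(query, keys[j]) / math.sqrt(d_k) for j in range(t + 1)]
        w = softmax(scores)
        out = [sum(w[j] * values[j][d] for j in range(t + 1)) for d in range(width)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Toy "token" vectors, reused as queries, keys, and values (an assumption;
# a trained model would apply learned projections first).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(tokens, tokens, tokens)
for t, row in enumerate(weights):
    print(t, [round(x, 3) for x in row])
```

Each row of weights is a soft vote over earlier tokens. Nothing in the computation holds a task goal or vetoes a strongly weighted but wrong candidate, and that is the gap the authors point to.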

In their own words, the authors describe today's AI attention as "fundamentally limited." They write that adding mechanisms "similar to those in biological attention is crucial for achieving artificial general intelligence." That framing is theirs, not consensus; it is a load-bearing claim from a single research group, and reasonable readers will weigh it differently. The underlying result, that increasing demand reliably breaks the same models in the same way, is independent of how one feels about AGI.

That result is part of a broader shift in how researchers probe machine cognition. Standard NLP benchmarks, the leaderboards that grade models on translation or reading comprehension, measure what models can produce. Cognitive psychology batteries measure how they process. Stroop tests executive control. Theory-of-mind tests probe whether a model can represent another agent's beliefs. Personality and emotional-intelligence tests, increasingly common in the literature, probe stability under varied prompts. The shift matters because it surfaces failure modes that benchmarks tend to hide.

Successor work on attention architectures is already moving in directions consistent with the paper's diagnosis. Sparse attention, memory-augmented attention, and retrieval-augmented variants are being explored across the field. A representative recent paper, "S3-Attention," explores selective state-space attention for long-context workloads. None of these directly replicate the brain's executive control network, and the PNAS Nexus paper does not claim they do. The point is narrower: today's flagship architecture has a named, isolable limit under a ninety-year-old test, and the authors are betting that limit is the next place to look.

For readers, the portable move is small but useful. The next time a chatbot capability claim catches your eye, ask what the test conditions were. How long was the input? Was the distractor the same kind every time, or mixed? Did the model have to suppress a strong prior to give the right answer? A short list with matched pairs is not the same test as a long mixed list, and the difference is exactly where today's architectures stop behaving like the cognitive systems they are often compared to.

