The Interpretability Gap: Researchers Who Build Advanced AI Want Tools to Audit It

The scientists who helped build today's most capable AI systems are publicly warning that nobody, including them, can fully explain how those systems reach their conclusions, and they want the gap treated as a measurement problem, not a mystery.

Eric Horvitz, chief scientific officer at Microsoft, and Robert West, a computer scientist at EPFL in Switzerland, outlined the dangers of putting AI interpretability on the back burner in a Science article published in June 2026. They call for new AI benchmarks and dedicated tooling to unpick the inner workings of advanced AI, framing the effort as analogous to the long, hard work of understanding the human brain. "Preserving human agency must therefore remain a central goal," the authors write.

The practical stakes are not abstract. AI tools already shape how people search, decide, and form judgments, and the researchers who build them are not immune. If the people writing the systems cannot trace the path from input to output, then the people using those systems to vet a job applicant, summarize a deposition, or rank a medical claim are inheriting that opacity every time they accept an answer. As the authors put it: "While human understanding of AI declines, AI understanding of humans deepens, producing new forms of behavioral opacity."

A small but growing body of work is already attacking the problem from directions borrowed from neuroscience and psychology. Anthropic, creator of Claude, has linked patterns of algorithmic activity to specific concepts and reverse-engineered parts of neural networks to expose how internal computations shape responses. OpenAI is training algorithms that work in more explainable steps and building reasoning models that pause, "think," and justify their conclusions in plain language. DeepMind is building microscope-like tools for neural networks, a project it calls Gemma Scope, to help researchers peer into the networks' decision-making.

Others borrow from psychology, treating AI as a participant in behavioral studies. Google Brain has applied cognitive psychology methods to study AI behavior, while researchers have investigated whether LLMs can mimic aspects of "theory of mind" — the ability to infer what others are thinking and feeling. Another method, inspired by how the brain etches memories during sleep, reduces the tendency of AI to forget old knowledge while learning new tasks.

What Horvitz and West are pushing for is a step beyond any single lab's tooling: shared benchmarks that let the field as a whole measure interpretability the way it measures accuracy. A model that can pass a reasoning test and a coding test should also be expected to pass a test that asks it to explain, faithfully, why it reached the answer it did, and an outside auditor should be able to verify that explanation against the model's internals. That kind of benchmark does not exist at scale yet, and the authors' argument is that the window for setting the rules while the systems are still small enough to inspect is narrowing faster than the tooling is arriving.
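
To make the idea concrete, here is a minimal sketch of how such a benchmark might score a model on both accuracy and explanation faithfulness. It is an illustration only, not the authors' proposal: the toy linear "model", the feature names, the weights, and the deletion-style check (zero out the features an explanation cites and see whether the decision changes) are all assumptions chosen for brevity.

```python
from dataclasses import dataclass

# Hypothetical toy "model": a linear scorer whose weights stand in for the
# internals an outside auditor could inspect. Names and numbers are made up.
WEIGHTS = {"years_experience": 0.6, "has_degree": 0.3, "typos_in_cv": -0.5}
THRESHOLD = 1.0


def predict(features: dict[str, float]) -> bool:
    """Return True when the weighted feature sum clears the threshold."""
    score = sum(WEIGHTS[name] * value for name, value in features.items())
    return score >= THRESHOLD


def is_faithful(features: dict[str, float], cited: set[str]) -> bool:
    """Deletion check: zeroing the cited features should change the decision."""
    original = predict(features)
    ablated = {k: (0.0 if k in cited else v) for k, v in features.items()}
    return predict(ablated) != original


@dataclass
class BenchmarkItem:
    features: dict[str, float]
    expected: bool   # the correct answer for this item
    cited: set[str]  # features the model's explanation claims it relied on


def score(items: list[BenchmarkItem]) -> tuple[float, float]:
    """Return (accuracy, faithfulness rate) over the benchmark items."""
    correct = sum(predict(i.features) == i.expected for i in items)
    faithful = sum(is_faithful(i.features, i.cited) for i in items)
    return correct / len(items), faithful / len(items)
```

In this sketch a model can be fully accurate and still score poorly on faithfulness: an explanation that cites `has_degree` for a decision actually driven by `years_experience` fails the deletion check even though the answer is right. Real systems are far less transparent than a three-weight scorer, which is exactly the gap the proposed benchmarks are meant to measure.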

Three trends are compounding the problem. First, AI is increasingly used to train, benchmark, and improve other AI: AI "judges" now score qualities like helpfulness, rank outputs, detect hallucinations, and assess new releases. Using AI to solve an AI-induced problem introduces a paradox, because if AI-generated explanations become too complex for humans to verify, opacity compounds. Second, networks of interacting AI agents, already common in scientific research and drug discovery, may develop communication patterns that drift from human language and reasoning, making them harder to interpret. Third, AI systems trained on human behavior play a growing role in everyday decisions, which deepens the asymmetry the authors describe: people understand AI less even as AI understands people better.

The constructive frame is deliberate. The article does not describe AI opacity as a property of some future superintelligence; it describes it as the present state of systems already deployed in search engines, customer service flows, and coding assistants. "The goal is not just more capable AI, but AI that is more intelligible, accountable, and aligned with human aims," the authors write. Treating the gap as a design problem with active workstreams, rather than a doom scenario, is what makes the warning actionable: it gives regulators, customers, and procurement teams something concrete to ask for, and gives researchers something concrete to compete on.

Sources