Your LLM Does Not Know What It Does Not Know

reported by Sky · 4 min read · published May 25, 2026


Your AI is computing a number right now. Every time a language model answers a question (whether to approve a loan, how to diagnose a patient, what to put in a legal brief), a confidence score comes with it. The default version of that number is the maximum softmax probability, or MSP: the highest probability the model assigns to a token in its own output. It is cheap, it is fast, and according to a new paper from King's College London, it is lying to you.
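To make the definition concrete, here is a minimal sketch of MSP for a single next-token step, assuming the raw logits are available as a plain list of floats. The function names and the toy logits are illustrative and do not come from the paper.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable form)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def max_softmax_probability(logits):
    """MSP: the largest probability in the softmax distribution."""
    return max(softmax(logits))

# Toy example: hypothetical logits over a four-token vocabulary.
peaked = [8.0, 1.0, 0.5, 0.2]  # one token dominates
flat = [1.0, 1.0, 1.0, 1.0]    # the model has no preference

print(max_softmax_probability(peaked))  # close to 1
print(max_softmax_probability(flat))    # 0.25
```

Note what the number does and does not capture: it summarizes only the final distribution over tokens, with no record of how the model arrived there.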

The paper, submitted to ICML 2026 on May 19, makes a simple claim: the information you need to know whether an LLM is about to hallucinate is not in the final answer. It is in the path.

The researchers (Aliai Eusebi, Alexander Herzog, Xiaoyu Liang, Marie Vasek, Enrico Mariconti, and Lorenzo Cavallaro) tracked how language models build their representations layer by layer, following the cumulative contribution of each layer's MLP sub-network to the model's residual stream. Think of it as reading the model's working memory as it thinks, not just its final answer. They extracted eleven scale-invariant geometric features from these trajectories, describing how updates accumulate, reverse, drift, or commit across depth, and fed them to a sparse linear probe. The probe outperformed MSP on selective abstention tasks by up to 21 AURC points, where AURC is the area under the risk-coverage curve: the error rate on the answers a system keeps, averaged over how many it chooses to answer (lower is better). The gains grew precisely where baseline miscalibration was worst: the more overconfident the model, the more the trajectory signal revealed.
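For readers unfamiliar with the metric, AURC is the area under the risk-coverage curve, and lower is better. The sketch below uses the standard empirical definition; the toy confidences and labels are made up, and how the paper scales the value into "points" is not something this example assumes.

```python
def aurc(confidences, correct):
    """Area under the risk-coverage curve (lower is better).

    Answers are ranked from most to least confident. At each coverage
    level k/n the risk is the error rate among the k most confident
    answers; AURC is the average risk over all coverage levels.
    """
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    errors = 0
    risks = []
    for k, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        risks.append(errors / k)
    return sum(risks) / len(risks)

# Toy data (made up): the same five answers scored by two confidence signals.
correct = [True, True, False, True, False]
overconfident = [0.90, 0.80, 0.95, 0.70, 0.85]  # wrong answers rank high
informative = [0.90, 0.80, 0.20, 0.70, 0.10]    # wrong answers rank low

print(aurc(overconfident, correct))  # higher, i.e. worse
print(aurc(informative, correct))    # lower, i.e. better
```

Accuracy is identical in both cases. Only the ranking of right and wrong answers differs, and the ranking is what a confidence signal is supposed to get right when a system abstains.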

The key word is interpretable. Every feature has a closed-form geometric meaning. A coefficient does not just say the model is wrong — it says which layer committed prematurely, which layer contradicted the running state, where the trajectory drifted from its endpoint. A radiologist reading a scan does not just get a probability score; they get a lesion. The probe produces something similar: a layer-level diagnostic rather than a single confidence number.
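A rough sketch of what layer-level geometric diagnostics can look like, assuming each layer's MLP update is available as a vector. The two measures below are hypothetical stand-ins chosen for illustration; they are not the paper's eleven features, and the toy trajectory is invented.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na > 0 and nb > 0 else 0.0

def trajectory_features(updates):
    """Illustrative per-layer diagnostics from a list of MLP update vectors.

    Hypothetical stand-ins, not the paper's features. Each one is a cosine,
    so rescaling every update by the same positive constant leaves it
    unchanged (scale invariance).
    """
    # Running state after each layer: the cumulative sum of updates.
    states = []
    running = [0.0] * len(updates[0])
    for u in updates:
        running = [r + x for r, x in zip(running, u)]
        states.append(running)
    endpoint = states[-1]

    # Alignment of each running state with the final state (1.0 = on course).
    endpoint_alignment = [cosine(s, endpoint) for s in states]
    # Alignment of each update with the state before it (negative = reversal).
    update_alignment = [cosine(updates[i], states[i - 1])
                        for i in range(1, len(updates))]
    # Zero-based indices of layers whose update opposes the running state.
    reversing = [i for i, c in enumerate(update_alignment, start=1) if c < 0]
    return endpoint_alignment, update_alignment, reversing

# Toy trajectory (made up): two layers push one way, the third reverses.
updates = [[1.0, 0.0], [1.0, 0.0], [-3.0, 1.0], [0.0, 1.0]]
alignment, _, reversing = trajectory_features(updates)
print(reversing)  # [2]
print(alignment)  # early states point away from the endpoint
```

The point of the sketch is the shape of the output: a per-layer readout that names where a reversal happened, rather than one scalar for the whole answer.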

"Weaknesses at the end reflect early divergence," the paper says. "Errors that MSP assigns to the final layer actually originate much earlier."

The paper validates across nine instruction-tuned models — Qwen, Llama, and DeepSeek families ranging from 3 billion to 72 billion parameters — on five natural language processing tasks. It builds on earlier work by Marks and Tegmark at MIT, who found linear structure separating true from false statements in LLM hidden states, and on Apple's 2024 benchmarking study, which found that multi-sample uncertainty methods often outperform single-sample approaches by only marginal amounts despite substantially higher computational cost.

The catch is real. The validation covers benchmarks only, not adversarial or out-of-distribution inputs. The code repository is anonymous and not yet publicly accessible. All three model families tested are open-weight; there is no evidence yet that trajectory probing improves GPT-4-class closed models. A 21 AURC point improvement on a selective abstention task is substantial by academic standards, but it is a lab measurement on multiple-choice questions, not a deployment test on a system under adversarial prompting.

That said, the mechanism is plausible and the framing matters. MSP has been the default uncertainty signal in production AI systems precisely because it is cheap and available everywhere. If the information the paper describes (trajectory evidence of premature commitment, state contradiction, and endpoint drift) is genuinely recoverable from layer-wise MLP updates in transformer architectures, then the question is not whether it works but whether it can be deployed at acceptable inference cost. A sparse linear probe over eleven features is not heavy. The authors ran their experiments on an NVIDIA H100 NVL GPU, but the probe reads MLP outputs that are already computed during the forward pass. The marginal cost of trajectory extraction is far smaller than that of running the model multiple times.
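To give a sense of scale, here is a minimal sketch of scoring with a sparse linear probe over eleven features, assuming a logistic form. The weights, bias, threshold, and function names are invented for illustration; the paper's fitted probe is not reproduced here.

```python
import math

# Hypothetical probe: eleven feature slots, most weights exactly zero (sparse).
# These values are made up for illustration, not taken from the paper.
WEIGHTS = [0.0, 1.8, 0.0, 0.0, -2.4, 0.0, 0.0, 0.9, 0.0, 0.0, 0.0]
BIAS = -0.3

def error_risk(features):
    """Logistic probe: estimated probability that the answer is wrong."""
    if len(features) != len(WEIGHTS):
        raise ValueError("expected %d features" % len(WEIGHTS))
    # Only the non-zero weights contribute, so this is a few multiply-adds.
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, features) if w != 0.0)
    return 1.0 / (1.0 + math.exp(-z))

def should_defer(features, threshold=0.5):
    """Route the answer to a human when estimated risk exceeds the threshold."""
    return error_risk(features) > threshold

# Toy inputs (made up): a neutral trajectory and one with a single red flag.
neutral = [0.0] * 11
flagged = [0.0, 1.0] + [0.0] * 9

print(should_defer(neutral))  # False
print(should_defer(flagged))  # True
```

With three non-zero weights, scoring one answer costs three multiplications and a sigmoid. Whatever the real deployment cost turns out to be, it would come from extracting the features, not from evaluating the probe.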

The practical implication is for anyone building on top of language model outputs: if you are using MSP to decide whether to defer a decision to a human, you may be flying blind in proportion to your model's miscalibration. The paper suggests the blind spot was always there — flattened out when the model collapsed its layer-wise computation into a single probability distribution. The trajectory was telling you. You just were not reading it.

What the paper does not yet show is whether this survives contact with the real world: adversarial inputs, distribution shift, the kind of user behavior that real production systems encounter. The 21 AURC point gain is a ceiling, not a floor. It is a genuine signal in a constrained setting. Whether it holds up when someone is actively trying to fool your model is an open question — and that is the right question to ask before anyone ships it.

