Between the Layers Lies the Truth: Uncertainty Estimation in LLMs Using Intra-Layer Local Information Scores
When a large language model tells you something with total confidence, you want to know whether that confidence is earned. You especially want to know before the model is running on quantized inference hardware in a production system where nobody is watching every output. That is the problem a team at the Technion — Israel Institute of Technology is trying to solve, and their approach is novel enough to be worth sitting with.
In a paper posted to arXiv on March 17, 2026, Zvi Badash, Yonatan Belinkov, and Moti Freiman propose a method they call Intra-Layer Local Information Scores — a way to estimate how uncertain an LLM really is about what it is saying, using only the patterns of activation agreement across layers during a single forward pass. No second forward pass, no architectural changes, no ensemble. Just a compact signature of how the model's layers internally agree or disagree.
The standard approaches to this problem have real tradeoffs. Output-based heuristics — looking at token probabilities or entropy at the final layer — are cheap but brittle. A model can assign high probability to a fluent, coherent answer that is factually wrong. Probing — training a classifier on internal activations — is more effective but requires storing high-dimensional representations for every token and training a separate model. It is also notoriously hard to transfer from one distribution to another, which matters when your model encounters something outside its training data.
What Badash and colleagues do is extract pairwise KL divergences between temperature-scaled softmax distributions of post-MLP activations across all L layers. That gives an L×L signature map per token — a compact representation of cross-layer agreement. A LightGBM classifier then maps that signature to a per-instance uncertainty score. The whole thing runs in a single forward pass, which is the point: you want this to be cheap enough to run in production, not just in a research evaluation.
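To make the signature concrete, here is a minimal NumPy sketch of the idea as described above, not the authors' implementation. It assumes the softmax is taken over each layer's hidden dimension, that entry (i, j) holds KL(p_i || p_j), and that the temperature is a free parameter; the function name `kl_signature` and the toy shapes are hypothetical, and the LightGBM step is left out.

```python
import numpy as np

def log_softmax(x, temperature):
    """Numerically stable log-softmax over the last axis, after temperature scaling."""
    z = x / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def kl_signature(layer_acts, temperature=1.0):
    """Pairwise KL map for one token.

    layer_acts: array of shape (L, d), one activation vector per layer
                (assumed here to be the post-MLP activations).
    Returns an (L, L) array whose [i, j] entry is KL(p_i || p_j).
    """
    log_p = log_softmax(np.asarray(layer_acts, dtype=np.float64), temperature)
    p = np.exp(log_p)
    # KL(p_i || p_j) = sum_k p_i[k] * log p_i[k] - sum_k p_i[k] * log p_j[k]
    self_term = (p * log_p).sum(axis=1)   # shape (L,)
    cross_term = p @ log_p.T              # shape (L, L)
    return self_term[:, None] - cross_term

# Toy usage: 4 "layers" with 16-dimensional activations (random stand-ins).
rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 16))
signature = kl_signature(acts, temperature=2.0)
features = signature.ravel()  # flattened map, e.g. as input to a classifier
print(signature.shape)        # (4, 4)
```

The map is asymmetric because KL divergence is, so both triangles carry information; only the diagonal is zero by construction.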
The results hold up across three models (Llama-3.1-8B base, non-instruct; Qwen3-14B-Instruct; and Mistral-7B-Instruct-v0.3) and ten datasets including TriviaQA, MMLU, Natural Questions, and GSM8K. In-distribution, where the classifier is trained and tested on the same dataset (the diagonal of the train/test grid), the method roughly matches probing, with mean differences of at most -1.8 AUPRC percentage points and +4.9 Brier points. The more interesting number is what happens under cross-dataset transfer, the off-diagonal cells: gains of up to +2.86 AUPRC percentage points and +21.02 Brier points over probing. That gap is where the method earns its claim to being more transferable.
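For readers unfamiliar with the two metrics, the sketch below shows how each is computed on a toy example. The labels and scores are made up for illustration, the label convention (1 means the model's answer was correct) is an assumption, and the step-wise average-precision estimate of AUPRC used here ignores tied scores; the paper's exact evaluation protocol may differ.

```python
import numpy as np

def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and 0/1 labels (lower is better)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_prob - y_true) ** 2))

def average_precision(y_true, y_score):
    """Step-wise estimate of the area under the precision-recall curve.

    Averages precision at the rank of each positive example.
    Assumes at least one positive label and no tied scores.
    """
    y_true = np.asarray(y_true, dtype=float)
    order = np.argsort(-np.asarray(y_score, dtype=float), kind="stable")
    hits = y_true[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, hits.size + 1)
    return float((precision_at_k * hits).sum() / hits.sum())

# Hypothetical labels (1 = answer was correct) and predicted confidences.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.3, 0.1]

print(brier_score(labels, scores))        # approximately 0.2875
print(average_precision(labels, scores))  # approximately 0.8333
```

Brier score penalizes miscalibrated probabilities directly, while AUPRC only depends on how well the scores rank correct answers above incorrect ones, which is why the article's figures report both.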
But the number that matters most for anyone actually deploying these models is the quantization result. Under 4-bit weight-only quantization, the kind of aggressive compression that is common in production inference, the method improves over probing by +1.94 AUPRC percentage points and +5.33 Brier points on average on Qwen3-14B. Probing degrades under quantization; the KL-divergence signatures do not. That is not a small thing. Quantization is how you make inference cheaper, and the comparison here suggests that activation-based methods such as probing can lose ground when you apply it.
Why does the method survive quantization when probing does not? The paper does not give a definitive answer, but the authors note that examining specific layer-layer interactions reveals differences in how disparate models encode uncertainty — suggesting the KL-divergence signatures are picking up something structural about how layers co-evolve rather than absolute activation magnitudes that get distorted by compression. That is a hypothesis worth testing in follow-up work.
Belinkov is a known quantity in the ML community — his work on robust NLP models and adversarial attacks has been widely cited. That this paper comes from Technion rather than one of the major labs is worth noting. It is the kind of contribution that could easily get lost in the benchmark churn of big-lab publications, but it is exactly the kind of low-profile research that production engineers actually use.
The honest limitation here is that this is still a paper result on static evaluations. The datasets are standard benchmarks; the models are evaluated in controlled settings. Real production uncertainty estimation has to deal with adversarial inputs, distributional shift that is not clean cross-dataset transfer, and the question of what to do when the model is uncertain — fallback, human escalation, abstention. The paper does not address any of that. It tells you how uncertain the model is; it does not tell you what your system should do with that signal.
That is the gap that will determine whether this work actually matters. An uncertainty score that nobody acts on is just a number. If the production tooling around this (decision thresholds, fallback pipelines, integration with human-in-the-loop workflows) gets built on top of this method, it could be a genuine piece of infrastructure for deploying LLMs in high-stakes settings. Right now it is a promising result in search of a system.
The paper is open access under CC BY-NC-ND 4.0 and available at arXiv:2603.22299.