A controlled study of GPT-2-style models finds that as layers deepen, smaller variants collapse their internal token vectors into a narrow cone, while larger models resist the effect.
Every token that flows through a Transformer language model is, at any given layer, a point in a high-dimensional vector space, a coordinate on an internal map the model is drawing in real time. New work from Chen Liu and colleagues (LM-Dispersion project page) shows that for small models, that map starts to collapse. Token vectors end up pointing in nearly the same direction, as if every word in a sentence were being funneled into the same narrow cone-shaped region of the space.
The team calls the effect "embedding condensation" and uses the rest of the paper to characterize when it happens, why it happens, and what a simple training fix does about it. For anyone who builds or compresses language models, the headline finding concerns the workhorse compression technique, knowledge distillation: a larger teacher's resistance to condensation does not pass through to the student. The student inherits the teacher's task accuracy, not its geometry.
The geometric intuition is straightforward. At the input layer, a model's token vectors are spread across a high-dimensional space. As the model processes a sentence, attention and feed-forward blocks transform those vectors layer by layer. In a healthy model the vectors remain well separated, with different tokens still occupying different directions. In a condensed model the vectors have rotated toward each other. Pairwise cosine similarity between token vectors, which measures how close their directions are, climbs toward one.
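The measurement described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names and the toy three-dimensional vectors are made up for the example and are not taken from the paper, which works with real hidden states of far higher dimension.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all distinct pairs of token vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Toy "token vectors" (hypothetical numbers, not model outputs).
spread = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
condensed = [[1.0, 0.1, 0.0], [1.0, 0.0, 0.1], [1.0, 0.1, 0.1]]

print(mean_pairwise_cosine(spread))     # orthogonal directions: 0.0
print(mean_pairwise_cosine(condensed))  # nearly parallel directions: close to 1
```

Applied to one layer's hidden states for every token in a sentence, a score near zero would indicate well-separated directions, and a score near one would indicate the cone.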
Liu's team showed that within a single model family, smaller variants condense more than larger ones. The effect holds across input datasets, so it is not a quirk of the evaluation corpus (LM-Dispersion project page, Figure 2). The team confirmed the pattern under strict confounder controls: when they trained GPT-2-style models that varied only in the hidden dimension of their feed-forward blocks, holding layers, embedding size, dataset, and training recipe fixed, the same scaling relationship emerged (LM-Dispersion project page, Figure 3). In that setup, model size alone, rather than depth, embedding width, data, or training recipe, accounts for the difference in collapse.
That matters because it puts a specific mechanism on the table rather than the generic observation that small models underperform. If the problem is geometric under-utilization of the representation space, the fix can target that mechanism.
One of the more useful findings for practitioners is timing. Condensation is not a slow degradation caused by the training process. The team observed the cone formation already at model initialization, before any data has been seen, and pre-training then partially reverses it rather than making it worse (LM-Dispersion project page, Figure 4). The phenomenon is closer to an architectural inductive bias, something about how parameters are initialized and how the layer geometry is wired, than a side effect of data or optimization.
It also suggests that a practitioner is unlikely to eliminate condensation simply by tweaking the dataset, the learning rate schedule, or the training horizon. The cone is in the room from the start.
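For intuition about how a cone can exist before any training, the toy below pushes random vectors through a stack of untrained layers and tracks their mean pairwise cosine similarity. It is a deliberately swapped-in setting: a plain stack of random ReLU layers, not a Transformer, and it does not reproduce the paper's experiments or its explanation. The width, depth, token count, and weight scale are all arbitrary assumptions chosen for the illustration.

```python
import math
import random
from itertools import combinations

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all distinct pairs of vectors."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norms
    pairs = list(combinations(vectors, 2))
    return sum(cos(u, v) for u, v in pairs) / len(pairs)

def random_relu_layer(vectors, width, rng):
    """One untrained layer: Gaussian weights (He-style scale) followed by ReLU."""
    fan_in = len(vectors[0])
    scale = math.sqrt(2.0 / fan_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(fan_in)] for _ in range(width)]
    return [[max(0.0, sum(w * x for w, x in zip(row, vec))) for row in weights]
            for vec in vectors]

rng = random.Random(0)                 # fixed seed so the run is repeatable
width, num_tokens, depth = 64, 16, 6   # arbitrary toy sizes
tokens = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(num_tokens)]

similarity_by_layer = [mean_pairwise_cosine(tokens)]
for _ in range(depth):
    tokens = random_relu_layer(tokens, width, rng)
    similarity_by_layer.append(mean_pairwise_cosine(tokens))

for layer, sim in enumerate(similarity_by_layer):
    print(f"layer {layer}: mean pairwise cosine = {sim:.3f}")
```

In this toy the similarity rises with depth even though no data has been seen, which is at least consistent with reading condensation as a property of initialization and wiring rather than of the training set.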
The cleanest negative result is Figure 5. Knowledge distillation trains a small "student" model to match a larger "teacher" model's outputs, and is the standard way to ship small models that retain most of a large one's accuracy. In a controlled distillation setup the student reached the teacher's downstream scores. The student did not reach the teacher's geometry. The cone was still there (LM-Dispersion project page, Figure 5).
The implication is direct: task-level fidelity and representation-level fidelity are separable, and output-matching on the training distribution does not bring the second along for free.
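A look at a typical distillation objective makes the separation easier to see. The sketch below assumes the common formulation, a KL divergence between temperature-softened teacher and student output distributions; the paper's exact distillation setup is not described here, so the function names, the temperature value, and the example logits are all illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened output distributions.

    Only output logits appear in this objective. The student's hidden
    states, and therefore their geometry, are never compared to the teacher's.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical next-token logits over a three-word vocabulary.
teacher = [2.0, 1.0, 0.1]
print(distillation_loss(teacher, [2.0, 1.0, 0.1]))  # identical outputs: 0.0
print(distillation_loss(teacher, [0.1, 1.0, 2.0]))  # mismatched outputs: positive
```

Because the loss is a function of the logits alone, two students with very different internal geometry can score identically on it, which fits the reported gap between matched task scores and unmatched geometry.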
The team's proposed countermeasure is a small additive term in the training objective, called dispersion loss, that explicitly penalizes token vectors for clustering in the same direction. They report that the addition improves generalization on small language models; the arXiv abstract page for 2602.00217 and the rendered HTML version detail the formulation and benchmark results. The Hacker News discussion (item 48780826) has already surfaced implementation questions from practitioners. The author's project page remains the most reader-accessible summary of the result.
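The general shape of such an additive term can be sketched as follows. This is not the paper's formulation, which should be read from the arXiv page; it assumes one plausible penalty, the log of the mean of exp(cosine / temperature) over token pairs, and the names `dispersion_penalty` and `total_loss`, the temperature, and the weight are hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def dispersion_penalty(hidden_states, temperature=0.5):
    """Hypothetical penalty that grows as token vectors align in direction."""
    sims = [cosine(u, v) for u, v in combinations(hidden_states, 2)]
    return math.log(sum(math.exp(s / temperature) for s in sims) / len(sims))

def total_loss(task_loss, hidden_states, weight=0.1):
    """Task loss plus a small weighted dispersion term (weight is an assumption)."""
    return task_loss + weight * dispersion_penalty(hidden_states)

# Toy hidden states (made-up numbers, not model outputs).
spread = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
condensed = [[1.0, 0.1, 0.0], [1.0, 0.0, 0.1], [1.0, 0.1, 0.1]]

print(total_loss(1.0, spread))     # orthogonal vectors add no penalty: 1.0
print(total_loss(1.0, condensed))  # condensed vectors raise the loss
```

In a real training loop the penalty would be computed on a layer's hidden states inside an autodiff framework so that its gradient pushes token directions apart; the pure-Python version here only shows how the value behaves.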
A few limits apply. The "condensation" label is new, and the effect has clear kinship to representation collapse, neural collapse, and dimensional collapse in deep networks more broadly. The "distillation fails to transfer" claim is a controlled finding, made by varying scale while holding everything else fixed, and should be read as a statement about that setting rather than a universal verdict on distillation. Co-authorship and institutional affiliation should be checked against the arXiv abstract page before any downstream piece asserts them.
The open question is whether the same geometric story holds beyond the GPT-2-style models the team measured, and whether dispersion loss remains a useful lever once a model is large enough that condensation is already mild. The next preprint revision should help close that gap.