The Efficiency Cliff: Why Every Major AI Lab May Be Training on the Wrong Algorithm


The short version: every major AI lab may be spending exponentially more on data than it needs to, because the dominant training method carries a sample-complexity penalty that had not been pinned down precisely until now.

A paper from Korchinski, Favero, and Wyart at EPFL, posted to arXiv on May 26, offers a precise diagnosis. The authors prove that token-level self-supervised learning (the method behind GPT, BERT, and every major diffusion model) needs far more training samples to learn hierarchical structure in data than an alternative called latent prediction, and that the gap grows exponentially with the depth of the hidden hierarchy.

Latent prediction, by contrast, achieves the same result with a sample complexity that does not grow with hierarchy depth. As the paper shows, token-level SSL requires O(v·m^(L+1)) samples while latent prediction needs O(v·m^3), both up to logarithmic factors. Here L is the hierarchy depth; in the usual Random Hierarchy Model notation, v is the vocabulary size and m is the number of production rules per symbol. In the hard regime, the ratio between the two bounds is m^(L-2), which is exponential in L.
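To make the two bounds concrete, the short Python sketch below evaluates both expressions and their ratio at a few depths. It drops constants and logarithmic factors, and the values chosen for v and m are illustrative assumptions, not figures from the paper.

```python
def token_level_samples(v: int, m: int, depth: int) -> int:
    """Leading-order bound for token-level SSL: v * m^(L+1)."""
    return v * m ** (depth + 1)


def latent_prediction_samples(v: int, m: int) -> int:
    """Leading-order bound for latent prediction: v * m^3, independent of depth."""
    return v * m ** 3


# Illustrative values only (assumed, not taken from the paper).
v, m = 32, 8
for depth in (3, 5, 8):
    token = token_level_samples(v, m, depth)
    latent = latent_prediction_samples(v, m)
    # The ratio simplifies to m^(L-2), so it grows exponentially with depth.
    print(f"L={depth}: token-level ~{token:,}  latent ~{latent:,}  ratio = {token // latent:,}")
```

With these assumed values the ratio is 8 at depth 3, 512 at depth 5, and 262,144 at depth 8, while the latent-prediction bound stays fixed.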

The implication, if the theory holds, is stark. A lab training a next-token model on a trillion tokens is not just spending more than necessary: the overhead is exponential in the depth of the structure being learned, so it widens as the model reaches deeper structure in the data.

The mechanism is intuitive once you see it. Token-level prediction asks the model to work backward from surface forms to the hidden causes that generated them. Each step up the latent tree weakens the learning signal. Latent prediction changes the target: instead of predicting a raw token, the model predicts a representation produced by the model itself. Once useful low-level abstractions form, they become the learning signal for higher-level structure. The hierarchy climbs itself.
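The difference between the two targets can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the paper's setup: the `encode` function is a hypothetical stand-in for a real network, the teacher is an exponential-moving-average copy of the student (the arrangement used in data2vec-style training), and no gradient step is shown. It only shows what each objective is asked to predict.

```python
import copy
import random


def encode(emb, tokens):
    """Toy encoder (hypothetical): the representation at position i is the
    mean embedding of the tokens at positions i-1, i and i+1."""
    dim = len(emb[0])
    reps = []
    for i in range(len(tokens)):
        window = [tokens[j] for j in (i - 1, i, i + 1) if 0 <= j < len(tokens)]
        reps.append([sum(emb[t][d] for t in window) / len(window) for d in range(dim)])
    return reps


def token_target(tokens, pos):
    """Token-level SSL: the target is the raw surface token itself."""
    return tokens[pos]


def latent_loss(student, teacher, tokens, pos, mask_id):
    """Latent prediction: the student sees the masked input and regresses the
    teacher's representation of the clean input at the masked position."""
    masked = list(tokens)
    masked[pos] = mask_id
    target = encode(teacher, tokens)[pos]
    prediction = encode(student, masked)[pos]
    return sum((p - t) ** 2 for p, t in zip(prediction, target))


def ema_update(teacher, student, decay):
    """Move the teacher toward the student, so the targets track what the
    model has learned so far."""
    for t_row, s_row in zip(teacher, student):
        for d in range(len(t_row)):
            t_row[d] = decay * t_row[d] + (1 - decay) * s_row[d]


rng = random.Random(0)
vocab_size, dim = 5, 3
mask_id = vocab_size  # extra embedding row reserved for the mask symbol
student = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(vocab_size + 1)]
teacher = copy.deepcopy(student)  # the teacher starts as a copy of the student

tokens = [0, 3, 1, 4, 2]
print("token-level target:", token_target(tokens, 2))
print("latent-prediction loss:", latent_loss(student, teacher, tokens, 2, mask_id))
ema_update(teacher, student, decay=0.99)
```

The point of the contrast is that `token_target` never changes as training proceeds, whereas the latent target is recomputed from the model's own (teacher) representations, which is what lets learned abstractions feed back into the learning signal.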

Meta's V-JEPA 2, a 1.2-billion-parameter world model trained on video, is built on exactly this principle. The paper's most surprising finding: data2vec implicitly performs the same kind of hierarchical latent prediction, giving it the same exponential efficiency advantage without anyone labeling it as such.

The paper goes further: it argues that explicit hierarchical stacking, of the kind used in H-JEPA, may be largely redundant, because the latent-prediction objective itself induces hierarchical structure. On that view, a stacked architecture would repeat work the objective already does. That is the claim likely to draw the most scrutiny from practitioners building real systems.

Yann LeCun, who oversees AI research at Meta, endorsed the result on Twitter. Given that V-JEPA is central to Meta's stated path toward machine intelligence, that co-sign carries weight.

The caveat that matters most: the theory is proven in a controlled grammar setting, not in natural language or images. The Random Hierarchy Model is a clean mathematical object. Whether the exponential gap survives contact with real data — with its noise, its exceptions, its contextual ambiguities — remains an open empirical question. If natural language does not exhibit the hierarchical statistical structure the theory assumes, the exponential advantage evaporates.
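For readers unfamiliar with the setting, here is a minimal Python sketch of a Random Hierarchy Model sampler, assuming the usual construction: every symbol at each level has m production rules, each rule expands into a tuple of s lower-level symbols, and no tuple is shared between two symbols at the same level. The branching factor `s`, the function names, and the parameter values are assumptions introduced for illustration; the article itself does not specify them.

```python
import random


def make_rules(v, m, s, depth, rng):
    """Build one rule table per level.
    rules[level][symbol] is a list of m tuples, each of s lower-level symbols."""
    assert v * m <= v ** s, "not enough distinct tuples for unambiguous rules"
    rules = []
    for _ in range(depth):
        # Draw v*m distinct tuples so that no tuple belongs to two symbols.
        tuples = set()
        while len(tuples) < v * m:
            tuples.add(tuple(rng.randrange(v) for _ in range(s)))
        pool = sorted(tuples)
        rng.shuffle(pool)
        rules.append([pool[i * m:(i + 1) * m] for i in range(v)])
    return rules


def sample_sequence(rules, v, rng):
    """Pick a root symbol (the class label), then expand it level by level,
    choosing one of the m rules uniformly at each step."""
    label = rng.randrange(v)
    symbols = [label]
    for level_rules in rules:
        expanded = []
        for sym in symbols:
            expanded.extend(rng.choice(level_rules[sym]))
        symbols = expanded
    return label, symbols


rng = random.Random(0)
v, m, s, depth = 4, 2, 2, 3  # illustrative values only
rules = make_rules(v, m, s, depth, rng)
label, tokens = sample_sequence(rules, v, rng)
print("hidden root label:", label)
print("observed tokens:", tokens)  # length is s ** depth
```

Only the final token sequence is visible to a learner; the root label and every intermediate symbol are latent, which is the hidden hierarchy the theory is about. Real text or images offer no guarantee of such clean, unambiguous rules.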

But the cost of that uncertainty falls unevenly. The labs running token-level SSL at scale — OpenAI, Google, Meta, Mistral, Anthropic — are the ones paying the exponential tax if the theory holds. The builders working on world models and latent prediction are the ones with the exit ramp.

