An Exact Decomposition for Why Neural Network Curvature Varies by Layer Type

A single-author preprint turns an empirical puzzle (why the Hessian power-law exponent α sits near 2 in convolutions, near 1 in attention, and below 1 in MLP up-projections) into a tractable geometric question. It also ships an architecture-adaptive optimizer that exploits the answer on vision benchmarks, and it states the limits of that result plainly.

The empirical pattern

Across modern neural networks, the eigenvalues of the loss Hessian scale with gradient singular values as a power law, h_k ∝ σ_k^α. The exponent α is not universal. The author documents it at roughly 2 for convolutions, roughly 1 for transformer attention heads, and below 1 for MLP up-projections, across 93 layers, five architectures, and three datasets. On Tiny-ImageNet the 49 convolution layers land at a median α of 1.93; across 14 ImageNet-1K conv layers the power law fits with R² ≥ 0.97; across the 21 layers of ResNet-18 the median R² is 0.98. Those numbers are confirmed by the paper's own bundled claim-checker, verify_claims.py, which cross-references 27 quantitative claims against 15 frozen JSON result files (Anherutowa Calvo, arXiv:2606.02596; 9D Labs reproducibility repo).
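To make the fitting step concrete, here is a minimal sketch of how an exponent α and its R² could be estimated from one layer's spectra by ordinary least squares in log-log space. The spectrum is synthetic, the function name fit_power_law is hypothetical, and nothing here is taken from the paper's code or from verify_claims.py.

```python
import numpy as np

def fit_power_law(sigma, h):
    """Least-squares fit of log h = alpha * log sigma + c. Returns (alpha, r2)."""
    x = np.log(np.asarray(sigma, dtype=float))
    y = np.log(np.asarray(h, dtype=float))
    alpha, intercept = np.polyfit(x, y, 1)  # degree-1 fit: [slope, intercept]
    residual = y - (alpha * x + intercept)
    r2 = 1.0 - residual.var() / y.var()
    return float(alpha), float(r2)

# Synthetic layer (illustrative only): h_k = sigma_k^2 with mild log-normal noise.
rng = np.random.default_rng(0)
sigma = np.arange(1, 65, dtype=float) ** -1.0
h = sigma ** 2.0 * np.exp(rng.normal(0.0, 0.05, size=sigma.size))

alpha_hat, r2 = fit_power_law(sigma, h)
print(f"alpha = {alpha_hat:.3f}, R^2 = {r2:.4f}")
```

On real layers, sigma and h would come from the measured gradient singular values and the matching Hessian eigenvalues; how the paper pairs them is not described in this article.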

Until now, the field has mostly catalogued that pattern. A preprint from Anherutowa Calvo at 9D Labs (arXiv:2606.02596, submitted 22 May 2026; 13 pages, 6 figures, 3 tables) is the first to give it an exact cause.

The one-line decomposition

The whole story, in one line, is

α = 2 + d log Φ_k / d log σ_k,

where Φ_k measures how aligned the Kronecker-factor eigenbases of a layer's curvature are with the gradient's singular directions. In plain language: every layer carries a Kronecker-factored curvature. The gradient's singular vectors pick out a preferred set of directions in that geometry. Φ_k is the cosine-style similarity between those two sets of directions, expressed as a function of σ_k. The derivative of its log with respect to log σ_k is the only thing standing between a layer and the canonical α = 2 power law. The decomposition is Theorem 1 in the paper; the worked examples for LayerNorm, residual connections, and softmax heads are in Section 6 (Anherutowa Calvo, arXiv:2606.02596).

The payoff is that the geometry has closed-form answers for the architectural features that matter. LayerNorm, residual connections, and softmax heads all produce specific, derivable values of d log Φ_k / d log σ_k, which is why the empirical α has the values it has.
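A quick numerical reading of the decomposition: if the alignment behaves locally like Φ_k ∝ σ_k^β, the formula collapses to α = 2 + β, so α near 2 corresponds to alignment that does not depend on σ, and α near 1 to alignment that falls off roughly as 1/σ. The sketch below estimates the log-derivative by finite differences. The two Φ profiles are made up for illustration and are not the paper's closed forms; local_alpha is a hypothetical helper name.

```python
import numpy as np

def local_alpha(sigma, phi):
    """Local exponent alpha_k = 2 + d log(phi_k) / d log(sigma_k), by finite differences."""
    log_sigma = np.log(np.asarray(sigma, dtype=float))
    log_phi = np.log(np.asarray(phi, dtype=float))
    # np.gradient accepts the sample coordinates as its second argument.
    return 2.0 + np.gradient(log_phi, log_sigma)

sigma = np.logspace(-3, 0, 50)

# Toy alignment profiles (assumptions, not derived from any architecture):
alpha_flat = local_alpha(sigma, np.ones_like(sigma))  # phi independent of sigma
alpha_inverse = local_alpha(sigma, 1.0 / sigma)       # phi proportional to 1/sigma

print(np.median(alpha_flat), np.median(alpha_inverse))
```

In this toy setup the flat profile gives α = 2 everywhere and the 1/σ profile gives α = 1, which is just the arithmetic of the one-line formula.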

The s = αγ identity and what the error figures actually mean

The decomposition implies a no-free-parameter algebraic identity, s = αγ, that ties three separately measured quantities together: the curvature exponent α, the gradient rank-decay exponent γ, and the Hessian decay exponent s. Plug in α and γ measured from one set of layers, and the identity predicts s for those same layers.
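The algebra behind the identity is short under one reading of the exponent names, which this article does not spell out and is therefore an assumption here: σ_k ∝ k^(-γ) for the gradient spectrum and h_k ∝ k^(-s) for the Hessian spectrum. Substituting the first into h_k ∝ σ_k^α gives h_k ∝ k^(-αγ), hence s = αγ. The sketch below checks that on synthetic spectra, fitting each exponent independently; the values 0.8 and 1.9 are arbitrary.

```python
import numpy as np

def loglog_slope(x, y):
    """Slope of the least-squares line through (log x, log y)."""
    return float(np.polyfit(np.log(x), np.log(y), 1)[0])

rng = np.random.default_rng(1)
k = np.arange(1, 201, dtype=float)   # rank index
gamma_true, alpha_true = 0.8, 1.9    # arbitrary illustrative exponents

# Synthetic spectra with small multiplicative noise (assumed power-law forms).
sigma = k ** -gamma_true * np.exp(rng.normal(0.0, 0.03, size=k.size))
h = sigma ** alpha_true * np.exp(rng.normal(0.0, 0.03, size=k.size))

gamma_hat = -loglog_slope(k, sigma)  # sigma_k ~ k^-gamma
alpha_hat = loglog_slope(sigma, h)   # h_k ~ sigma_k^alpha
s_hat = -loglog_slope(k, h)          # h_k ~ k^-s

rel_err = abs(alpha_hat * gamma_hat - s_hat) / s_hat
print(f"alpha*gamma = {alpha_hat * gamma_hat:.3f}, s = {s_hat:.3f}, rel. error = {rel_err:.2%}")
```

With noise in the spectra the three fitted exponents no longer satisfy the identity exactly, which is one way a small but nonzero error can arise even when the relation itself is algebraic.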

The paper reports that the identity recovers s to a median error of 1.9% on CIFAR, 1.0% on Tiny-ImageNet, and 1.6% on ImageNet-1K, across 93 layers, five architectures, and three datasets, with no fitted parameters. Those numbers are the output of verify_claims.py, which tests each claim against frozen JSON and exits non-zero on any mismatch (Anherutowa Calvo, arXiv:2606.02596; 9D Labs verify_claims.py). The identity is exact by construction; the error numbers are the empirical agreement between the prediction and the data. The zeta-function bound on the participation ratio (PR_h ≤ ζ(αγ)²/ζ(2αγ) ≈ 2.5 at αγ = 2) is a separate argument showing why a single exponent is a sensible summary in the first place: despite hundreds of singular directions, curvature concentrates onto roughly one to two effective directions per layer.
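The 2.5 figure can be checked by hand: ζ(2) = π²/6 and ζ(4) = π⁴/90, so ζ(2)²/ζ(4) = 90/36 = 2.5. The sketch below confirms that and compares it with the participation ratio of a finite spectrum, assuming the usual definition PR = (Σ h_k)² / Σ h_k² and an idealized spectrum h_k = k^(-2); both are assumptions for illustration, since this article does not give the paper's exact definitions.

```python
import numpy as np

def participation_ratio(h):
    """(sum h)^2 / sum(h^2): an effective count of directions carrying the spectrum."""
    h = np.asarray(h, dtype=float)
    return float(h.sum() ** 2 / (h ** 2).sum())

# Closed forms for the Riemann zeta function at 2 and 4.
zeta_2 = np.pi ** 2 / 6.0
zeta_4 = np.pi ** 4 / 90.0
bound = zeta_2 ** 2 / zeta_4  # the alpha*gamma = 2 case of the bound

# Idealized spectrum with 512 directions decaying as k^-2.
k = np.arange(1, 513, dtype=float)
pr = participation_ratio(k ** -2.0)

print(f"bound = {bound:.4f}, PR over 512 directions = {pr:.4f}")
```

Even with hundreds of directions in the sum, the participation ratio stays below the zeta limit, which is the sense in which a steep power law concentrates curvature.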

Spectral Newton as a proof of concept

The decomposition is constructive. Plug the layer's α and σ_k into an architecture-adaptive preconditioner T(σ; α) = σ/(σ^α + d), implement it in the gradient's singular basis, and you get Spectral Newton. Where α sits near 2, the convolution-heavy vision regime, the author reports that Spectral Newton outperforms AdamW on vision benchmarks (Section 7.3 / Table 3 of the paper; Anherutowa Calvo, arXiv:2606.02596).
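One plausible way to read "implement it in the gradient's singular basis" is to take the SVD of a layer's gradient matrix, apply T to each singular value, and rebuild the matrix. The sketch below does exactly that and nothing more. It is not the author's implementation: the function name spectral_precondition is hypothetical, d is treated as a small positive damping constant (the article does not define it), and learning rate, momentum, and per-layer α selection are all left out.

```python
import numpy as np

def spectral_precondition(grad, alpha, d=1e-3):
    """Apply T(sigma; alpha) = sigma / (sigma**alpha + d) to the singular values of grad.

    grad  : 2-D gradient matrix for one layer
    alpha : curvature exponent assumed for that layer
    d     : assumed damping constant (not defined in the article)
    """
    u, sigma, vt = np.linalg.svd(grad, full_matrices=False)
    t = sigma / (sigma ** alpha + d)
    # Scale the columns of u by t, then map back with the right singular vectors.
    return (u * t) @ vt

rng = np.random.default_rng(0)
grad = rng.normal(size=(6, 4))                       # stand-in gradient
step = spectral_precondition(grad, alpha=2.0, d=1e-3)
print(step.shape)
```

With α = 2 and small d, T(σ) is close to 1/σ, so directions with small singular values are amplified and large ones damped. A full SVD per layer per step is expensive; the article does not say how the paper handles that cost.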

That is the proof-of-concept beat, not the headline. The point of the preconditioner is not the leaderboard; it is that curvature is now something an optimizer designer can read off the architecture and exploit directly, rather than treat as a black-box empirical knob.

What is still open

The constructive frame survives only if the limits are named. They are real. The work is single-author, from a small lab (9D Labs), and is a preprint on arXiv: it has not been peer reviewed and has not been independently replicated. The 1.9%/1.0%/1.6% median errors are encouraging empirical evidence for the identity, not a mathematical proof. The Spectral Newton win is shown on vision benchmarks in the α ≈ 2 regime; it has not been compared head-to-head with Muon, SOAP, or Shampoo at frontier scale, and the result is not demonstrated for language models. Φ_k is a geometric quantity whose estimation cost at frontier scale is not analyzed in the paper. The author flags the boundary of the claim; the boundary is the claim.

Curvature as something you can design

The reason the story is worth covering even if Spectral Newton is overtaken next month is structural. The empirical pattern of α varying by layer type is a fact the field has carried for years; the new contribution is that it now has a one-line geometric cause, with closed-form answers for the architectural features that determine it, and a usable artifact at the end. Curvature stops being a quantity to be discovered and starts being a quantity to be designed against.

That is the constructive read. Whether the optimizer holds, the geometry does.
