When you train a neural network with Adam or SGD, you are averaging. Every data point contributes a gradient; those contributions are pooled into a single average vector, and the network updates along it. It is a blunt instrument that discards information about what each individual training example actually wants.
Sven, a new optimizer from researchers at MIT and Oxford, does not average. It treats each data point as a separate constraint and asks a single question: what parameter update would satisfy all of them at once? The answer comes from the Moore-Penrose pseudoinverse of the per-sample loss Jacobian, computed efficiently via a truncated singular value decomposition. The result is an optimizer that is aware of the geometry of the loss landscape in a way that standard gradient methods are not.
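To make the idea concrete, here is a minimal NumPy sketch of the pseudoinverse step, written as an illustration rather than the authors' actual code: given a per-sample Jacobian J of shape (B, P) and a vector r of per-sample residuals, solve J @ delta ≈ -r for all B constraints at once via a rank-k truncated SVD. The function name and the toy linear-regression setup are my own.

```python
import numpy as np

def pseudoinverse_step(J, r, k, lr=1.0):
    """Solve J @ delta ~= -r for all samples at once via a rank-k SVD.

    J: (B, P) per-sample Jacobian, r: (B,) per-sample residuals,
    k: number of retained singular values (the truncation rank).
    """
    U, s, Vt = np.linalg.svd(J, full_matrices=False)
    U, s, Vt = U[:, :k], s[:k], Vt[:k]       # keep only the top-k components
    # Moore-Penrose pseudoinverse applied to -r: delta = -V diag(1/s) U^T r
    delta = -Vt.T @ ((U.T @ r) / s)
    return lr * delta

# Toy over-parameterized linear regression: f(x; theta) = x @ theta
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 20))                 # B=8 samples, P=20 parameters
theta_true = rng.normal(size=20)
y = X @ theta_true

theta = np.zeros(20)
residuals = X @ theta - y                    # per-sample errors
J = X                                        # Jacobian of residuals w.r.t. theta
theta = theta + pseudoinverse_step(J, residuals, k=8)
print(np.abs(X @ theta - y).max())           # all 8 constraints satisfied at once
```

With k equal to the batch size and a generic full-rank Jacobian, one step drives every per-sample residual to zero (up to machine precision), which is exactly the "satisfy all constraints at once" behavior an averaged gradient step cannot deliver.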
The paper, posted to arXiv on April 1 as MIT-CTP/6022, shows Sven outperforming Adam on regression tasks, converging faster and to lower final loss, while remaining competitive with L-BFGS at a fraction of the wall-time cost. The computational overhead over SGD is only a factor of k, where k is the number of retained singular values; traditional natural gradient methods, by contrast, carry costs that scale with the square of the number of parameters.
Standard natural gradient descent attempts something conceptually similar: it respects the loss geometry by updating parameters in the space of the function itself rather than in raw parameter coordinates. But it requires forming and inverting the Fisher information matrix, whose size alone scales as O(P^2) for P parameters. For a model with 10 million parameters, that is a 100 trillion-element matrix, before the inversion even begins. Sven recasts the problem as per-sample rather than per-model, making the scaling tractable.
The memory question is the one that will determine whether this matters. The per-sample Jacobian has shape (B, P) where B is batch size and P is the number of parameters, so memory scales as O(B * P). A batch of 32 with a 10 million parameter model means storing well over a gigabyte for the Jacobian alone, on top of the model weights themselves. The authors call this the primary scaling challenge and propose mitigation strategies including random parameter subsets and microbatching, though both require modifications to standard autograd tooling that do not yet exist. The paper is honest about this: it focuses on small-scale regression experiments and explicitly calls scaling "an important direction for future work."
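The arithmetic behind that "well over a gigabyte" figure is quick to check (assuming float32 storage):

```python
# Memory for the per-sample Jacobian of shape (B, P), float32 assumed
B = 32             # batch size
P = 10_000_000     # parameter count
bytes_per_float = 4

jacobian_gib = B * P * bytes_per_float / 2**30
print(f"{jacobian_gib:.2f} GiB")   # roughly 1.2 GiB for the Jacobian alone
```

And that is per step, on top of the model weights, activations, and optimizer state that training already requires.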
That honesty is part of why this is worth reading. The authors are not claiming Sven replaces Adam for large language model training. They are pointing at a structural difference in how optimization can work and showing it delivers real gains on problems where the loss decomposes naturally into individual constraints.
The approach has already left the lab. A companion paper, MIT-CTP/6023, posted the same day, describes using Sven for a numerical modular bootstrap problem in theoretical physics, where standard gradient descent fails because the loss landscape has exponential hierarchies that SGD cannot navigate. Sven found better conformal field theories by traversing multiple parameter directions at once, as determined by the singular values of the Jacobian.
In the over-parameterized regime, Sven can achieve exponential loss decay rather than the power-law behavior typical of first-order methods, according to the authors' GitHub.
The authors are Samuel Bright-Thonney and Thomas R. Harvey of MIT, Andre Lukas of Oxford, and Jesse Thaler of MIT. Thaler, a physicist who has worked extensively on jet physics and on AI applications in fundamental physics, is also a co-author of the modular bootstrap paper.
The broader point is that natural gradient methods have been theoretically attractive for decades but practically inaccessible at scale. Sven does not solve the scaling problem entirely, but it reframes it in a way that separates the computational cost from the memory cost, identifies the memory problem as the actual bottleneck, and gives the community a concrete target. Whether that target gets hit determines whether this is a useful tool for scientific computing or something that changes how neural networks are trained more broadly.
The code is on GitHub, implemented as a PyTorch optimizer with a model wrapper that converts standard modules to a functional form for per-sample Jacobian computation. Dropping it into an existing training loop takes two changes: wrap the model with SvenWrapper, and call loss_and_grad instead of loss.backward(). A worked example comparing Sven to Adam on 1D regression is included.