A team spanning NVIDIA, the University of Oxford, and Mila (the Quebec AI Institute) has published what may be the most credible evidence yet that backpropagation, the algorithm that has powered nearly every major AI system since the 1980s, is not architecturally required for frontier-scale training. Their paper (arXiv 2511.16652, submitted November 2025 and revised February 2026) describes a system called EGGROLL that trains billion-parameter language models using evolution strategies and low-rank matrix perturbations, reaching what the researchers describe as "up to 91 percent of the throughput of pure batch inference." The implication, stated plainly in the work: you can train large neural networks without the gradient computations that most researchers treat as foundational.
The paper's authors include Jakob Foerster and Shimon Whiteson from Oxford's machine learning group, Aaron Courville from MILA's fundamental AI research axis, and NVIDIA researchers including Bidipta Sarkar and Mattie Fellows, per the arXiv author list. That lineup is why this is getting attention instead of disappearing into the arXiv pile.
The core technical problem EGGROLL solves is a scaling wall. Evolution strategies, which optimize by sampling a population of randomly perturbed copies of a network's weights, scoring each copy, and updating toward the best performers, scale terribly with model size. Naive ES at the population sizes needed for billion-parameter models requires roughly 180 times more GPU-hours than a standard backpropagation baseline, according to the researchers' calculations. That computational cost makes naive ES nonviable for anything close to frontier scale.
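To see where the cost comes from, here is a minimal sketch of one naive ES step in NumPy. This is generic (OpenAI-style) evolution strategies, not the paper's method: the toy `loss` function, the hyperparameters, and the standardized-advantage weighting are all illustrative choices, not details from EGGROLL. Note that each population member draws a dense noise matrix the size of the full parameter vector, which is exactly the memory and compute blowup that makes naive ES intractable at billions of parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(w):
    # Toy objective standing in for a language-model loss:
    # squared distance to a fixed target vector.
    target = np.linspace(-1.0, 1.0, w.size)
    return float(np.mean((w - target) ** 2))

def naive_es_step(w, pop_size=64, sigma=0.1, lr=0.05):
    """One step of simple evolution strategies: sample dense Gaussian
    perturbations, score each perturbed copy of the weights, and move
    along the fitness-weighted average of the noise. No gradients are
    ever computed through the loss."""
    eps = rng.standard_normal((pop_size, w.size))  # dense noise: O(pop * params)
    scores = np.array([loss(w + sigma * e) for e in eps])
    # Lower loss is better, so weight each sample by its negative
    # standardized score (its "advantage" over the population mean).
    adv = -(scores - scores.mean()) / (scores.std() + 1e-8)
    update = (adv @ eps) / (pop_size * sigma)
    return w + lr * update

w = np.zeros(32)
for _ in range(200):
    w = naive_es_step(w)
```

The `eps` array alone is `pop_size * params` floats per step; at a population of hundreds of thousands and billions of parameters, that dense noise is the wall the paper's low-rank trick is designed to break through.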
EGGROLL's contribution is a low-rank perturbation trick that cuts the cost dramatically. By structuring each population member's weight perturbation as a rank-r matrix rather than a dense update, the system reduces both the auxiliary storage per layer and the per-member forward-pass overhead from O(mn) to O(r(m+n)), the paper reports. The researchers report a hundredfold increase in training speed for billion-parameter models compared to naive evolution strategies at large population sizes, and they show the low-rank update converges to the full-rank ES update at an O(1/r) rate, meaning the approximation improves as the rank grows without a proportional blowup in compute.
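The arithmetic behind the O(mn) → O(r(m+n)) claim can be verified in a few lines. In this sketch, which is an illustration of the general low-rank idea rather than the paper's exact parameterization, a rank-r perturbation to an m-by-n weight matrix is stored as two thin factors A and B, and the perturbed forward pass is computed as two skinny matrix-vector products without ever materializing the dense perturbation:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 512, 512, 4                           # layer dims and rank (illustrative)
W = rng.standard_normal((m, n)) / np.sqrt(n)    # base weights, shared by the population
x = rng.standard_normal(n)                      # one input activation vector
sigma = 0.01                                    # perturbation scale

# A dense perturbation would need m*n values per population member.
# A rank-r perturbation stores only two thin factors: r*(m+n) values.
A = rng.standard_normal((m, r)) / np.sqrt(r)
B = rng.standard_normal((n, r))

base = W @ x                        # computed once, shared across members
delta = sigma * (A @ (B.T @ x))     # two skinny matmuls: O(r*(m+n)) per member
y_lowrank = base + delta

# Sanity check against explicitly building the dense perturbed matrix.
y_dense = (W + sigma * (A @ B.T)) @ x
assert np.allclose(y_lowrank, y_dense)
```

At r = 4 and m = n = 512, the factors hold 4,096 values versus 262,144 for a dense perturbation, a 64x reduction per member, and the gap widens as the layer grows.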
The numbers that matter: EGGROLL scaled a population to 262,144 individuals on a single GPU, EmergentMind reported in its summary of the work. It pretrained nonlinear recurrent language models operating entirely in integer datatypes, no gradients, no floating point, and achieved a best test loss of 3.40 bits/byte on a character-level dataset, according to the arXiv paper. For post-training reasoning tasks, the researchers found EGGROLL competitive with GRPO, a commonly used reinforcement learning baseline. These results do not beat state-of-the-art backprop-trained models; they demonstrate that gradient-free training is viable at a scale most researchers had assumed required backpropagation.
The 91 percent throughput figure is where the eyebrow goes up. If training actually runs at near-inference speeds, the hardware implications are significant — gradient computation is a major bottleneck in standard training, and removing it changes what the chip needs to do. Whether that number holds across different model architectures and hardware setups is the key open question.
What makes this worth writing about is not the benchmark but the assumption it challenges. Backpropagation is so foundational to modern AI that most researchers treat it as architecturally inseparable from learning in deep networks. EGGROLL does not disprove that; it shows you can train large networks another way, with different tradeoffs, using different mathematics. Whether that generalizes to transformer architectures at full scale, whether the integer-datatype constraint limits expressiveness, and whether the evolution strategy approach can match gradient methods on complex reasoning tasks all remain open. The paper is a preprint, not a peer-reviewed result.
Still. When a result forces researchers to update a belief they treated as a law of nature, it matters — not as a replacement for backprop but as proof the space is larger than assumed. The next thing to watch is whether anyone replicates the scaling behavior on transformers specifically. If the low-rank ES approach works on the architecture that runs every major language model, the assumption quietly becomes a choice.