A Quantum Benchmark Built to Withstand a Classical Impostor
A new benchmark for shallow NISQ circuits uses a nonlinear cross-entropy score and a heavy-output classifier to keep classical spoofers from faking a quantum result.
Quantum computers keep failing the tests the field gives them, and for once the problem is not the hardware. A new benchmark, posted to arXiv as 2605.22909, argues that the test itself has been the soft spot, and proposes a replacement: a nonlinear cross-entropy score that holds up against the best classical spoofers a research group can throw at it.
The target is a regime quantum experimentalists care about deeply. Shallow-depth all-to-all random quantum circuits are one of the leading frameworks for showing that noisy intermediate-scale quantum (NISQ) hardware is doing something a classical machine plausibly cannot. Sampling from those circuits is widely believed to be hard for classical computers at short depth, which is what makes them useful as a proving ground. The trouble is that the existing test, linear cross-entropy benchmarking, has been a disappointment in practice. An adversarial classical spoofer can fake a strong linear cross-entropy score by exploiting the very noise the benchmark is supposed to tolerate.
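For readers who have not met it, the linear cross-entropy score in its standard form is 2^n times the average ideal probability of the sampled bitstrings, minus one, where n is the number of qubits. A faithful sampler scores near 1 and a uniform random guesser scores near 0. The Python sketch below illustrates that standard definition only. It is not taken from the paper, and the random state it draws is an assumed stand-in for a circuit's ideal output distribution, chosen because it is easy to generate. The function names are hypothetical.

```python
import numpy as np


def random_state_probs(n_qubits, rng):
    """Output probabilities of a random state.

    Assumption: this stands in for the ideal output distribution of a
    random circuit; it is not the shallow all-to-all ensemble itself.
    """
    dim = 2 ** n_qubits
    amps = rng.normal(size=dim) + 1j * rng.normal(size=dim)
    probs = np.abs(amps) ** 2
    return probs / probs.sum()


def linear_xeb(samples, ideal_probs):
    """Linear cross-entropy score: 2^n * mean(p_ideal(sample)) - 1."""
    dim = ideal_probs.size
    return float(dim * ideal_probs[samples].mean() - 1.0)


rng = np.random.default_rng(0)
probs = random_state_probs(12, rng)
n_samples = 20_000

# A sampler that draws from the ideal distribution itself.
ideal_samples = rng.choice(probs.size, size=n_samples, p=probs)
# A sampler that ignores the circuit and guesses uniformly.
uniform_samples = rng.integers(probs.size, size=n_samples)

print("ideal sampler score:  ", linear_xeb(ideal_samples, probs))
print("uniform sampler score:", linear_xeb(uniform_samples, probs))
```

Because the score is a plain average of ideal probabilities, any sampler that manages to favor a few high-probability bitstrings can push it up, which is the opening a spoofer looks for.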
The new paper, titled "Sample-efficient benchmarking of shallow all-to-all random quantum circuits," attacks that gap directly. The authors propose a binary classifier built on heavy output generation, the share of bitstrings that occur more often than the uniform random baseline. Under the right construction, that classifier can be trained to separate genuine quantum hardware from a state-of-the-art classical spoofer with sample complexity that scales only logarithmically with system size at short depth. In other words, the number of samples needed to tell the two apart grows only slowly as the device gets bigger, which is what makes the test practical to run on real hardware rather than just to analyze on paper.
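To make the heavy-output idea concrete, here is a minimal Python sketch of a threshold classifier on the heavy-output fraction as described above: the share of samples whose ideal probability exceeds the uniform value 1/2^n. It is an illustration under stated assumptions, not the paper's construction. The random state is an assumed stand-in for the circuit's ideal distribution, the uniform guesser is a far weaker opponent than the spoofers the paper considers, and the function names and the 0.55 threshold are hypothetical.

```python
import numpy as np


def random_state_probs(n_qubits, rng):
    """Output probabilities of a random state (assumed stand-in for a circuit)."""
    dim = 2 ** n_qubits
    amps = rng.normal(size=dim) + 1j * rng.normal(size=dim)
    probs = np.abs(amps) ** 2
    return probs / probs.sum()


def heavy_output_fraction(samples, ideal_probs):
    """Share of samples whose ideal probability beats the uniform value 1/2^n."""
    uniform_baseline = 1.0 / ideal_probs.size
    return float(np.mean(ideal_probs[samples] > uniform_baseline))


def looks_quantum(samples, ideal_probs, threshold=0.55):
    """Hypothetical binary decision: True if the heavy fraction clears the threshold.

    For exponentially distributed (Porter-Thomas) probabilities the heavy
    fraction is about 2/e for an ideal sampler and about 1/e for a uniform
    guesser, so 0.55 sits roughly midway between the two.
    """
    return heavy_output_fraction(samples, ideal_probs) > threshold


rng = np.random.default_rng(0)
probs = random_state_probs(12, rng)
n_samples = 20_000

ideal_samples = rng.choice(probs.size, size=n_samples, p=probs)
uniform_samples = rng.integers(probs.size, size=n_samples)

print("ideal sampler flagged as quantum:  ", looks_quantum(ideal_samples, probs))
print("uniform sampler flagged as quantum:", looks_quantum(uniform_samples, probs))
```

The decision depends only on a single fraction estimated from samples, so how many samples are needed comes down to how wide the gap is between the honest device and the impostor. The paper's claim is about how that requirement scales against a much stronger adversary than the uniform guesser used here.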
The evidence base for the claim is split in two. The authors derive exact analytic expressions for all-to-all Brownian circuit ensembles, using replica-trick methods, and back the result with numerical simulations of discrete Haar-random unitary circuits. The Brownian calculation gives the benchmark its theoretical teeth. The Haar-random numerics show the result survives the messier, finite-circuit regime that real hardware actually inhabits. Both pieces matter, because the field has been burned before by benchmarks that worked in one model and collapsed in another.
The paper is a preprint, not a peer-reviewed result, and it sits squarely in the NISQ regime. The authors are not claiming error-corrected or fault-tolerant quantum advantage. They are claiming a specific, falsifiable separation between noisy quantum hardware and a particular class of classical adversary, in a specific circuit family, at short depth. That is a narrower claim than "quantum supremacy," and the paper is more useful for it. A benchmark that survives a competent spoofer is the kind of tool experimental groups can actually use, and that is what makes it a yardstick rather than a flag to plant.
The broader context is a field that has been publicly working through how to measure itself honestly. A Simons Foundation article from May 2026 describes a quantum dynamics result that overturns a prior quantum-supremacy claim and points to the same moral: a benchmark that does not survive adversarial pressure is not a benchmark. The Quantum Insider's March 2026 piece on a quantum benchmarking initiative reads as an ecosystem-level acknowledgment of the same problem, with the sector trying to build the infrastructure that separates a real quantum result from a marketing one.
What to watch next is whether the heavy-output binary classifier holds up outside the regime the authors analyzed. The result assumes all-to-all connectivity and shallow depth, which is the right place to start, but the interesting question is whether the logarithmic sample complexity survives once a real superconducting or trapped-ion group tries to use it. If independent hardware groups can reproduce the separation on a real device, the benchmark becomes a candidate standard. If they cannot, the constructive claim softens. For now, the paper is the most concrete proposal this year for a NISQ-regime test that refuses to be fooled by a clever classical spoofer, and the bar it sets is the one the next round of quantum advantage claims will be measured against.