1 in 4 Quantum Decoders Actually Works
Dennis Delali Kwesi Wayo ran four decoders through identical tests. Three collapsed under bootstrap analysis. One held.

image from Gemini Imagen 4
A systematic benchmarking study comparing four surface-code decoders reveals that threshold estimates in quantum error correction are strongly decoder-dependent, contradicting the assumption that thresholds are intrinsic hardware properties. Only Minimum Weight Perfect Matching (MWPM) produced statistically stable crossing distributions under bootstrap analysis, while Union-Find, Belief Propagation, and a neural-guided MWPM variant failed to yield valid threshold estimates. The findings suggest that the quantum computing field's current practice of reporting thresholds without specifying the decoder introduces substantial and roadmap-altering variability.
- Only MWPM passed bootstrap crossing diagnostics among the four tested decoders; Union-Find, BP, and neural-MWPM produced zero valid crossing samples, indicating unreliable threshold extraction.
- Decoder choice shifted threshold estimates by amounts large enough to materially affect hardware program roadmaps, not merely a methodological artifact.
- Dense-window scanning across the full noise range (sigma 0.08–0.24) yielded NaN crossings for all decoders, showing that threshold estimates depend on both the algorithm and the window definition.
When a quantum hardware team publishes a threshold number, the first question a careful reviewer should ask is: which decoder did you use? A new benchmarking study suggests the answer matters more than the community has been admitting. Dennis Delali Kwesi Wayo at Georgia Tech's College of Computing, working with collaborators at Volkswagen AG, RWTH Aachen, the Federal University of Juiz de Fora, and the University of Lübeck, has run the most systematic comparison to date of four surface-code decoders under identical conditions. The result is a quiet indictment of how threshold estimates are currently reported across the field: only one of the four decoders produced stable crossing distributions under bootstrap analysis, and the choice of decoder changed the threshold estimate by amounts that would alter a hardware program's roadmap.
The practical stakes are concrete. QpiAI, for instance, has built a hardware Union-Find decoder on its 64-qubit Kaveri superconducting processor that reduces QEC cycle time from tens of microseconds to roughly 1.5 microseconds — bringing decoding within the coherence limits of current devices. Any speedup in the decoder software layer matters when every correction cycle is eating into the time available before errors accumulate. The paper, posted to arXiv on March 25, 2026, benchmarks Minimum Weight Perfect Matching (MWPM), Union-Find, Belief Propagation (BP), and a neural-guided variant of MWPM using the LiDMaS+ decoder framework arXiv:2603.25757. The core benchmarking task is threshold estimation — finding the noise level at which increasing the size of an error-correcting code stops providing protection. That crossing point is supposed to be a property of the hardware and the noise channel, not the software used to decode it. The paper shows that assumption doesn't hold.
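The crossing the paper extracts can be illustrated with a minimal sketch: below the crossing noise level, the larger-distance code has the lower logical error rate; above it, scaling up stops helping. All curves and values here are synthetic illustrations, not data from the paper, and `find_crossing` is a hypothetical helper, not part of LiDMaS+:

```python
import numpy as np

def find_crossing(noise, ler_small, ler_large):
    """Estimate the noise level where the logical-error-rate (LER) curve of a
    larger code distance crosses that of a smaller one. Returns NaN when no
    sign change exists in the window, mirroring the paper's NaN crossings."""
    diff = np.asarray(ler_large) - np.asarray(ler_small)
    sign_change = np.where(np.diff(np.sign(diff)) != 0)[0]
    if sign_change.size == 0:
        return float("nan")  # no crossing inside this scanning window
    i = sign_change[0]
    # linear interpolation between the two bracketing grid points
    t = -diff[i] / (diff[i + 1] - diff[i])
    return float(noise[i] + t * (noise[i + 1] - noise[i]))

# Synthetic curves: the larger distance wins at low noise, loses at high noise.
noise = np.linspace(0.08, 0.24, 17)
ler_d3 = 0.05 + 2.0 * (noise - 0.08)
ler_d5 = 0.01 + 3.5 * (noise - 0.08)
print(find_crossing(noise, ler_d3, ler_d5))  # crossing near 0.107
```

Note that the result depends on the grid and the window: shift either and the bracketing points change, which is exactly the window-definition sensitivity the study reports.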
Crossing-bootstrap diagnostics were stable only for MWPM, with median crossing noise levels of sigma_3,5 = 0.10 (across 1,911 of 2,000 valid bootstrap samples) and sigma_5,7 = 0.1375 (1,941 of 2,000 valid) arXiv:2603.25757. Union-Find, BP, and the neural-guided MWPM variant returned zero valid crossing samples in the bootstrap analysis — the crossing distributions were inconsistent enough that no reliable threshold estimate could be extracted arXiv:2603.25757. Dense-window scanning over the full noise range from sigma 0.08 to 0.24 produced NaN crossings for every decoder, confirming that the threshold estimate depends on both the decoding algorithm and how the scanning window is defined arXiv:2603.25757.
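The crossing-bootstrap diagnostic itself is straightforward to sketch: resample the per-point failure counts, re-extract the crossing from each resample, and count how many samples yield a finite crossing. The sketch below uses synthetic counts and assumed binomial resampling, not the paper's code or data:

```python
import numpy as np

rng = np.random.default_rng(0)

def crossing(noise, a, b):
    """Noise level where curve b crosses curve a (NaN if none in window)."""
    d = b - a
    idx = np.where(np.diff(np.sign(d)) != 0)[0]
    if idx.size == 0:
        return float("nan")
    i = idx[0]
    t = -d[i] / (d[i + 1] - d[i])
    return noise[i] + t * (noise[i + 1] - noise[i])

def bootstrap_crossings(noise, fails_a, fails_b, shots, n_boot=2000):
    """Resample per-point failure counts binomially and re-extract the
    crossing each time. The count of valid (non-NaN) samples is the
    stability diagnostic; a decoder with zero valid samples gives no
    trustworthy threshold estimate."""
    samples = []
    for _ in range(n_boot):
        ler_a = rng.binomial(shots, fails_a / shots) / shots
        ler_b = rng.binomial(shots, fails_b / shots) / shots
        samples.append(crossing(noise, ler_a, ler_b))
    samples = np.array(samples)
    valid = samples[~np.isnan(samples)]
    return len(valid), (np.median(valid) if valid.size else float("nan"))

noise = np.linspace(0.08, 0.16, 9)
shots = 5000
fails_a = (shots * (0.05 + 2.0 * (noise - 0.08))).astype(int)  # synthetic d=3
fails_b = (shots * (0.01 + 3.5 * (noise - 0.08))).astype(int)  # synthetic d=5
n_valid, med = bootstrap_crossings(noise, fails_a, fails_b, shots)
print(n_valid, med)
```

On well-separated synthetic curves like these, nearly all 2,000 resamples yield a valid crossing, which is the MWPM-like regime; a decoder whose curves do not separate cleanly collapses to few or zero valid samples.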
A companion paper from the same group (arXiv:2603.06730) provides independent Pauli-mode baseline data. At code distance 5, MWPM reduced mean logical error rate from 0.384 (Union-Find) to 0.260, producing a stable crossing median of approximately 0.053 arXiv:2603.06730. In hybrid fixed-distance runs, Union-Find was substantially worse than MWPM (mean LER 0.1657 versus 0.1195), while a trained neural-guided MWPM tracked MWPM closely at LER 0.1158 arXiv:2603.06730. The neural-guided variant showed elevated decoder-failure rates — the fraction of correction cycles where the decoder itself can't produce a valid correction — reaching 0.1335 at code distance 7 under the highest noise conditions arXiv:2603.06730.
At the specific operating point of distance 5 and sigma 0.20 in native GKP mode, MWPM and Union-Find define the Pareto frontier: MWPM at 1.341 seconds and LER 0.2273, Union-Find at 1.332 seconds and LER 0.2303 arXiv:2603.25757. The raw difference is 0.003 in LER — the 95% confidence interval for the MWPM-minus-Union-Find delta spans [-0.0104, 0.00329] and includes zero arXiv:2603.25757. BP is dominated at this operating point: 7.640 seconds and LER 0.6107, worse than both on runtime and error rate arXiv:2603.25757. Neural-guided MWPM is slower and less accurate than plain MWPM at this point, at 1.396 seconds and LER 0.3730 arXiv:2603.25757.
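The Pareto claim can be checked mechanically from the reported numbers. A decoder is dominated when another is at least as good on both runtime and LER and strictly better on one; the sketch below applies that definition to the paper's distance-5, sigma-0.20 native-GKP figures (the dominance test is a standard construction, not the paper's code):

```python
# (runtime_seconds, logical_error_rate) at d=5, sigma=0.20, native GKP mode,
# as reported in arXiv:2603.25757.
decoders = {
    "MWPM":        (1.341, 0.2273),
    "Union-Find":  (1.332, 0.2303),
    "BP":          (7.640, 0.6107),
    "neural-MWPM": (1.396, 0.3730),
}

def pareto_front(points):
    """Return the decoders not dominated on the (runtime, LER) plane."""
    front = []
    for name, (t, ler) in points.items():
        dominated = any(
            (t2 <= t and l2 <= ler) and (t2 < t or l2 < ler)
            for other, (t2, l2) in points.items() if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)

print(pareto_front(decoders))  # ['MWPM', 'Union-Find']
```

BP and neural-MWPM both fall out because plain MWPM beats each of them on both axes simultaneously, while MWPM and Union-Find trade 9 ms of runtime against 0.003 of LER.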
The threading parallelization delivered a 1.34x speedup in Pauli mode and 1.94x in native GKP mode, with mean absolute LER deltas between threaded and single-threaded runs of 0.00607 and 0.00520 respectively. The paper calls this statistically faithful, and the deltas are small enough not to be a practical concern, but they are real nonzero numbers rather than effectively zero arXiv:2603.25757.
The sensitivity analysis reveals what hardware teams should actually be worried about. For both MWPM and Union-Find, measurement noise is the dominant sensitivity axis: the estimated slope of LER with respect to measurement noise was approximately 20.5, compared to 1.4 for gate noise and 1.3 for idle noise arXiv:2603.25757. Loss actually reduces LER, with a slope of -1.15. A hardware team spending engineering cycles on gate fidelity while neglecting measurement fidelity is optimizing the wrong axis.
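The slope-style sensitivity numbers above can be recovered with an ordinary least-squares fit of LER against small perturbations along each noise axis. In this sketch the "true" slopes are seeded with the paper's reported values, but the design matrix and the noise-free linear response are synthetic illustration, not the study's methodology:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed slopes taken from the paper's reported sensitivities:
# measurement, gate, idle, loss.
true_slopes = np.array([20.5, 1.4, 1.3, -1.15])

# 200 synthetic operating points with small perturbations on each axis.
X = rng.uniform(0.0, 0.02, size=(200, 4))
ler = 0.10 + X @ true_slopes  # noise-free linear response (illustration only)

# Ordinary least squares with an intercept column recovers the slopes.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, ler, rcond=None)
print(coef[1:])  # ~[20.5, 1.4, 1.3, -1.15]
```

The practical reading is the ratio: with a measurement-noise slope roughly fifteen times the gate-noise slope, a fixed engineering budget buys far more logical-error suppression on the measurement axis.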
The practical implication is simple: any time someone publishes a threshold number, the decoder specification is as important as the noise model. If two hardware groups report thresholds that differ by more than the margin of error, and they're using different decoders, the gap may be in the software stack rather than the hardware. The field has treated threshold estimation as a hardware characterization problem. This paper suggests it is also a software benchmarking problem — and one that has been underreported.
The paper also raises a methodological question the authors do not fully resolve: whether a universal threshold can be recovered with the right combination of decoder and estimator, or whether threshold estimates will remain decoder-dependent. Dense-window scanning, which is one natural way to sweep across noise ranges, produced NaN crossings for every algorithm — including MWPM — in this study. That is a signal worth tracking in follow-up work.
Wayo's team — Dennis Delali Kwesi Wayo at the Georgia Tech College of Computing in Atlanta, with Chinonso Onah at Volkswagen AG and RWTH Aachen, Leonardo Goliatt at the Federal University of Juiz de Fora in Brazil, and Sven Groppe at the University of Lübeck in Germany — ran all experiments using LiDMaS+ v1.1.0 arXiv:2603.25757. The companion paper provides the Pauli-mode baseline that contextualizes the GKP-mode results arXiv:2603.06730.
For hardware teams: specify your decoder when publishing thresholds. For the community: the crossing-bootstrap diagnostic used here — running threshold estimation across thousands of resampled datasets and checking for consistency — is a sanity check that should become standard practice. And if your MWPM threshold and someone else's MWPM threshold disagree, the gap isn't necessarily a hardware difference. It might be a window-size difference, a noise model mismatch, or a decoder version with different tiebreaking behavior. Those details matter now in ways they didn't seem to before.