When an LLM writes quantum code, it might look correct and still be wrong. The output of a quantum program is not a deterministic answer but a probability distribution over measurement outcomes. A circuit that initializes a state incorrectly will still run without crashing; it will just produce the wrong physics. Telling whether the result is correct requires knowing what the right result should be and testing against it.
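To make that failure mode concrete, here is a toy single-qubit simulation in plain Python (deliberately not any real framework's API). A correct state preparation and a wrong one both execute without error; only comparing the sampled distribution against the expected one reveals the bug.

```python
import math
import random

# Minimal single-qubit simulator: a state is a pair of complex
# amplitudes (a, b) for |0> and |1>. Illustrative only.
H = [[1 / math.sqrt(2),  1 / math.sqrt(2)],
     [1 / math.sqrt(2), -1 / math.sqrt(2)]]  # Hadamard gate

def apply(gate, state):
    a, b = state
    return (gate[0][0] * a + gate[0][1] * b,
            gate[1][0] * a + gate[1][1] * b)

def sample(state, shots=10000, seed=0):
    # Measure in the computational basis `shots` times.
    rng = random.Random(seed)
    p1 = abs(state[1]) ** 2
    ones = sum(rng.random() < p1 for _ in range(shots))
    return {"0": (shots - ones) / shots, "1": ones / shots}

# Task: prepare an equal superposition (expected {'0': ~0.5, '1': ~0.5}).
correct = sample(apply(H, (1 + 0j, 0j)))  # applies H: roughly 50/50
wrong   = sample((1 + 0j, 0j))            # forgot the H: runs fine, all '0'
```

Both calls return cleanly; `wrong` is only detectably wrong because we know the target distribution, which is exactly the testing problem the benchmark has to solve.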
Researchers from the American University of Beirut and King Abdullah University of Science and Technology have published QuanBench+, a benchmark that tests how accurately LLMs generate quantum code across three major software frameworks simultaneously: Qiskit, Cirq, and PennyLane. The design is deliberate. By holding the programming task constant and changing only the framework, the benchmark can distinguish between two types of failure that look identical in a single-framework test: getting the quantum algorithm wrong versus getting the framework API wrong.
The results show that current frontier LLMs score 59.5 percent on Qiskit, 54.8 percent on Cirq, and 42.9 percent on PennyLane when given a single attempt. Those numbers improve substantially when the model is allowed to see its error and try again: 83.3 percent, 76.2 percent, and 66.7 percent respectively. The lift is real, but the gap that closes is not the hard one.
Give an LLM the framework-specific boilerplate as a starting point and it handles the interface friction well. The study refers to this as the prefill condition: providing the opening imports, device initialization, and measurement scaffolding. This reduces the straightforward mistakes, the kind that come from hallucinating a function name or forgetting a required measurement step. The semantic failures, where the algorithm itself is wrong, do not improve.
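As an illustration, a prefill prompt for the PennyLane case might look like the sketch below. The `default.qubit` device name and `qnode` decorator are real PennyLane conventions, but the task wording and exact scaffold are hypothetical; the paper's prompts are not reproduced here.

```python
# Hypothetical prefill prefix: the framework boilerplate is supplied,
# so the model generates only the circuit body and measurement logic.
PREFILL = '''\
import pennylane as qml

dev = qml.device("default.qubit", wires=2)

@qml.qnode(dev)
def circuit():
    # <model completes the gates and measurement here>
'''

def build_prompt(task_description):
    # Prompt = task statement + framework scaffold to complete.
    return f"{task_description}\n\nComplete this program:\n\n{PREFILL}"

prompt = build_prompt(
    "Prepare a Bell state and return the measurement probabilities.")
```

With the imports, device initialization, and decorator fixed in advance, a hallucinated API call becomes much less likely, which is why this condition isolates the reasoning failures.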
The authors organize their findings around three questions. The first is straightforward accuracy across frameworks, and the numbers above answer it. The second asks whether prefill narrows the gap between framework familiarity and genuine quantum reasoning. The answer is yes for the framework part, no for the reasoning part. The third asks whether feedback repair, letting the model observe and correct its own output, recovers the remaining failures. It recovers about half of them. The rest are reasoning mistakes: wrong algorithmic structure, incorrect measurement logic, or physical errors that no amount of API familiarity fixes.
The benchmark evaluates correctness using executable functional tests. For tasks with probabilistic outputs, the paper uses a KL divergence threshold of 0.05 to determine whether a generated distribution matches the target. This is a calibration choice, and the paper's appendix describes how it was set using repeated canonical executions. The threshold matters: too loose, and incorrect circuits pass; too tight, and correct circuits fail on sampling noise alone.
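The pass/fail check is simple to sketch. The sketch below computes KL divergence between a target and an observed measurement distribution and compares it against the 0.05 cutoff; the epsilon smoothing for zero counts is an assumption on my part, since the paper's exact smoothing scheme is not described here.

```python
import math

THRESHOLD = 0.05  # the paper's calibrated pass/fail cutoff

def kl_divergence(target, observed, eps=1e-12):
    # D_KL(target || observed) over the union of outcome labels.
    # eps keeps the log finite when an outcome was never observed
    # (smoothing choice is assumed, not taken from the paper).
    keys = set(target) | set(observed)
    return sum(
        target[k] * math.log(target[k] / (observed.get(k, 0.0) + eps))
        for k in keys if target.get(k, 0.0) > 0.0
    )

target   = {"00": 0.5, "11": 0.5}                 # ideal Bell-state distribution
observed = {"00": 0.48, "11": 0.51, "01": 0.01}   # sampled from a generated circuit

passes = kl_divergence(target, observed) < THRESHOLD
```

Here the small sampling wobble (and a stray `01` count) stays under the threshold, so the circuit passes; a circuit that never produced `11` would blow the divergence up and fail.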
The underlying point is that quantum programming has a dual failure mode that does not show up in classical code generation benchmarks. A classical program either runs and returns the right answer or it does not. A quantum program can run without crashing and produce physically wrong output. That makes the test harness design critical. QuanBench+ uses distributional correctness rather than circuit structure as the metric, which is the right call: two circuits that look nothing alike can implement the same unitary transformation, and penalizing structural difference when the physics is correct is a false negative.
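The equivalent-but-different point can be checked directly with a textbook identity: conjugating Z by Hadamards gives exactly the X gate, so a circuit applying H, Z, H and a circuit applying a single X are structurally distinct but implement the same unitary. A minimal verification in plain Python:

```python
import math

s = 1 / math.sqrt(2)
H = [[s, s], [s, -s]]      # Hadamard
Z = [[1, 0], [0, -1]]      # phase flip
X = [[0, 1], [1, 0]]       # bit flip

def matmul(A, B):
    # 2x2 matrix product.
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

HZH = matmul(matmul(H, Z), H)  # the three-gate circuit as one matrix

# Entry-by-entry comparison: H·Z·H equals X up to floating-point error.
same_unitary = all(abs(HZH[i][j] - X[i][j]) < 1e-9
                   for i in range(2) for j in range(2))
```

A structural metric would score these two circuits as different; a distributional metric correctly scores them as identical on every input state.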
The research comes with the usual caveats for work that has not yet completed peer review. QuanBench+ is accepted to the ICLR 2026 workshop track, where review is less rigorous than for the main conference. The task set of 42 problems, while aligned across frameworks, covers a specific slice of quantum programming: algorithms, gate decomposition, and state preparation. Whether the failure modes observed here transfer to larger or differently structured programs is an open question.
The original QuanBench, published by Guo and colleagues in October 2025, established the single-framework version with 44 tasks and showed that LLMs were below 40 percent accuracy without fine-tuning. QuanBench+ extends that work by adding the cross-framework dimension, which is where the distinct failure mode story emerges. Multiple concurrent groups are working on quantum code generation for LLMs. The contribution here is the diagnostic framing, not the raw accuracy numbers.
For developers building quantum software tools, the practical implication is that improving quantum code generation requires attacking two separate problems. Framework-specific scaffolding, better API documentation in the prompt, and tighter integration with framework-specific tooling will close the API error gap. Closing the reasoning error gap requires something else: better training on quantum algorithms specifically, or tools that verify algorithmic structure before execution. Feedback repair handles the easy part. The hard part is still open.