Formal Verification of Safety-Critical Kernel Code Is Now 71% Automated
The operating system kernel at the heart of DARPA autonomous vehicles and Boeing rotorcraft runs on something unusual: mathematical proof.

seL4, a microkernel developed at NICTA in the early 2000s, is one of the few pieces of deployed software with a complete formal verification — a proof, checked by machine, that the code does exactly what its specification says. No other verified OS comes close to its deployment footprint.
That verification took about 20 person-years to build. Roughly 10,000 lines of C code, 100,000-plus lines of Isabelle/HOL proof. And it needs to be maintained — every time seL4 evolves, the proofs must follow. There are perhaps a few hundred people in the world who can write Isabelle proofs at this level.
A new paper from Nanjing University and ETH Zurich may be changing that calculus. Stepwise, submitted to arXiv on March 20 (https://arxiv.org/abs/2603.19715), achieves 77.6% success on the FVEL seL4 benchmark — the most comprehensive automated evaluation of seL4 theorem proving to date. Prior fine-tuned LLMs topped out below 10% on the same benchmark. Selene, using GPT-4 on a selected subset of easier theorems, reached around 20%. AutoReal, published in February, hit 51.67% on a different sample. The progression from under 10% to 77.6% has taken less than 18 months.
The breakthrough isn't larger models. It's tighter integration with the proof checker itself.
Stepwise frames proof search as a best-first tree search over Isabelle proof states. A fine-tuned language model proposes candidate tactics at each step. Those tactics get executed against an actual Isabelle REPL — not simulated, not judged by another model, but run through the real proof assistant. Then symbolic tools filter the branches: Nitpick and Quickcheck, Isabelle's counterexample generators, test whether candidate proof states can be falsified. If a branch admits a counterexample, it dies. The tree search backtracks and tries again. When the search stalls on simple subgoals, Sledgehammer — Isabelle's existing automation hammer — fires as a backstop.
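The loop described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: the callbacks (propose_tactics, run_tactic, has_counterexample, try_sledgehammer) are hypothetical stand-ins for the fine-tuned LLM, the Isabelle REPL, Nitpick/Quickcheck, and Sledgehammer respectively.

```python
import heapq

def best_first_proof_search(initial_state, propose_tactics, run_tactic,
                            has_counterexample, try_sledgehammer,
                            max_expansions=1000):
    """Hedged sketch of neuro-symbolic best-first proof search.

    propose_tactics(state)   -> [(score, tactic), ...] from the LLM
    run_tactic(state, t)     -> new proof state via the real REPL, or
                                None if the proof checker rejects it
    has_counterexample(state)-> Nitpick/Quickcheck-style falsification
    try_sledgehammer(state)  -> finished proof script, or None
    """
    # Priority queue ordered by cumulative model score (lower = better).
    frontier = [(0.0, 0, initial_state)]
    tie = 1  # tiebreaker so proof states never get compared directly
    expansions = 0
    while frontier and expansions < max_expansions:
        score, _, state = heapq.heappop(frontier)
        expansions += 1
        if state.is_proved():
            return state.proof_script()
        # Backstop: let existing automation finish easy subgoals.
        finished = try_sledgehammer(state)
        if finished is not None:
            return finished
        # The LLM proposes; the REPL and counterexample generators filter.
        for tactic_score, tactic in propose_tactics(state):
            child = run_tactic(state, tactic)
            if child is None:          # checker rejected the tactic
                continue
            if has_counterexample(child):  # falsifiable branch dies
                continue
            heapq.heappush(frontier, (score + tactic_score, tie, child))
            tie += 1
    return None  # search exhausted or budget hit
```

The design point worth noting: the model only ranks candidates. Nothing enters the frontier unless the real proof checker accepted it and the counterexample generators failed to falsify it.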
The key architectural insight: the LLM handles creative generation of candidate tactics, but symbolic tools handle verification and pruning. Neither component is strong enough alone. The LLM without symbolic grounding hallucinates. The symbolic tools without the LLM can't navigate the search space efficiently.
The paper makes this concrete in Figure 1: the authors asked GPT-5.1 and Gemini 3 to prove a representative seL4 theorem. Both failed. Both recognized the proof context and invoked relevant tactics — but misused the domain-specific wp tactic and hallucinated nonexistent lemmas. The problem isn't reasoning capability. It's that tactic-lemma combinations in seL4's Isabelle proof corpus are so specialized that general training data doesn't adequately cover them. Frontier models can't substitute for domain-specific fine-tuning here, and neither can fine-tuning substitute for symbolic grounding.
The practical numbers are equally striking. In AI-human collaboration mode — where Stepwise generates proof steps and experts review and correct — the system reduces expert effort by an average of 71.1%. If that holds outside the benchmark setting, it could meaningfully change the economics of formal verification for new systems.
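As a back-of-the-envelope illustration (my arithmetic, not a figure from the paper), applying that 71.1% reduction to seL4's reported 20 person-year effort:

```python
# Illustrative arithmetic only: applies the paper's reported 71.1%
# expert-effort reduction to seL4's reported ~20 person-year proof effort.
original_person_years = 20.0
reduction = 0.711

remaining = original_person_years * (1 - reduction)
print(f"~{remaining:.1f} person-years of expert effort remain")
# prints: ~5.8 person-years of expert effort remain
```

Roughly six person-years instead of twenty — still substantial, but a different order of project.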
The authors are Baoding He and Zenan Li (co-first authors), working across Nanjing University's State Key Lab for Novel Software Technology and ETH Zurich's Department of Computer Science, under Zhendong Su — a PL researcher best known for compiler fuzzing (CSmith), now running the Automated Reasoning and Verification Group at ETH. Li has a NeurIPS 2024 paper on autoformalization. The NJU group (Yuan Yao and Xiaoxing Ma as corresponding authors) published neuro-symbolic loop invariant inference at ISSTA 2025. This is a research thread, not a one-off result.
One benchmark caveat worth flagging: the comparisons aren't perfectly clean. Selene's 20% was on 340 selected theorems that are "relatively easy, with proof lengths 1-5." FVEL — which Stepwise uses — is more comprehensive, extracting all theorems rather than a curated subset. AutoReal used 660 theorems from "Important Theories." These are different test sets. Stepwise's 77.6% is on the most demanding benchmark, which makes the improvement more significant, not less — but headline-to-headline comparisons between systems should be read carefully.
There's also one open question the abstract doesn't answer: which base model was fine-tuned, and at what scale? Whether the underlying model is 7B or 70B makes a material difference in understanding what's driving the result. The paper claims "data-efficient" adaptation — but that needs verification from the full paper.
The broader landscape is moving fast. DeepSeek-Prover-V2, released in April 2025, showed strong results on mathematical theorem proving in Lean 4 (https://arxiv.org/abs/2504.05640). Startups like Harmonic and Logical Intelligence are moving into the space. But the Isabelle-focused, industrial systems angle remains comparatively uncrowded. seL4's proof corpus lives in Isabelle/HOL, and the problem of maintaining proof coverage for aviation, automotive, and defense systems is distinct from competitive mathematics benchmarks — and arguably more urgent.
Martin Kleppmann laid out the structural reason LLMs might do particularly well in formal verification in a December 2025 blog post: unlike code generation, where a hallucinated behavior can slip through testing, a hallucinated proof step gets immediately rejected by the proof checker. The checker acts as a near-perfect filter. That feedback loop makes the problem uniquely tractable. Stepwise is the clearest evidence yet that coupling that loop with best-first search and symbolic pruning is the architecture that will close the automation gap.
The Isabelle REPL the team built — which exposes fine-grained proof states in a way standard Isabelle tooling doesn't — may itself be a durable contribution to the research community. Whether they release it publicly hasn't been confirmed. Worth watching.
Artificial Intelligence · 33m ago · 3 min read