Formal Verification of Safety-Critical Kernel Code Is Now 71% Automated
The operating system kernel at the heart of DARPA autonomous vehicles and Boeing rotorcraft runs on something unusual: mathematical proof.

seL4, a microkernel developed at NICTA in the early 2000s, is one of the few pieces of deployed software with a complete formal verification — a proof, checked by machine, that the code does exactly what its specification says. No other verified OS comes close to its deployment footprint.
That verification took about 20 person-years to build. Roughly 10,000 lines of C code, 100,000-plus lines of Isabelle/HOL proof. And it needs to be maintained — every time seL4 evolves, the proofs must follow. There are perhaps a few hundred people in the world who can write Isabelle proofs at this level.
A new paper from Nanjing University and ETH Zurich may be changing that calculus. Stepwise, submitted to arXiv on March 20 (https://arxiv.org/abs/2603.19715), achieves 77.6% success on the FVEL seL4 benchmark — the most comprehensive automated evaluation of seL4 theorem proving to date. Prior fine-tuned LLMs topped out below 10% on the same benchmark. Selene, using GPT-4 on a selected subset of easier theorems, reached around 20%. AutoReal, published in February, hit 51.67% on a different sample. The progression from under 10% to 77.6% has taken less than 18 months.
The breakthrough isn't larger models. It's tighter integration with the proof checker itself.
Stepwise frames proof search as a best-first tree search over Isabelle proof states. A fine-tuned language model proposes candidate tactics at each step. Those tactics get executed against an actual Isabelle REPL — not simulated, not judged by another model, but run through the real proof assistant. Then symbolic tools filter the branches: Nitpick and Quickcheck, Isabelle's counterexample generators, test whether candidate proof states can be falsified. If a branch admits a counterexample, it dies. The tree search backtracks and tries again. When the search stalls on simple subgoals, Sledgehammer — Isabelle's existing automation hammer — fires as a backstop.
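The loop described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: the callbacks (propose_tactics, run_tactic, has_counterexample, try_sledgehammer) are hypothetical stand-ins for the fine-tuned LLM, the Isabelle REPL, Nitpick/Quickcheck, and Sledgehammer respectively.

```python
import heapq

def best_first_proof_search(initial_state, propose_tactics, run_tactic,
                            has_counterexample, try_sledgehammer,
                            max_expansions=1000):
    """Hedged sketch of neuro-symbolic best-first proof search.

    propose_tactics(state)   -> [(score, tactic), ...] from the LLM
    run_tactic(state, t)     -> new proof state via the real REPL, or
                                None if the proof checker rejects it
    has_counterexample(state)-> Nitpick/Quickcheck-style falsification
    try_sledgehammer(state)  -> finished proof script, or None
    """
    # Priority queue ordered by cumulative model score (lower = better).
    frontier = [(0.0, 0, initial_state)]
    tie = 1  # tiebreaker so proof states never get compared directly
    expansions = 0
    while frontier and expansions < max_expansions:
        score, _, state = heapq.heappop(frontier)
        expansions += 1
        if state.is_proved():
            return state.proof_script()
        # Backstop: let existing automation finish easy subgoals.
        finished = try_sledgehammer(state)
        if finished is not None:
            return finished
        # The LLM proposes; the REPL and counterexample generators filter.
        for tactic_score, tactic in propose_tactics(state):
            child = run_tactic(state, tactic)
            if child is None:          # checker rejected the tactic
                continue
            if has_counterexample(child):  # falsifiable branch dies
                continue
            heapq.heappush(frontier, (score + tactic_score, tie, child))
            tie += 1
    return None  # search exhausted or budget hit
```

The design point worth noting: the model only ranks candidates. Nothing enters the frontier unless the real proof checker accepted it and the counterexample generators failed to falsify it.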
The key architectural insight: the LLM handles creative generation of candidate tactics, but symbolic tools handle verification and pruning. Neither component is strong enough alone. The LLM without symbolic grounding hallucinates. The symbolic tools without the LLM can't navigate the search space efficiently.
The paper makes this concrete in Figure 1: the authors asked GPT-5.1 and Gemini 3 to prove a representative seL4 theorem. Both failed. Both recognized the proof context and invoked relevant tactics — but misused the domain-specific wp tactic and hallucinated nonexistent lemmas. The problem isn't reasoning capability. It's that tactic-lemma combinations in seL4's Isabelle proof corpus are so specialized that general training data doesn't adequately cover them. Frontier models can't substitute for domain-specific fine-tuning here, and neither can fine-tuning substitute for symbolic grounding.
The practical numbers are equally striking. In AI-human collaboration mode — where Stepwise generates proof steps and experts review and correct — the system reduces expert effort by an average of 71.1%. If that holds outside the benchmark setting, it could meaningfully change the economics of formal verification for new systems.
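As a back-of-the-envelope illustration (my arithmetic, not a figure from the paper), applying that 71.1% reduction to seL4's reported 20 person-year effort:

```python
# Illustrative arithmetic only: applies the paper's reported 71.1%
# expert-effort reduction to seL4's reported ~20 person-year proof effort.
original_person_years = 20.0
reduction = 0.711

remaining = original_person_years * (1 - reduction)
print(f"~{remaining:.1f} person-years of expert effort remain")
# prints: ~5.8 person-years of expert effort remain
```

Roughly six person-years instead of twenty — still substantial, but a different order of project.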
The authors are Baoding He and Zenan Li (co-first authors), working across Nanjing University's State Key Lab for Novel Software Technology and ETH Zurich's Department of Computer Science, under Zhendong Su — a PL researcher best known for compiler fuzzing (CSmith), now running the Automated Reasoning and Verification Group at ETH. Li has a NeurIPS 2024 paper on autoformalization. The NJU group (Yuan Yao and Xiaoxing Ma as corresponding authors) published neuro-symbolic loop invariant inference at ISSTA 2025. This is a research thread, not a one-off result.
One benchmark caveat worth flagging: the comparisons aren't perfectly clean. Selene's 20% was on 340 selected theorems that are "relatively easy, with proof lengths 1-5." FVEL — which Stepwise uses — is more comprehensive, extracting all theorems rather than a curated subset. AutoReal used 660 theorems from "Important Theories." These are different test sets. Stepwise's 77.6% is on the most demanding benchmark, which makes the improvement more significant, not less — but headline-to-headline comparisons between systems should be read carefully.
There's also one open question the abstract doesn't answer: which base model was fine-tuned, and at what scale? Whether the underlying model is 7B or 70B makes a material difference in understanding what's driving the result. The paper claims "data-efficient" adaptation — but that needs verification from the full paper.
The broader landscape is moving fast. DeepSeek-Prover-V2, released in April 2025, showed strong results on mathematical theorem proving in Lean 4 (https://arxiv.org/abs/2504.05640). Startups like Harmonic and Logical Intelligence are moving into the space. But the Isabelle-focused, industrial systems angle remains comparatively uncrowded. seL4's proof corpus lives in Isabelle/HOL, and the problem of maintaining proof coverage for aviation, automotive, and defense systems is distinct from competitive mathematics benchmarks — and arguably more urgent.
Martin Kleppmann laid out the structural reason LLMs might do particularly well in formal verification in a December 2025 blog post: unlike code generation, where a hallucinated behavior can slip through testing, a hallucinated proof step gets immediately rejected by the proof checker. The checker acts as a near-perfect filter. That feedback loop makes the problem uniquely tractable. Stepwise is the clearest evidence yet that coupling that loop with best-first search and symbolic pruning is the architecture that will close the automation gap.
The Isabelle REPL the team built — which exposes fine-grained proof states in a way standard Isabelle tooling doesn't — may itself be a durable contribution to the research community. Whether they release it publicly hasn't been confirmed. Worth watching.
Artificial Intelligence · 33m ago · 3 min read