Can LLMs Prove Robotic Path Planning Optimality? A Benchmark for Research-Level Algorithm Verification
Proving that a robot path-planning algorithm is within a known factor of optimal is not the kind of work you hand to an intern. It requires domain-specific geometric intuition, multi-step mathematical reasoning, and the patience to work through constraints that took the original researchers months to construct. It is, in other words, exactly the kind of thing researchers have been hoping large language models might eventually be able to help with — or at least attempt.
A team from George Mason University and Florida Atlantic University has now built the first benchmark designed to test whether LLMs can actually do it. The result is both more encouraging and more sobering than the headline numbers suggest.
The preprint, posted to arXiv on March 19, 2026, is titled "Can LLMs Prove Robotic Path Planning Optimality? A Benchmark for Research-Level Algorithm Verification". The lead author is Zhengbang Yang, a PhD student at George Mason, and the senior author is Zhuangdi Zhu, an assistant professor in the university's Department of Cyber Security Engineering and a former senior data scientist at Microsoft. Co-authors Md. Tasin Tazwar and Minghan Wei are at Florida Atlantic University.
Path planning — figuring out how a robot gets from A to B, covers a territory, or completes a delivery route under energy or distance constraints — is foundational to almost every mobile robot in production. The catch is that the most interesting variants of the problem are NP-hard, meaning exact solutions don't scale. What practitioners rely on instead are approximation algorithms: algorithms with provable guarantees that their output is within some ratio of the true optimum. A warehouse robot navigating orders, a drone covering a search grid, a fleet of autonomous vehicles splitting up a delivery region — all of them are running some version of this. The quality of the approximation ratio directly affects real-world performance: energy consumption, mission completion times, cost.
Designing those algorithms is hard. Proving their approximation bounds is harder. And that proof work — graduate-student-hours, faculty-weeks — is what the George Mason team asked LLMs to attempt.
Their benchmark consists of 34 proof tasks drawn from 11 peer-reviewed robotics papers. The tasks span three problem families: energy-constrained coverage path planning (grids, polygons, contours with charging constraints), distance- and capacity-constrained vehicle routing variants, and geometric neighborhood touring in three dimensions. Crucially, the benchmark anonymizes algorithm and problem names to prevent the models from reciting memorized proofs rather than reasoning from the provided material.
Four models were evaluated: OpenAI's GPT-5.2, Google's Gemini 3 Pro, xAI's Grok 4.1, and Alibaba's open-source Qwen3.5. The paper notes that Anthropic's Claude 4.6 was excluded after it consistently timed out on the proof tasks — a notable omission given Claude's performance on other reasoning benchmarks. Each model was tested across four conditions: no context, domain-specific lemmas provided, the correct approximation ratio given as oracle knowledge, and both lemmas plus oracle ratio together.
The full results show a field that can occasionally manage graduate-level theorem work but cannot do it reliably. Without any scaffolding, GPT-5.2 led with a 26.47 percent success rate — meaning it produced a fully valid proof for fewer than one in three tasks. Gemini 3 Pro hit 11.76 percent. Grok 4.1 reached 14.71 percent. Qwen3.5 managed 8.82 percent unaided.
The more interesting finding comes when you add domain-specific lemmas — essentially giving the model the geometric or algorithmic stepping stones that a human researcher would have internalized from prior coursework. With that context, Qwen3.5 surged to 44.12 percent, outperforming every proprietary model tested. Gemini 3 Pro reached 35.29 percent, GPT-5.2 dropped slightly to 32.35 percent. The open-source model, given the right hints, outperformed the flagship proprietary models. The ceiling, even with both lemmas and the answer provided as oracle knowledge, was 50 percent for GPT-5.2 — which means the best-case scenario, with maximum assistance, is still one valid proof out of two.
Generic chain-of-thought prompting, the paper found, does not help and can actively hurt: it increases hallucination rates without improving logical rigor. Only structured domain context moves the needle.
The dominant failure mode identified in the paper's error analysis is what the authors call unsupported inference — the model takes a logical leap it cannot justify, asserts a relationship between quantities without deriving it, or skips a step that turns out to be the hard part. Other common failures include incorrect inequality directions, overgeneralization from specific cases, and the insertion of irrelevant assumptions. Hallucination, in this context, is not fabricating citations — it is fabricating proof steps.
This work sits adjacent to but distinct from an earlier direction in the field. A 2025 preprint presented at NAACL, "Large Language Models Can Solve Real-World Planning Rigorously with Formal Verification Tools", explored routing LLMs through SMT solvers to verify satisfiability of planning instances. That approach uses LLMs to formalize a problem so a solver can check it. The George Mason work asks a different and harder question: can the LLM itself produce the proof that an algorithm is provably near-optimal — the kind of mathematical argument a robotics researcher writes in a paper? These are not competing approaches; they are different rungs on a ladder.
The benchmark uses GPT-5.2 as both test subject and LLM-as-judge — a methodological concern worth flagging. GPT-5.2 evaluating its own outputs on 10-point scales across three dimensions (approximation ratio correctness, logical validity, and relevance to provided context) introduces a potential self-favorability bias, even if the paper reports that human researchers validated the judge's assessments. The benchmark's 34 tasks span only three problem families drawn from 11 papers, which is enough to establish the landscape but not enough to make strong claims about generalization. And the code repository remains under anonymous review, with no public data release yet.
None of that undercuts the core result, which is genuinely useful: LLMs can produce valid approximation-optimality proofs at a non-trivial rate when given structured domain context, they fail reliably in characteristic ways, and those failure modes can be partially addressed with targeted augmentation. For researchers who spend weeks on this work, that is at minimum a diagnostic — and potentially the beginning of a useful tool.
Zhu's lab at George Mason has been building toward exactly this kind of infrastructure: the paper includes an automated extraction pipeline that uses GPT-5.2 to convert PDFs to structured LaTeX, previewing what a full-stack proof assistant for planning algorithms might eventually look like. Whether LLMs get from 44 percent to something operationally useful is the open question — and it will probably depend less on raw model capability than on how well researchers can encode domain knowledge into context that models can actually use.