The LLM Agent That Gets Better by Rewriting Its Own Manual

The LLM Agent That Gets Better by Rewriting Its Own Manual — type0 | type0

PREVIEWThe LLM Agent That Gets Better by Rewriting Its Own Manual · MD

The code does something the paper only claims: it makes an AI model significantly better at coding or research tasks without adding a single request to an AI service at inference time. That is the finding from auditing SkillOpt, an open-source tool from Microsoft Research. The optimizer model runs during training, offline. When it is done, the output is a plain-text file. At deployment, that file is injected into context and the target model runs without calling the optimizer again. The zero-overhead claim is not marketing. It is baked into the architecture of the export.

SkillOpt is a text-space optimizer for agent skills. During training, a frontier language model runs rollouts with the current skill document, analyzes what went wrong, proposes edits, and accepts only the ones that improve performance on a held-out validation set. This happens offline. The optimizer model calls happen here, during the training loop, not during deployment. When the training is done, the output is a single file: best_skill.md. At inference time, that file is injected into the target model's context and the model runs with no additional overhead, no additional API calls, no weight changes.

The code structurally enforces this. The training loop in trainer.py calls the optimizer model at rollout, reflection, aggregation, selection, and update stages. The evaluation stage gates acceptance. But when the best skill is exported — saved as best_skill.md — the model runs it as a static document. There is no runtime call back to the optimizer.

The paper reports large results. On GPT-5.5 in direct chat, SkillOpt improves average accuracy by 23.5 points over no-skill baseline across six benchmarks — SearchQA, SpreadsheetBench, OfficeQA, DocVQA, LiveMathematicianBench, and ALFWorld. Inside the Codex agentic loop the gain is 24.8 points. Inside Claude Code it is 19.1 points. The paper reports best-or-tied results across all 52 evaluated combinations of model, benchmark, and execution harness, beating human-written skills, one-shot LLM generation, and five competing optimization methods including Trace2Skill, TextGrad, GEPA, and EvoSkill, according to the paper.

The skill artifacts are small — 300 to 2,000 tokens after 11 to 44 accepted edits — and they transfer across environments. A skill trained on GPT-5.4 for spreadsheet tasks improves every smaller GPT variant tested. A skill trained inside Codex transfers to the Claude Code execution environment with a 59.7-point gain on spreadsheet tasks. An OlympiadBench skill produces positive gains on Omni-MATH without further optimization, the Microsoft Research project page notes. This means a team can optimize a skill once, audit it as plain text, and reuse it across different models and harnesses without retraining.

Microsoft released the code under an MIT license on GitHub. It has 4.2 thousand stars and 87 commits at the time of this writing, per the repository. The code is readable, modular, and backed by a full training pipeline with support for Azure OpenAI, Anthropic Claude, Qwen, and Codex harnesses.

The strategic oddity is hard to miss. Microsoft built a tool that makes OpenAI's model perform better inside Anthropic's Claude Code than inside OpenAI's own Codex harness, and then gave it away. The paper's results show GPT-5.5 hitting higher accuracy inside Claude Code than Codex under SkillOpt optimization on several benchmarks. Microsoft open-sourced a tool that makes a competitor's product look stronger, and did so with a license that lets anyone use it commercially with no restrictions.

The obvious question is why. Microsoft is not a company that typically gives away research-grade tooling under permissive licenses when that tooling affects the competitive positioning of its most important AI partnership. The most charitable reading is that SkillOpt is infrastructure for a future where the value in agentic systems lives in skill libraries and the tooling around them, not in the models themselves — and Microsoft is trying to own that layer before the field settles. The less charitable reading is that the company is less interested in protecting any single model's advantage than in establishing skill optimization as a standard practice that drives demand for the kinds of inference infrastructure Azure can provide at scale.

There are legitimate reasons for skepticism. The 52-for-52 result across all benchmark-harness combinations is an extraordinary claim that warrants scrutiny of whether the baselines were adequately competitive and whether the benchmark suite is broad enough to generalize. The paper itself notes that the optimization is bounded — the textual learning rate, the edit budget, and the validation gate all constrain how far the skill can drift from its starting point. If those constraints are too tight, the method may be stable but limited. If they are too loose, it falls back to the chaotic self-revision the authors identify as the existing problem.

The zero-overhead claim at deployment is the one that matters most for commercial adoption. An enterprise deciding whether to build on SkillOpt needs to know that the inference pipeline runs exactly as it would without the skill file, plus the cost of injecting the document into context. The code confirms this is the case. The optimizer model is offline. The deployed system is the target model plus a static text file.

That is a real result. The question is whether it generalizes beyond the benchmark suite to the actual production environments where readers of this publication would deploy it. The code audit answers the structural question. The field test is still running.

What to watch: the GitHub activity shows the project moving from research artifact toward developer infrastructure — Qwen optimizer backend merged June 1, test infrastructure added May 31. Whether production teams are actually adopting SkillOpt in agentic pipelines, and whether the 52-for-52 result survives contact with real workloads, is the next reporting question when that signal arrives.

The LLM Agent That Gets Better by Rewriting Its Own Manual

Sources