The History of Software Is the History of Delegating Adaptation. MOSS Just Went One Level Deeper.
A paper submitted to arXiv on May 21 claims that a self-improving agent system called MOSS raised its own score on a standardized agentic task benchmark from 25 percent, near the floor, to 61 percent in a single self-editing cycle, with no human involved. To put that in context: a 25 on a four-task benchmark means the agent was failing most tasks; a 61 means it was succeeding more often than not. The GitHub repository linked in the paper returns a 404. The code is not publicly available, and no independent researcher has replicated the result.
That empirical gap needs stating up front. Even so, MOSS, from researchers at the University of Science and Technology of China, Hong Kong University of Science and Technology, and Hong Kong Baptist University, makes a theoretical claim worth taking seriously: that to truly self-improve, agents must rewrite their own source code, not just adjust prompts, skills, or memory. The distinction matters. The harness layer (routing, dispatch, session state, hook ordering) lives in code, and no text edit can reach it. The comparison table in the paper's arXiv full text shows Hermes Agent, SkillClaw, GenericAgent, and EvoAgentX all reaching the text layer but none reaching the harness. MOSS claims to be the only one that does.
The mechanism: MOSS delegates code modification to an external coding agent (Claude Code, OpenAI Codex, DeepSeek-TUI, or OpenCode) selected at runtime by configuration. MOSS retains control over stage ordering and verdicts. Each candidate is verified by replaying a curated batch of production failures against it in ephemeral trial workers, then promoted via a user-consent-gated container swap with health-probe rollback. The evolution lifecycle is exposed as a conversational CLI, which the agent drives through the same interface a user would use for ordinary tasks.
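Because the code is not public, the control flow described above can only be sketched. The Python below is a minimal illustration of that flow, not MOSS's implementation: every name in it (Failure, replay, evolve, the stub agent, the pass threshold) is hypothetical, and a plain function stands in for both the harness code and the container that would be swapped.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical sketch: none of these names come from the paper.

@dataclass
class Failure:
    """One recorded production failure: the input and the expected result."""
    request: str
    expected: str

Handler = Callable[[str], str]                             # stand-in for harness code
CodingAgent = Callable[[Handler, List[Failure]], Handler]  # returns a candidate

def replay(candidate: Handler, batch: List[Failure]) -> float:
    """Fraction of the failure batch that the candidate handles correctly."""
    if not batch:
        return 0.0
    passed = sum(1 for f in batch if candidate(f.request) == f.expected)
    return passed / len(batch)

def evolve(
    current: Handler,
    agents: Dict[str, CodingAgent],
    agent_name: str,
    batch: List[Failure],
    user_consents: Callable[[], bool],
    health_probe: Callable[[Handler], bool],
    threshold: float = 1.0,  # assumed pass bar; the paper's criterion is not known here
) -> Tuple[Handler, str]:
    """Run one evolution cycle and return (active handler, verdict)."""
    agent = agents[agent_name]         # backend chosen by configuration
    candidate = agent(current, batch)  # delegation: the external agent writes the change
    if replay(candidate, batch) < threshold:
        return current, "rejected"     # the orchestrator, not the agent, owns the verdict
    if not user_consents():
        return current, "declined"     # promotion is gated on user consent
    if not health_probe(candidate):
        return current, "rolled_back"  # failed probe restores the previous version
    return candidate, "promoted"

# Toy usage: the current handler mishandles surrounding whitespace.
def buggy(request: str) -> str:
    return request.lower()

def stub_agent(current: Handler, batch: List[Failure]) -> Handler:
    # Stands in for a real coding agent; returns a hand-written fix.
    return lambda r: r.strip().lower()

batch = [Failure("  Hello ", "hello")]
active, verdict = evolve(buggy, {"stub": stub_agent}, "stub", batch,
                         user_consents=lambda: True,
                         health_probe=lambda h: True)
print(verdict)
```

The point of the sketch is the division of labor: the coding agent only proposes a candidate, while ordering, verification, consent, and rollback stay with the orchestrator.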
Every meaningful shift in software has been, at root, a negotiation over who does the adapting. Compilers moved translation from humans to machines. Higher-level languages moved abstraction from machine code to something a programmer could hold in her head. Prompt engineering moved capability from fixed code into flexible text. Each delegation expanded what the machine could own and compressed what the human had to specify. Source-level self-rewriting is the next delegation — and the reason it differs from prior ones is that it reaches a layer none of the others could touch.
What the paper calls the harness layer is not a marginal component. In production agentic systems, it is where routing decisions are made, where concurrent skills are coordinated, and where session state is maintained across turns. Failures at this layer (misrouted messages, hooks firing out of order, corrupted session state) are unreachable from the text layer. A prompt cannot fix a routing bug. A skill file cannot correct a dispatch failure. This is the gap the paper's comparison table illustrates: of the systems it lists, only MOSS reaches the harness.
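A toy example makes the point concrete. The Python sketch below assumes a hypothetical harness that runs a list of hooks in order; the hook names, the message format, and the token check are all invented for illustration and are not taken from MOSS or any of the systems compared. The bug is that routing runs before authentication, and changing the prompt does nothing to it.

```python
from typing import Callable, List

# Hypothetical harness: hooks run in list order and share a state dict.
Hook = Callable[[dict], dict]

def authenticate(state: dict) -> dict:
    state["authed"] = state.get("token") == "secret"
    return state

def route(state: dict) -> dict:
    # Routing reads a flag that authenticate() is supposed to have set already.
    state["route"] = "admin" if state.get("authed") else "guest"
    return state

def run_harness(hooks: List[Hook], system_prompt: str, msg: dict) -> dict:
    # The prompt travels with the state for the model's benefit,
    # but it plays no part in deciding the order in which hooks fire.
    state = dict(msg, system_prompt=system_prompt)
    for hook in hooks:
        state = hook(state)
    return state

BUGGY_ORDER = [route, authenticate]  # hooks firing out of order
FIXED_ORDER = [authenticate, route]  # a source-level fix

msg = {"token": "secret"}
before = run_harness(BUGGY_ORDER, "Be helpful.", msg)
text_fix = run_harness(BUGGY_ORDER, "Always authenticate before routing.", msg)
code_fix = run_harness(FIXED_ORDER, "Be helpful.", msg)

print(before["route"], text_fix["route"], code_fix["route"])
```

With the buggy ordering, the valid token is routed as a guest under either prompt; only reordering the hooks in code changes the outcome.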
This is a structurally different claim from the usual self-improvement framing, as The New Stack noted in its survey of competing systems. Most agent frameworks treat self-improvement as a matter of giving the model better instructions. MOSS treats it as a matter of changing the machine that interprets the instructions. The distinction matters because it determines what kinds of failures are fixable without a human in the loop.
The paper also makes an architectural argument that deserves attention independent of the empirical results. Source-level adaptation, the authors contend, is Turing-complete in a way that text-mutable adaptation is not — every text-artifact configuration space is a strict subset of what code can express. Text-layer fixes take effect through base-model compliance; the model must correctly read new instructions and behave accordingly. Code-level fixes take effect deterministically. And text-layer fixes erode under long-context drift; as prompts, skills, and memory entries accumulate, the model's adherence to any single piece of guidance dilutes. Source-level edits do not drift because they are encoded as behavior, not text to be re-read.
The practical implication, if the claims hold, is that every agentic framework currently in production that restricts self-improvement to the text layer is working under a structural capability ceiling. The ceiling is not a parameter you can tune. It is a boundary of the medium.
There is a significant gap in the verification chain. Because the linked repository is unavailable, the code cannot be inspected or rerun. The 0.25-to-0.61 jump was reported by the authors on self-selected OpenClaw tasks and has not been independently replicated. The evaluation methodology is described in the paper but has not yet been verified by external parties.
This matters for how to read the result. The theoretical contribution — that source-level adaptation is strictly more general than text-level adaptation, that the harness layer is unreachable from the text layer, that deterministic fixes do not erode under context drift — stands on its own as a systems-design argument. The empirical contribution requires verification. The delegation pattern itself is notable as a design signal: MOSS does not claim to be a better coding agent. It is a meta-orchestrator that instructs other coding agents to rewrite source while holding the stage ordering and verdict logic. If this pattern is sound, it implies the next competitive moat in agentic AI is not the coding model but the orchestration layer that decides when to call it and how to verify the result.
What makes the historical framing apt is that each prior delegation of adaptation produced a new category of tool, a new category of failure, and a new negotiation over what humans still needed to own. Compilers produced register allocation bugs alongside faster translation. Higher-level languages produced abstraction leaks alongside programmer productivity. Prompt engineering has produced reward hacking alongside capability gains. Source-level self-rewriting, if it scales, will produce its own category of failure: the class of bugs where an agent's rewrite of its own harness introduces a failure that cannot be reached from the text layer and leaves the agent unable to fix itself using the tools it has.
The paper describes safeguards. Evolution candidates are verified against a production-failure batch before promotion. The user must authorize the container swap. Health probes trigger automatic rollback. These are serious mitigations. Whether they are sufficient for production use at scale, and who gets to decide what "sufficient" means when the system can rewrite the rules by which it is judged, are questions the paper does not answer.
What is clear is that the boundary between the text layer and the code layer, which most self-improving agent frameworks treat as a permanent feature, is now being contested. Whether MOSS's specific implementation is the right answer is an open question. The broader question, whether that boundary should hold at all, is now on the table.
The code is not public. The benchmark is unverified. The architectural argument holds together on its own terms. The gap between what MOSS claims and what can be independently confirmed is the story.