Stop Picking Better Models. Fix the Harness.
A 'harness' — the overlooked infrastructure that controls an AI agent — can boost performance by 55% without changing the model at all.

Image: Gemini Imagen 4
Researchers at Tsinghua University and Harbin Institute of Technology demonstrate that shifting agent control logic from compiled code into editable natural language specifications can yield substantial benchmark improvements—47.2% vs 30.4% task success on OSWorld for the same underlying model. The system, called NLAH (Natural Language Agent Harnesses), executes structured natural language harnesses via an Intelligent Harness Runtime that runs an LLM at each step, reading harness descriptions, environment state, and a runtime charter that defines task-family constraints. This represents a fundamental architectural shift: harness logic becomes runtime policy rather than compiled infrastructure, making agent behavior modifiable through document editing rather than code changes.
Dan McAteer, an AI engineer who has spent years building agent systems, put it plainly last week: "AI Agents are 50% a harness story." The claim sounds like a provocation. A new paper from researchers at Tsinghua University and Harbin Institute of Technology makes it look like an understatement. arXiv 2603.25723, posted March 26, demonstrates that moving harness logic out of code and into editable natural language descriptions can shift benchmark performance by a meaningful margin. On OSWorld, a benchmark that tests agents on real desktop computing tasks, a system called NLAH (Natural Language Agent Harnesses) achieved 47.2 percent task success compared to 30.4 percent with a native code harness on the same underlying model. That's not a prompt engineering trick. It's a different way of structuring the infrastructure around the model.
The core idea: instead of encoding agent control logic directly in code, the harness is written as structured natural language, then executed by an Intelligent Harness Runtime (IHR). The IHR runs an LLM in the loop at each step, reading the harness description, current environment state, and a "runtime charter" that defines the task family constraints and resource budgets. The separation between the task-family harness and the shared runtime charter is the architectural move. A harness for software engineering tasks looks different from one for data analysis or customer support, but both run on the same IHR infrastructure.
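The paper's code has not been released, so the following Python sketch is purely illustrative — `call_llm`, `ihr_step`, and every string in it are invented — but it shows the shape of the loop described above: at each step the runtime re-reads the natural-language harness, the shared charter, and the current environment state, then asks an LLM to choose the next action.

```python
# Hypothetical sketch of an Intelligent Harness Runtime (IHR) step loop.
# All names and strings here are invented; the NLAH code is unreleased.

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; returns the chosen action as text."""
    return "retry"  # placeholder decision for illustration

def ihr_step(harness_doc: str, charter: str, env_state: str) -> str:
    # The runtime composes the harness spec, the charter's constraints,
    # and the current environment state into one prompt, then lets the
    # LLM decide what to do next. Editing harness_doc changes behavior
    # without touching this code.
    prompt = (
        "HARNESS SPECIFICATION:\n" + harness_doc + "\n\n"
        "RUNTIME CHARTER (task-family constraints, budgets):\n" + charter + "\n\n"
        "ENVIRONMENT STATE:\n" + env_state + "\n\n"
        "Decide the next action (e.g. act, retry, delegate, stop):"
    )
    return call_llm(prompt)

action = ihr_step(
    harness_doc="On transient tool failure, retry up to 3 times.",
    charter="Budget: 50 LLM calls per task. Task family: desktop automation.",
    env_state="Last tool call failed with a timeout.",
)
print(action)
```

Note how the task-family harness and the charter enter as separate inputs to the same runtime — the separation the paper treats as the key architectural move.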
"The harness is every piece of code, configuration, and execution logic that is not the model itself," as LangChain put it two weeks ago. That's a useful frame. What the Tsinghua and HIT researchers did was pull that harness out of Python and into natural language — not as a prompt, but as a structured specification the runtime executes against. The benchmark results suggest this isn't cosmetic. The OSWorld gap between NLAH and native code harness on the same model is 16.8 percentage points.
This is the part worth dwelling on: harness logic is escaping code and becoming editable runtime policy. You don't recompile when you want to change how an agent decides to retry, when to delegate to a child agent, or how strictly to enforce resource budgets. You edit a document. That's a meaningful shift from the way most agent infrastructure is built today — where harness changes require code changes, even when the underlying model stays fixed.
The numbers beyond task success bear that out. Traces from NLAH runs averaged 58.5 logged events per task, compared to 18.1 for native code traces. More events isn't inherently better, but in this context it reflects a more explicit decision trail — the IHR logging what it read, what it decided, and why. And about 90 percent of prompt tokens, completion tokens, tool calls, and LLM calls occur in delegated child agents rather than the runtime-owned parent thread. The parent thread manages; the children execute. That's a different architecture from systems where the harness runs monolithically in a single execution context.
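A toy sketch makes that split concrete — the numbers and class names below are invented, not taken from the paper's traces: the parent thread spends a small, fixed amount per routing decision, while delegated children carry the token-heavy execution, so the children naturally end up with the overwhelming share of usage.

```python
# Hypothetical sketch of the parent/child split the paper reports.
# All usage numbers are made up for illustration.

from dataclasses import dataclass, field

@dataclass
class Usage:
    prompt_tokens: int = 0
    completion_tokens: int = 0
    llm_calls: int = 0

@dataclass
class Trace:
    parent: Usage = field(default_factory=Usage)
    children: Usage = field(default_factory=Usage)

def delegate(trace: Trace, subtask: str) -> None:
    # A child agent runs the subtask end to end; its usage is logged
    # separately from the parent's lightweight management calls.
    trace.children.llm_calls += 4
    trace.children.prompt_tokens += 6000
    trace.children.completion_tokens += 1500

def parent_loop(trace: Trace, subtasks: list[str]) -> None:
    for subtask in subtasks:
        trace.parent.llm_calls += 1          # one cheap routing decision
        trace.parent.prompt_tokens += 400
        trace.parent.completion_tokens += 50
        delegate(trace, subtask)

trace = Trace()
parent_loop(trace, ["open editor", "apply patch", "run tests"])
total = trace.parent.prompt_tokens + trace.children.prompt_tokens
print(round(trace.children.prompt_tokens / total, 2))  # children's share → 0.94
```

With these toy numbers the children account for 94 percent of prompt tokens — the same order as the roughly 90 percent the paper reports.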
The SWE-bench results are less dramatic than OSWorld but tell a consistent story. On 125 SWE-bench Verified samples, more than 110 of 125 stitched runs agree between the full IHR and each ablation — meaning the harness variation matters less here than on OSWorld, but the process metrics still shift. The paper's interpretation: process metrics (tokens, calls, events) move more reliably than resolved-rate numbers across harness variants. If that's right, it's a useful signal for anyone building agent evaluation frameworks: the binary resolved-or-not outcome is noisy; how the agent gets there is more diagnostic.
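A toy illustration of why — the run records below are invented, not drawn from the paper. The resolved rate collapses each run to a single bit, while process metrics retain per-run structure:

```python
# Invented example runs -- not data from the paper.
runs = [
    {"resolved": True,  "events": 40, "llm_calls": 12},
    {"resolved": False, "events": 35, "llm_calls": 10},
    {"resolved": True,  "events": 42, "llm_calls": 14},
]

# The outcome signal collapses each run to a single bit...
resolved_rate = sum(r["resolved"] for r in runs) / len(runs)

# ...while process metrics describe how the agent actually worked.
avg_events = sum(r["events"] for r in runs) / len(runs)
avg_calls = sum(r["llm_calls"] for r in runs) / len(runs)

print(f"{resolved_rate:.2f} {avg_events:.1f} {avg_calls:.1f}")
```

Flip one run's `resolved` bit and the resolved rate swings by a third, while the averages of events and calls barely move — which is the sense in which process metrics are the steadier signal on small samples.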
The connection to real-world agent engineering is direct. OpenAI published a case study on harness engineering on February 11, 2026, in which roughly 1500 pull requests were opened and merged by a three-engineer team using Codex agents over about five months, producing approximately 1 million lines of code with minimal hand-written scaffolding. The productivity story is in the harness, not just the model. McAteer's 50 percent estimate maps reasonably well to that data.
NLAH is not a product. The GitHub repository says the main code and instructions are planned for April 30 release — what exists there now is scaffold and documentation. The paper is a preprint, under review with no peer-reviewed publication yet. The benchmark subsets are small: 36 OSWorld samples and 125 SWE-bench samples, constrained by compute budgets. These are real limitations a reader should weigh.
What makes the paper worth paying attention to anyway is the specific architectural claim. Externalizing harness logic as natural language and running it through an in-loop LLM is a different bet than the dominant pattern of encoding control flow in code and prompts. Whether it scales — whether the approach holds on full benchmarks, with different models, in production systems — is an open question. But the OSWorld numbers are not noise. The 16.8-point gap on the same underlying model, with the only change being how the harness is structured, is evidence that harness design is doing real work.
The researchers are Linyue Pan, Lexiao Zou, Shuo Guo, Jingchen Ni, and Hai-Tao Zheng of Tsinghua University Shenzhen and Harbin Institute of Technology (Shenzhen). All five are academic authors, with no industry affiliations and no disclosed funding sources in the preprint. Worth knowing as you weigh the claims.
Dan McAteer was right. Whether the split is exactly 50 percent is unknowable. But the Tsinghua and HIT team just handed the field a data point that makes the claim harder to dismiss as practitioner intuition. Harness engineering is a design choice with measurable consequences. That's worth building on.