How We Build Evals for Deep Agents
When LangChain wanted to understand why their coding agent was failing, they turned to their own eval loop first. Not a new model — the harness. The scaffolding that connects the model to the task environment. The company has been documenting the results in a two-part blog series, and while the benchmark numbers are getting the attention, the methodology is what actually holds up to scrutiny.
LangChain claims their deepagents-cli scored 66.5 percent on Terminal Bench 2.0, a public benchmark with 89 tasks spanning machine learning, debugging, and biology. But that figure does not appear among the leaderboard's top 25 entries, where GPT-5.2-Codex holds fifth place at 0.640, or 64 percent. The 52.8 percent baseline LangChain cites as a starting point is not independently verifiable, because historical scores are not archived. The post also does not isolate which harness changes drove the 13.7-point gain. Secondary coverage calling this a Top 5 finish is wrong: the reported score appears nowhere in the top 25 entries of the public board. What LangChain has produced is a methodology, not a provable benchmark result, and the methodology is the more durable story.
The eval taxonomy LangChain uses to drive that loop is worth reading carefully. Six categories — file_operations, retrieval, tool_use, memory, conversation, and summarization — map to the concrete failure modes their team encounters in daily use. The five metrics that follow are equally concrete: correctness, step ratio, tool call ratio, latency ratio, and solve rate. This is not evaluation theater. It is a framework for identifying where agents break, built from actual production errors.
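To make the five metrics concrete, here is a minimal Python sketch. The metric names come from the taxonomy above, but their definitions are assumptions here: each ratio compares what the agent did against a hand-picked reference trajectory, and solve rate is the fraction of runs graded correct. The `EvalRun` record and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EvalRun:
    """One graded agent run (hypothetical record; field names are assumptions)."""
    category: str           # e.g. "file_operations" or "retrieval"
    correct: bool           # did the grader accept the final result?
    steps: int              # steps the agent actually took
    ideal_steps: int        # steps in the reference trajectory
    tool_calls: int
    ideal_tool_calls: int
    latency_s: float
    ideal_latency_s: float


def ratios(run: EvalRun) -> dict:
    """Observed / reference ratios; 1.0 means the agent matched the reference."""
    return {
        "step_ratio": run.steps / run.ideal_steps,
        "tool_call_ratio": run.tool_calls / run.ideal_tool_calls,
        "latency_ratio": run.latency_s / run.ideal_latency_s,
    }


def solve_rate(runs: list) -> float:
    """Fraction of runs graded correct (0.0 for an empty list)."""
    return sum(r.correct for r in runs) / len(runs) if runs else 0.0


runs = [
    EvalRun("file_operations", True, 6, 4, 5, 4, 12.0, 8.0),
    EvalRun("retrieval", False, 9, 3, 7, 2, 30.0, 10.0),
]
print(ratios(runs[0]))
print(solve_rate(runs))
```

Under these assumed definitions, a ratio above 1.0 flags a run that solved the task but took more steps, tool calls, or time than the reference.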
"We dogfood our agents every day," LangChain's team wrote on their blog. "Every error becomes an opportunity to write an eval and update our agent definition and context engineering practices." That production-error-to-test-case loop is the methodological core of the piece — and, it must be said, a rather elegant recursive structure: an AI company using AI to evaluate AI, with the evaluation criteria growing as the AI makes mistakes. It is also, crucially, auditable by anyone who wants to replicate it.
LangChain runs these evals in pytest via GitHub Actions, so changes to the agent definition trigger a clean, reproducible evaluation run in CI. That is standard practice for software infrastructure but unusual for agent development workflows. It means LangChain is treating agent behavior as a deterministic property to be tested, rather than a probabilistic one to be monitored. Single-step evals, which target failures in the model's first turn, constitute roughly half their internal test suite. That is both encouraging, because first-turn failures are easy to diagnose, and humbling, because it means the model is often not getting far before something breaks.
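As an illustration of what a single-step eval in pytest could look like, the sketch below checks only the agent's first decision. The `first_tool_call` function is a hypothetical stub standing in for a real agent call; it is not the deepagents API, and the test case itself is invented for illustration.

```python
# test_single_step.py: pytest collects any function whose name starts with test_.

def first_tool_call(prompt: str) -> dict:
    """Hypothetical stub: return the first tool call an agent would make."""
    if "read" in prompt.lower():
        return {"tool": "read_file", "args": {"path": "README.md"}}
    return {"tool": "write_todos", "args": {"todos": []}}


def test_reads_file_before_editing():
    # Single-step eval: inspect only the first decision, not the full run.
    call = first_tool_call("Read README.md and fix the typo in the title.")
    assert call["tool"] == "read_file"
    assert call["args"]["path"] == "README.md"
```

A CI job would then run `pytest` on each change to the agent definition. The workflow file is omitted here because the posts, as summarized above, do not describe it.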
The meta-issue underneath all of this is grading. Anthropic documented the problem precisely: Claude Opus 4.5 initially scored 42 percent on CORE-Bench, but the issue was evaluator rigidity — expecting 96.124991 when the answer was 96.12 — not a model failure. Anthropic uses three grader types: code-based, model-based, and human. The plurality of approaches reflects a genuine epistemological problem: when the task is writing code that fixes a failing test suite, the test suite has to be unambiguous. Many agent benchmarks are not. This is the implicit context for any harness result, including LangChain's — benchmarks measure what they measure, and the measurement itself is often the variable.
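The rigidity problem is easy to reproduce. The sketch below contrasts an exact-match grader with a tolerance-based one, using the two numbers from the CORE-Bench example. It is not Anthropic's or CORE-Bench's grading code, and both function names and the 0.1 percent tolerance are arbitrary assumptions chosen for illustration.

```python
import math


def grade_exact(expected: float, submitted: str) -> bool:
    """Rigid grader: accepts only the exact string form of the expected value."""
    return submitted.strip() == str(expected)


def grade_numeric(expected: float, submitted: str, rel_tol: float = 1e-3) -> bool:
    """Tolerant grader: accepts answers within a relative tolerance."""
    try:
        value = float(submitted)
    except ValueError:
        return False  # non-numeric answers are rejected
    return math.isclose(value, expected, rel_tol=rel_tol)


print(grade_exact(96.124991, "96.12"))    # False
print(grade_numeric(96.124991, "96.12"))  # True
```

The right tolerance is task-specific: too loose and the grader accepts wrong answers, too tight and it reproduces the original problem.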
What powers Fleet and Open SWE
Deep Agents is the open-source harness that powers LangChain Fleet and Open SWE. Open SWE maps architectural patterns from Stripe Minions, Ramp Inspect, and Coinbase Cloudbot, and ships with approximately 15 curated tools: execute, fetch_url, http_request, commit_and_open_pr, linear_comment, slack_thread_reply, plus Deep Agents built-ins like read_file, write_file, edit_file, ls, glob, grep, and write_todos. The tool count is deliberate. Open SWE is not a general-purpose agent — it is a focused coding agent with a constrained surface area. The decision to limit scope rather than chase tool breadth is a bet that reliability beats capability coverage, at least for internal deployment contexts.
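One way to picture a constrained tool surface is an allowlist checked before every call. The tool names below are the ones listed above; the `dispatch` function and the handler table are hypothetical and say nothing about how Open SWE actually registers or invokes its tools.

```python
# Tool names as listed in the article; everything else here is illustrative.
ALLOWED_TOOLS = frozenset({
    "execute", "fetch_url", "http_request", "commit_and_open_pr",
    "linear_comment", "slack_thread_reply",
    "read_file", "write_file", "edit_file", "ls", "glob", "grep", "write_todos",
})


def dispatch(tool: str, handlers: dict, **kwargs):
    """Run a tool only if it is on the allowlist (hypothetical dispatcher)."""
    if tool not in ALLOWED_TOOLS:
        raise PermissionError(f"tool not allowed: {tool}")
    return handlers[tool](**kwargs)


# A toy handler standing in for a real implementation.
handlers = {"ls": lambda path=".": f"listing {path}"}
print(dispatch("ls", handlers, path="src"))
```

The design choice this illustrates is that every call outside the fixed set fails loudly, which keeps the set of behaviors an eval suite has to cover small.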
SWE-bench Verified — which gives agents GitHub issues from popular Python repositories and grades solutions by running the test suite — has seen LLMs progress from 40 percent to over 80 percent in one year, per Anthropic's data. The harder question is whether the benchmark is keeping pace with real-world task complexity, or whether the easy gains have already been harvested.
The read
LangChain has produced a methodology post dressed up as a benchmark story. The benchmark numbers are murky — the 66.5 percent score LangChain claims is not visible in the top 25 entries on the Terminal Bench 2.0 leaderboard, the 52.8 percent baseline cannot be independently verified, and the Top 5 framing in secondary coverage is simply wrong. What the posts actually offer is an eval taxonomy and a daily dogfood loop that other teams can study and replicate.
For builders, the takeaway is architectural: if you are evaluating agent infrastructure and the conversation starts with which model to use, you are asking the second-order question first. The first-order question is what your harness is doing to the model's context, retries, and tool call formatting — and whether you have a systematic way to measure changes to it. The eval taxonomy LangChain uses — six capability categories, five metrics, pytest in CI — is a starting framework for exactly that.
For investors, the signal is different: the moat in agent infrastructure may increasingly live in the harness layer, not the model layer. The differentiation is in the operationalizing — the test suites, the CI integration, the eval taxonomy, the daily dogfood loop — not the API call. SWE-bench Verified's 40-to-80-percent trajectory suggests the easy gains in raw model capability have mostly been harvested. What remains is the engineering around the model, and that engineering is increasingly measurable.