The demo works. The production agent dies silently at 2 a.m., and the reason has nothing to do with the model's intelligence.
The reason is context: the full set of tokens the model sees at inference, including system instructions, tool definitions, MCP server payloads, retrieved documents, prior turns of the conversation, and the response from every tool the agent has called. For a one-shot question, that list is short. For a long-running agent operating over a real codebase, a Notion workspace, or a customer support inbox, the list can swell past a million tokens in a few steps. Most enterprise agents do not have a million-token context window. The ones that do, as StackOne reported from a Factory.ai analysis published in August 2025, find that indiscriminately filling that window with retrieved material degrades reasoning quality rather than improving it.
This is the bottleneck almost nobody is talking about. Capability has advanced faster than the substrate that has to carry it. Anthropic, the AI company behind Claude, published a post on its engineering blog on September 29, 2025 that frames the situation plainly: context is a finite resource, and the engineering question is how to spend those tokens. Anthropic calls the discipline that answers that question "context engineering," and argues it is the natural successor to prompt engineering for agent-shaped work. The problem outgrew single-turn prompts; the discipline has to grow with it.
The clearest statement of the gap between agent demos and shipped product comes from a March 14, 2026 analysis by Digital Applied, a software engineering services firm, which synthesizes two industry surveys: a Gartner 2025 figure that 85% of AI projects fail to reach production, and a McKinsey 2025 State of AI finding that fewer than 20% of AI pilots scale within 18 months. Digital Applied's read of those numbers is that an estimated 88% of AI agent projects never reach production, with an average direct abandonment cost of around $340,000. Both underlying surveys are cited here secondhand, through Digital Applied, and the 88% / $340K framing is Digital Applied's synthesis rather than a primary statistic, but the pattern lines up with the qualitative accounts below.
The most vivid of those accounts comes from StackOne, a SaaS integration vendor, on January 20, 2026, in a post that gave the failure mode a name: "agent suicide by context." The pattern StackOne describes is an agent taking a reasonable action, like fetching a Notion page, and getting back a 200,000-token response because the page embeds several large databases. The agent then uses that response as input to its next tool call. The second tool's output is bigger. The third is bigger still. The context window fills, the next inference is truncated or rejected, and the agent silently stops making progress on the user's actual request. There is no thrown error. The model did exactly what it was asked. The harness let it drown.
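One way to make that failure loud rather than silent is to have the harness account for every block before it enters the context. The sketch below is illustrative only, not StackOne's code: the `ContextBudget` class, the 200,000-token limit, and the four-characters-per-token estimate are all assumptions, and a real harness would count with the model's own tokenizer.

```python
from dataclasses import dataclass, field


class ContextBudgetExceeded(Exception):
    """Raised when a new block would overflow the context window."""


def estimate_tokens(text: str) -> int:
    # Crude heuristic (assumption): roughly 4 characters per token.
    return max(1, len(text) // 4)


@dataclass
class ContextBudget:
    limit: int                       # hypothetical window size, in tokens
    used: int = 0
    log: list = field(default_factory=list)

    def add(self, label: str, text: str) -> None:
        cost = estimate_tokens(text)
        if self.used + cost > self.limit:
            # Fail loudly instead of letting the next inference be truncated.
            remaining = self.limit - self.used
            raise ContextBudgetExceeded(
                f"{label}: needs {cost} tokens, only {remaining} left"
            )
        self.used += cost
        self.log.append((label, cost))


budget = ContextBudget(limit=200_000)
budget.add("system_prompt", "You are a support agent. " * 40)

overflow_message = None
try:
    # Stand-in for an oversized tool response (about 250,000 estimated tokens).
    budget.add("notion_page", "x" * 1_000_000)
except ContextBudgetExceeded as err:
    overflow_message = str(err)
```

The point of the design is only that the overflow becomes an exception the harness can catch, log, and route to one of the patterns below, rather than a truncated inference nobody sees.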
The fix is a different shape of harness. Three patterns are coming together as the infrastructure layer that the agent demos of 2024 were quietly missing.
The first is sub-agent isolation. Instead of one long-lived agent accumulating state, the work is split across a small fleet of short-lived sub-agents, each with its own focused context window. Anthropic's post makes this the headline pattern: each sub-agent gets only the system prompt, tools, and intermediate state it actually needs, and the parent agent orchestrates them by reference rather than by ingesting their full transcripts.
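A minimal sketch of the idea, under stated assumptions: `fake_model` stands in for a real LLM call, and `run_sub_agent` and `orchestrate` are hypothetical names rather than anything from Anthropic's post or SDK. Each sub-agent builds and discards its own context, and the parent keeps only a short summary.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SubAgentResult:
    task: str
    summary: str          # the only thing the parent keeps
    transcript_len: int   # size of the discarded transcript, in characters


def fake_model(messages: List[str]) -> str:
    # Stand-in for an LLM call (assumption). In this sketch messages[1]
    # is always the task, so the "answer" is derived from it.
    return f"done: {messages[1][:40]}"


def run_sub_agent(task: str, system_prompt: str,
                  model: Callable[[List[str]], str]) -> SubAgentResult:
    # Fresh context per sub-agent: nothing is inherited from siblings.
    context = [system_prompt, task]
    for step in range(3):  # pretend multi-step tool use
        context.append(f"tool output {step}: " + "data " * 500)
    summary = model(context)
    transcript_len = sum(len(m) for m in context)
    return SubAgentResult(task, summary, transcript_len)


def orchestrate(tasks: List[str]) -> List[str]:
    parent_context = ["You coordinate sub-agents."]
    for task in tasks:
        result = run_sub_agent(task, "You handle one task.", fake_model)
        # The parent ingests the summary, never the full transcript.
        parent_context.append(result.summary)
    return parent_context


parent = orchestrate(["index repo", "summarize open tickets"])
```

Here each sub-agent's transcript runs to several thousand characters, while the parent's context grows by one short line per task.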
The second is compression, which Anthropic calls "compaction": the periodic summarization of older turns, tool outputs, and intermediate reasoning so they can be re-injected as a single condensed block, freeing the active window for new work. Anthropic's own production Claude agent, the post notes, uses compaction to operate over long horizons without losing the shape of the task.
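As a rough illustration of compaction, not Anthropic's implementation: the sketch below folds older turns into a single summary block once an estimated token threshold is crossed, and keeps the most recent turns verbatim. The `compact` function, the `naive_summarize` placeholder, and the token heuristic are assumptions; a production system would typically ask the model itself to write the summary.

```python
from typing import Callable, List


def estimate_tokens(text: str) -> int:
    # Crude heuristic (assumption): roughly 4 characters per token.
    return max(1, len(text) // 4)


def naive_summarize(turns: List[str]) -> str:
    # Placeholder summarizer (assumption): keeps the first 30 characters
    # of each turn. A real system would call the model here.
    heads = [t[:30] for t in turns]
    return f"[summary of {len(turns)} earlier turns] " + " | ".join(heads)


def compact(history: List[str], max_tokens: int, keep_recent: int,
            summarize: Callable[[List[str]], str] = naive_summarize) -> List[str]:
    total = sum(estimate_tokens(t) for t in history)
    split = len(history) - keep_recent
    if total <= max_tokens or split <= 0:
        return history  # under budget, or nothing old enough to fold
    older, recent = history[:split], history[split:]
    # One condensed block replaces all older turns.
    return [summarize(older)] + recent


history = [f"turn {i}: " + "x" * 400 for i in range(10)]
compacted = compact(history, max_tokens=500, keep_recent=2)
```

With these inputs, ten turns collapse to three entries: one summary block followed by the last two turns unchanged.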
The third is what a preprint posted to arXiv on November 27, 2025 (2511.22729) calls a "memory pointer" pattern. Rather than copying a large tool response into the LLM's context at all, the agent stores the response in a local file or object store and passes the model a pointer (a path, an identifier, a tiny stub) that it can later resolve with a small retrieval call when it actually needs the data. The pattern is framework-agnostic. The paper's authors tested it in a materials science workflow where a naive agent design consumed 20.8 million tokens and failed, while the memory-pointer version used 1,234 tokens, a reduction of more than 16,000x. The paper also reports about a 7x reduction in a separate comparative experiment. The result is a single case study, not a general benchmark, and the paper is a preprint whose peer-review status is not established. Even so, it is the most concrete quantified result in the new infrastructure layer.
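The mechanics can be sketched in a few lines. This is an illustration under assumptions, not the preprint's code: `store_result`, `resolve`, the temp-directory store, and the stub's field names are all hypothetical. The large payload goes to disk, and only a small stub is placed in the model's context.

```python
import json
import tempfile
import uuid
from pathlib import Path

# Hypothetical local store; an object store would serve the same role.
STORE = Path(tempfile.mkdtemp(prefix="agent_store_"))


def store_result(payload: str) -> dict:
    """Write the payload to disk and return a small stub for the context."""
    pointer_id = uuid.uuid4().hex
    (STORE / f"{pointer_id}.txt").write_text(payload, encoding="utf-8")
    return {
        "pointer": pointer_id,
        "chars": len(payload),
        "preview": payload[:80],
    }


def resolve(pointer_id: str, start: int = 0, length: int = 500) -> str:
    """Fetch only the slice the agent actually needs."""
    text = (STORE / f"{pointer_id}.txt").read_text(encoding="utf-8")
    return text[start:start + length]


# Stand-in for a large tool response (several hundred thousand characters).
big_response = json.dumps({"rows": list(range(50_000))})

stub = store_result(big_response)
context_entry = json.dumps(stub)   # this is all the model would see
snippet = resolve(stub["pointer"], start=0, length=20)
```

The stub is a couple of hundred characters regardless of how large the stored response is, and `resolve` would be exposed to the model as a small retrieval tool so it can read slices on demand.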
The pattern is already shipping in working code. A post on AWS's dev.to blog walks through an implementation using Strands Agents, the open-source agent framework Amazon released, that keeps a 145KB response out of the LLM's context window entirely. The same post notes that the pattern works with LangGraph, the agent framework from LangChain, and AutoGen, Microsoft's multi-agent framework, and links to a published code sample under aws-samples/sample-why-agents-fail, a repository named for the failure mode it is designed to prevent.
The picture that emerges is a quiet shift in where the engineering effort for production agents actually goes. The model is the easy part. The hard part is the harness around the model: how state is partitioned across sub-agents, when and how it is compressed, and which results are pointers versus which results are copied into the active context. Anthropic's post, the arXiv preprint, the AWS implementation, and StackOne's taxonomy of failure patterns are all reaching for the same underlying insight: prompt engineering, on its own, was the wrong abstraction for a workload where the model accumulates state across many tool calls and many turns, and where the budget for that state is a hard, finite number of tokens per inference.
The thing to watch over the next two quarters is whether the framework vendors start shipping these patterns as defaults. Strands Agents, LangGraph, and the Anthropic SDK all expose some version of sub-agent isolation, compaction, and memory pointers today. None of them makes those the default for a freshly scaffolded agent. When that changes, the 88% number is the one that will move. Until then, expect more late-night incidents where a working demo and a broken production agent differ by nothing but the size of the context the agent had to carry.