Anthropic Couldn't Watch Its Own Product Fail
Anthropic's post-mortem confirmed three bugs broke Claude Code for three weeks. None were caught internally. When it back-tested the worst one, the newest model found it. The older one missed it.

Anthropic published a technical post-mortem on Thursday explaining why its Claude Code coding tool had degraded for three weeks. The more revealing fact is what the post-mortem does not say: none of the three bugs that caused the degradation were caught by Anthropic's own systems before users noticed.
The company attributed the failures to three separate product-layer changes. On March 4, Anthropic lowered the default reasoning effort in Claude Code from high to medium, a change meant to shorten how long the model thinks before responding; it left complex tasks underthought. On March 26, a caching optimization meant to prune old reasoning from idle sessions shipped with a bug that cleared that reasoning on every turn rather than just once, making the model progressively more forgetful and repetitive over the course of a session. On April 16, a system prompt instruction designed to reduce verbosity cut coding evaluation scores by 3 percent for both Sonnet 4.6 and Opus 4.7.
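The caching failure is easy to picture. Here is a minimal sketch of that class of bug; the names and structure are hypothetical illustrations, not Anthropic's actual code:

```python
class Session:
    """Toy chat session that carries reasoning across turns.

    Hypothetical example of a prune-on-resume optimization that,
    through a missing condition, prunes on every turn instead.
    """

    def __init__(self):
        self.reasoning_cache = []       # reasoning accumulated across turns
        self.resumed_from_idle = True   # set when a session wakes from idle

    def prune_stale_reasoning(self):
        self.reasoning_cache.clear()

    def take_turn_buggy(self, user_msg):
        # BUG: prunes unconditionally, wiping accumulated context every turn
        self.prune_stale_reasoning()
        self.reasoning_cache.append(f"reasoning about: {user_msg}")
        return len(self.reasoning_cache)  # context available this turn

    def take_turn_fixed(self, user_msg):
        # FIX: prune exactly once, on resume from idle
        if self.resumed_from_idle:
            self.prune_stale_reasoning()
            self.resumed_from_idle = False
        self.reasoning_cache.append(f"reasoning about: {user_msg}")
        return len(self.reasoning_cache)
```

In the buggy path the available context stays flat at one entry per turn, while the fixed path accumulates, which is why a bug like this would present as growing forgetfulness over a long session rather than an outright crash.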
All three passed through Anthropic's internal testing, code review, and internal use of its own product undetected. The company caught and reversed each one only after external users posted reproducible evidence of failure. The underlying model weights, training, and core capabilities were fine. What failed was the product layer: the default settings, caching logic, and system prompt instructions that sit between the model and the user.
"The bugs lived in the harness, not the model," Anthropic wrote in its post-mortem. This is technically reassuring. It is also deeply strange. Anthropic is a company that has spent years arguing it can build and operate AI systems more responsibly than its competitors. It published a post-mortem this week admitting it could not watch its own product fail in real time.
One detail in the post-mortem cuts deepest: when Anthropic back-tested the offending caching change against its latest model, Opus 4.7 found the bug in automated code review. Opus 4.6 did not. The improved model caught the regression the previous model missed — but only after the regression had already shipped and user complaints forced a fix. Better AI did not prevent the failure. It documented it after the fact.
The same week as the post-mortem, on April 21, Anthropic quietly removed Claude Code from the $20 per month Pro subscription tier on its public pricing page. The company's head of growth described it as a test affecting approximately 2 percent of new signups, and reversed the change the following day after developer backlash on X and Reddit. The Register reported the removal and reversal. Two stumbles in one week. The first was an engineering failure. The second was a product and communications failure that suggests the reputational damage from the first is already distorting decision-making.
Anthropic's post-mortem does not answer why none of this was caught internally before users noticed. The company says its internal evaluations and dogfooding did not reproduce the issues. Two unrelated internal experiments — a server-side message queuing change and a display configuration that suppressed the caching bug in most command-line sessions — made the problems harder to reproduce even after users started reporting them. The explanation is plausible. It is also an admission that Anthropic's internal monitoring infrastructure did not see what paying users saw immediately.
The compensation Anthropic offered was a usage limit reset for all subscribers, applied April 23. Independent benchmark data tells a partial story. Marginlab, which runs daily coding evaluations against the live Claude Code command-line tool, recorded a 30-day pass rate of 58 percent and a daily rate of 64 percent as of April 24, though without a clean pre-March 4 baseline for comparison. BridgeMind reported that Claude Opus 4.6 fell from 83.3 percent accuracy to 68.3 percent on its tests between March and April, a drop that triggered viral posts describing the model as nerfed. Anthropic disputes the methodology behind those specific numbers. The broader pattern they reflect is not disputed.
The company is now promising changes: stricter controls on system prompt changes, broader evaluation suites run before every deployment, and a requirement that internal staff use the same public builds as customers. These are reasonable responses. They do not answer the harder question: whether the incentive structure of building and selling AI products is compatible with the kind of reliability and transparency that critical infrastructure requires.
Developers who stayed with Claude Code through the degradation period did so because the underlying model was never the problem. The harness was. Whether the harness improvements hold — and whether Anthropic can demonstrate that it has genuinely closed the monitoring gap — will determine whether the company that promised to build AI responsibly can actually operate that way at scale.
