Affirm budgeted $200,000 for AI coding tools and spent one week of engineering time on the rollout. What broke wasn't the AI; it was the CI pipeline (the automated test-and-build system) the AI was supposed to work inside.
Enough AI agents submitting pull requests in rapid succession drove up Buildkite queue wait times. Weeks after the rollout concluded, the sustained load brought down Affirm's internal test quarantine service and the broader automated-testing infrastructure, according to the company's own post-mortem on its engineering blog. The bottleneck wasn't the AI writing code; it was the twelve-year-old monorepo that couldn't absorb the velocity the agents unlocked.
"The single most-cited friction point in the engineering-wide survey we ran was our change review process," the team wrote, with roughly 40% of respondents raising it unprompted. Full end-to-end regression runs on ephemeral environments took an "excruciating" 100-plus minutes. That was true before the agents arrived.
What Affirm discovered is that the cost of AI coding adoption is front-loaded in infrastructure, not tooling. The company set a token budget of roughly $200,000 for the week, about $250 per engineer, and came in under budget at around 70%. The AI itself was cheap. Fixing the CI and review bottlenecks to absorb what the AI could produce cost considerably more in engineering time.
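The back-of-envelope math implied by those figures is worth making explicit. The headcount and dollar spend below are derived from the article's numbers, not stated by Affirm directly:

```python
# Back-of-envelope check of the reported token budget.
# Derived values (implied headcount, actual spend) are
# inferences from the article's figures, not Affirm's own.
weekly_budget = 200_000   # dollars, approximate
per_engineer = 250        # dollars per engineer for the week
utilization = 0.70        # "around 70% of budget"

implied_engineers = weekly_budget // per_engineer
actual_spend = weekly_budget * utilization

print(implied_engineers)         # 800
print(f"${actual_spend:,.0f}")   # $140,000
```

That implies an organization of roughly 800 engineers and about $140,000 in actual token spend for the week, which is what makes the contrast with the infrastructure cost so stark.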
Claude Code was the chosen instrument. Affirm's team built its entire AI-assisted workflow against that tool's primitives, standardizing on a single default toolchain rather than letting engineers fragment across multiple options. The bet was that a constrained, well-supported setup would get higher adoption than a flexible, anything-goes one.
The early evidence supports that call. But the infrastructure failure is a warning for any engineering organization treating AI coding rollout as a tool problem rather than a systems problem. You can give every engineer an agent tomorrow. What you cannot do tomorrow is rebuild the CI pipeline, fix the test suites, and retrain the reviewers to keep up with code arriving at machine speed.
What to watch: whether Affirm's reported 58% gain in PR volume holds as the infrastructure rebuild continues, and whether the gap between what the agents produce and what the human review process can absorb ever closes. If it doesn't, the agents will keep winning and the bottleneck will keep moving downstream.