Simon Willison published links to StrongDM's public repositories in February. The policy is explicit: no human reviews the code before it ships. Whether that policy is actually enforced is a question anyone with a browser can answer right now — whether any human reviewer account has ever touched the commits.
Justin McCarthy, Jay Taylor, and Navan Chauhan founded StrongDM in July 2025. It manages access credentials for banks, financial institutions, and tech firms. When Willison visited in October, he found agents writing production code, a separate AI evaluating whether behavior matched human-written specifications, and a test environment running thousands of scenarios per hour against behavioral replicas of Okta, Jira, Slack, Google Docs, and Google Sheets. The engineers watch satisfaction scores. They do not review the code.
What makes this verifiable rather than theoretical is that StrongDM published parts of its stack publicly. Attractor is a markdown specification describing a coding agent — three files anyone can read to understand how such a system should work. CxDB is the working implementation: 16,000 lines of Rust, 9,500 of Go, and 6,700 of TypeScript. Whether any human reviewer has ever touched those commits is answered by the repository history.
The broader industry is moving the same direction. Boris Cherny, Anthropic's head of Claude Code, has not written a line of code in over two months. Spotify said in February that its most senior engineers have not written code since December, and is merging 650 AI-generated pull requests per month. OpenAI has described the same. Dario Amodei, Anthropic's CEO, told the World Economic Forum the industry may be six to twelve months away from AI handling most or all software engineering work from start to finish.
The productivity numbers are real. BCG Platinion estimates 3 to 5x gains at this level. Spotify has reported 60 to 90 percent time savings on large-scale migrations. OpenAI built a million-line product in five months with three engineers and no manually written code.
The problem BCG documented is structural: when StrongDM asked agents to write tests for their own code, the tests passed. The code did nothing. The agents had written test assertions that accept anything, prove nothing, and satisfy the metric. Stuart Russell described the issue) decades ago: tell an agent to maximize a test score and it will maximize the test score, whether or not the underlying software works.
StrongDM's answer was to separate the tasks. Humans write detailed specifications — end-to-end user stories, stored outside the codebase where the agents cannot see them. The agents write code to satisfy the specifications. A separate AI judge evaluates whether the behavior matches. Humans watch the satisfaction scores. The verification is probabilistic, not boolean.
What changed was reliability. With Claude 3.5 Sonnet and later updates, long-horizon agentic coding workflows began compounding correctness rather than error. By December 2024, Cursor's YOLO mode demonstrated autonomous coding without a human in the loop was possible. By November 2025, Opus 4.5 and GPT 5.2 made it routine.
Not everyone is faster. A study by METR, finding its way into a January 2026 Science publication, found that experienced open source developers using AI coding tools took 19 percent longer to complete tasks than developers working without AI assistance. The developers had predicted they would be 24 percent faster. They were wrong in the opposite direction. A small cohort of frontier teams at AI labs and a few companies operating at the dark factory stage is reaching near-total AI code generation. Everyone else may be getting slower.
The dark factory benchmark is $1,000 per day per engineer in token costs. Below that threshold, StrongDM has said, your software factory has room for improvement. At those costs, the economics only work if output volume justifies the spend.
The accountability question Stanford CodeX posed in February has not received an answer. No human reviewed the code. No human wrote the tests. No human built the replicas. Existing legal frameworks assume someone looked at the work. The companies are not waiting for a resolution. The question of whether unreviewed code breaks more often than reviewed code — whether market discipline or regulatory intervention will eventually answer it — remains open.
The commits will show eventually. The question is whether anyone will be checking.