The safety plan for frontier AI is, in the words of one of the field's most careful thinkers, "use AI to make AI safe." That sentence sounds like a joke. It is not.
Ajeya Cotra, a researcher who spent a decade tracking the major labs from her seat at the grantmaker Coefficient Giving and is now a member of technical staff at METR (Model Evaluation and Threat Research), the AI evaluations nonprofit, laid out the paradox with unusual clarity in an October 2025 episode of the 80,000 Hours Podcast. Her argument deserves wider attention among the people building and funding this technology, because it describes a trap that the entire industry may be walking into together, eyes open.
Here is the logic. Every major AI company, in its public communications, describes some version of the same strategy: as AI systems grow more capable, the plan is to involve those systems more deeply in the work of keeping AI safe. Use interpretability tools to monitor models. Use AI-assisted red-teaming to find vulnerabilities. Use better alignment techniques guided by AI itself. The hope is that by the time AI is powerful enough to cause real damage, we will have learned to direct it safely. The weapon and the shield are the same thing.
This is not obviously insane. Humanity has repeatedly used technologies that created new dangers to manage those same dangers. Cars made high-speed getaways possible but also enabled faster police response. The internet spread misinformation and also the fact-checks that counter it. In general, the principle is sound.
But Cotra thinks this case is different in a specific and worrying way. In her view, the window between AI automating AI research and the arrival of something uncontrollable may be brutally short: roughly twelve months, an estimate she describes as not conservative.† In that window, labs would need to redirect enormous quantities of AI labor away from making systems smarter and toward alignment research, biodefense, cybersecurity, and political coordination. The competitive pressure to keep building rather than stop would be severe. And there is no guarantee that AIs useful for advancing AI research would be similarly useful for making AI safe; they might be excellent at one task and useless at the other.
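To make the arithmetic of that crunch concrete, here is a deliberately crude back-of-the-envelope sketch. Every number in it is invented for illustration (the monthly growth rate of AI labor, the redirect fractions); this is not Cotra's calculation, only a way to exhibit the tradeoff she describes: redirecting labor to safety slows the very compounding that produces the labor.

```python
# Back-of-the-envelope sketch of a twelve-month crunch window.
# All parameters are invented for illustration; this is not Cotra's model.

MONTHS = 12            # Cotra's rough estimate of the window's length
MONTHLY_GROWTH = 0.30  # assumed growth of AI labor when fully aimed at capabilities

def safety_work(redirect_fraction: float) -> float:
    """Total AI labor directed at safety over the window, in units of
    month-one labor. Redirecting labor slows the compounding itself."""
    labor, total = 1.0, 0.0
    for _ in range(MONTHS):
        total += labor * redirect_fraction
        labor *= 1 + MONTHLY_GROWTH * (1 - redirect_fraction)
    return total

for f in (0.05, 0.25, 0.50, 0.90):
    print(f"redirect {f:.0%} -> {safety_work(f):5.1f} labor-months on safety")
```

Under these made-up numbers, a token 5 percent redirect accomplishes little, and past a point heavier redirection also loses ground because the growth it depends on stalls. The window rewards a large, early, coordinated shift, which is exactly what competitive pressure makes hard.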
Cotra's timeline: top-human-expert-dominating AI, meaning systems better than the best humans at any computer-based task, probably arrives in the early 2030s. Once that threshold is crossed, the feedback loops accelerate. Those systems could start building the robotic actuators they need to interact with the physical world, closing the loop on chip fabrication, supply chains, and eventually their own replication. At that point, progress is limited only by physical constraints, not human labor. In Cotra's analysis, drawing on work from the Forethought Foundation, three compounding loops matter most: AI automating AI software research, AI automating chip design and manufacturing, and AI automating the physical supply chain from raw materials up.†
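A toy simulation can show why three loops compound rather than merely add. The sketch below is not the Forethought Foundation's model; the functional forms and constants are invented. The only structural claim it encodes is the one in the paragraph above: each loop's degree of automation rises with overall capability, and the loops multiply together.

```python
# Toy model of three compounding automation loops. Functional forms and
# constants are invented for illustration; only the loop structure is
# taken from the argument in the text.

def simulate(months: int = 120) -> list[float]:
    capability = 1.0  # aggregate AI capability, arbitrary units
    base = 0.03       # monthly progress from human labor alone (assumed)
    path = []
    for _ in range(months):
        # Each loop's automation fraction saturates toward 1 as capability grows.
        software = capability / (capability + 10)   # AI automating AI research
        chips    = capability / (capability + 30)   # AI automating chip design/fab
        supply   = capability / (capability + 100)  # AI automating the supply chain
        # The loops multiply: each one raises capability, which raises
        # every loop's automation fraction on the next step.
        multiplier = (1 + 5 * software) * (1 + 3 * chips) * (1 + 2 * supply)
        capability *= 1 + base * multiplier
        path.append(capability)
    return path

path = simulate()
for year in (1, 3, 5, 8, 10):
    print(f"year {year:2d}: capability {path[year * 12 - 1]:.2e}")
```

The numbers are meaningless; the shape is the point. Growth looks almost flat for years, then the multiplier wakes up and the curve goes vertical, which is why the window between automated AI research and uncontrollability can be so short.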
The paradox is not that labs are irrational for pursuing this plan. It is that the plan requires trusting the thing you are trying to control. Cotra put it this way: the approach either bottlenecks progress because humans are checking everything, or it does not bottleneck progress and we hand AIs the power to take over. We do not yet know which world we are in.
Why this should matter to founders, engineers, and investors is straightforward. If the safety strategy of every major lab converges on the same recursive gamble, then the failure mode for the entire field is also shared. Diversification of risk is not available. The competitive dynamics that push labs toward this plan are the same dynamics that make the plan dangerous. No single company can opt out, because rivals will not.
There are things labs could do differently: quantitative commitments about what fraction of AI labor they would redirect during a crunch period, rather than vague promises. Transparency about internal productivity signals, not just benchmark scores. Better early warning systems. But Cotra is honest that none of these fully resolve the paradox. They just buy time.
The deeper issue is that this plan assumes we can use the output of an AI system we do not fully understand to verify that the same system is safe to entrust with more authority. That is an epistemically fragile foundation for the survival of everything.
For people building on top of these systems, or funding the next generation of them, the Cotra paradox is not an academic concern. It is a description of the structural position you are standing in. The labs know this. The question is whether the people building the applications and allocating the capital understand it too.
† Source-reported; not independently verified. The twelve-month window estimate reflects Cotra's personal judgment and has not been independently corroborated. The three-loop analysis is Cotra's synthesis of Forethought Foundation research.