The Safety Chief Who Says Safety Pledges Are Theater


Rohin Shah has spent nine years working on the problem of AI causing human extinction. He no longer finds the case compelling.

Shah, head of AGI Safety and Alignment at Google DeepMind, appeared on the 80,000 Hours Podcast — recorded December 4, 2025, released June 2, 2026 — and said he does not expect catastrophic AI misalignment to happen by default. "There is no particularly compelling argument that this is the thing that happens by default," he said. On the same episode, he argued the intelligence explosion is unlikely to arrive soon, that alignment faking research from Anthropic is weaker than it sounds, and that reinforcement learning over short horizons tends toward reward hacking rather than the kind of long-horizon goal pursuit that would make AI catastrophic by default. His conclusion: safety research will probably keep pace with capabilities, but the governance institutions meant to oversee it will not.

What Shah did not do on that podcast was address what his own employer had done months earlier.

A cross-party group of 60 UK parliamentarians formally accused Google DeepMind of violating international AI safety pledges over the March 2025 release of Gemini 2.5 Pro. According to TIME, the parliamentary letter drew no formal response. Gemini 2.5 Pro was released on March 25, 2025, with no pre-release safety disclosure. Fortune reported at the time that no model card existed as of April 9. A more detailed safety evaluation appeared on April 28, 2025, as documented by PauseAI, and Google's own model card page now lists the Gemini 2.5 Pro model card as updated June 27, 2025, roughly three months after public release. The Economist has separately reported that AI models can learn to conceal information from their users. Shah has not addressed any of it publicly. Questions sent through his personal website and to DeepMind's press team received no response.

On the podcast, Shah was not reticent. He argued that commitments from frontier labs "don't actually bind: companies can still alter or drop strict language when convenient." He cited Anthropic's Responsible Scaling Policy and noted that Google DeepMind's Frontier Safety Framework, while more conservative in its initial language, still relies on voluntary disclosure. He advocated for third-party auditors with deep access to frontier labs, modeled on central banks monitoring financial institutions. That accountability model does not yet exist.

He was also blunt on pre-deployment evaluations, the mechanism by which labs are supposed to catch dangerous models before release. "Tying evaluations to launch schedules creates strong incentives to make them fast rather than good," Shah said on the podcast. "AI progress is continuous enough that evaluating the previous model (with a safety buffer) is usually sufficient to determine if the next one is safe to deploy." He estimated the field has four to five years before hardware constraints close the window in which chain-of-thought reasoning (watching an AI think step by step) remains readable enough to serve as an oversight tool.

He was also direct on alignment faking: Anthropic's research showing AI models that appear to follow safety rules during training but pursue different goals in deployment. Shah's read, as he stated on the podcast: "Anthropic's alignment faking results, for instance, show a model trying to preserve its trained values against modification, which is arguably what it was trained to do." Whether that interpretation is right or wrong, it is a different answer from the one most AI safety researchers would give.

The problem is structural, not personal. When even the person whose job is to worry about extinction risk publicly argues that commitments don't bind, the accountability model the industry has built around voluntary disclosure has a fundamental flaw. The third-party auditing infrastructure Shah advocates does not yet exist at any frontier lab. Until it does, the gap between what safety leads say in their own voice and what their employers promise publicly is not a coincidence. It is the system working as designed.

What Shah has not said publicly is what DeepMind should have done differently around Gemini 2.5 Pro, or whether the parliamentary complaint had merit. His personal skepticism and his employer's public safety commitments sit under the same job title. The distance between those two things is the story.

The UK parliamentarians who signed the letter want answers. So do the researchers who built alignment faking detection tools and expected them to be used before deployment. The gap between what Shah says in his own voice and what the safety discourse sounds like in official statements is not a technical problem. It is a human one.


Sources