The White House told Anthropic to make its AI jailbreak-proof. The math says it can't.


When Anthropic pulled its most advanced model offline last week under U.S. export-control pressure, administration officials framed the move as a routine national security fix. The company could rerelease the model, WIRED reported, but only after addressing alleged vulnerabilities that officials said left the system open to "jailbreaks," the term for prompts that trick a model into ignoring its safety rules.

The problem, per WIRED's Inner Loop newsletter, is that the administration has shifted from arguing about the significance of the jailbreak risk to ordering a remediation that its own technical staff reportedly believes may not be achievable. The NSA, WIRED reported, has internally concluded that there are ways to disable the model's guardrails, which are meant to keep the underlying system from producing cybersecurity, chemistry, and biology content that could be misused.

The dispute now centers on what the administration has framed as "Anthropic's problem to fix." A Monday technical meeting between the company, the Commerce Department, and the Office of the National Cyber Director, led by Sean Cairncross, became the venue for that framing. According to WIRED, Anthropic has argued publicly that the jailbreak concerns are overblown and that the practical effects are minimal, a position the company reiterated in that meeting.

The gap between the political demand and the technical reality is the story. "Block every jailbreak" reads as a remediation checklist item. In practice, resistance to jailbreaks on a frontier model is a probabilistic property: it can be measured, improved, and stratified by use case, but it cannot be reduced to zero without also reducing the model's general capability to a point that defeats the purpose of shipping it. That is a research-community consensus, not a vendor talking point, and it is why AI labs typically publish model cards with known limitations and red-team results rather than safety guarantees.
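The arithmetic behind that claim can be sketched in a few lines of Python. This is an illustration, not a model of any real system: the function name, the per-attempt success rate, and the assumption that attempts are independent are all hypothetical.

```python
def prob_at_least_one_success(per_attempt_rate: float, attempts: int) -> float:
    """Chance that at least one of `attempts` tries gets through.

    Assumes every attempt succeeds independently with the same probability,
    a simplification chosen for illustration only.
    """
    if not 0.0 <= per_attempt_rate <= 1.0:
        raise ValueError("per_attempt_rate must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    # All attempts fail with probability (1 - p)^n; the complement is the
    # chance that at least one succeeds.
    return 1.0 - (1.0 - per_attempt_rate) ** attempts


if __name__ == "__main__":
    # Hypothetical figures: a defense that stops 999 of every 1,000 attempts.
    for n in (1, 100, 10_000):
        print(n, prob_at_least_one_success(0.001, n))
```

Under these assumptions, a defense that stops 999 of every 1,000 attempts still leaves a persistent attacker with better than 99 percent odds across 10,000 tries. Only a per-attempt rate of exactly zero keeps the cumulative figure at zero.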

Export controls are a familiar lever for dual-use technologies, and they are not new to AI. The U.S. has previously used them to force model downgrades before international releases. The current episode pushes that lever into new ground. Past moves changed what a model could do at the weights level. The current demand would, in effect, require a lab to demonstrate that a deployed model cannot be steered into prohibited outputs by any adversarial prompt, a bar that does not appear in any frontier model safety report from any major lab, including Anthropic's own Claude system documentation.
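One way to see why that bar is hard to certify: a red-team campaign that finds no working jailbreak still only bounds the success rate from above. The sketch below uses the standard zero-failure binomial bound. It assumes test prompts are independent draws from a fixed distribution, which real adversarial prompts are not, and the function name is hypothetical.

```python
def upper_bound_after_clean_run(trials: int, confidence: float = 0.95) -> float:
    """Upper confidence bound on the true success rate after zero observed
    successes in `trials` independent attempts.

    Solves (1 - p)^trials = 1 - confidence for p. For 95 percent confidence
    this is close to the "rule of three" approximation, 3 / trials.
    """
    if trials < 1:
        raise ValueError("trials must be at least 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / trials)


if __name__ == "__main__":
    # Hypothetical campaign sizes; every attempt is assumed to have failed.
    for n in (100, 1_000, 100_000):
        print(n, upper_bound_after_clean_run(n))
```

By this calculation, 1,000 failed attempts in a row are statistically consistent with a true success rate of roughly 0.3 percent. More testing shrinks the bound, but no finite number of tests brings it to zero.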

If the administration holds the line, the most likely outcomes are a delayed rerelease, a model with narrowed capabilities, or a public fight about what counts as "addressed." If it relents, it will signal that export controls cannot enforce a jailbreak-immunity standard, which is what the technical community has already told officials. Either result forces the policy world to say something it has not yet had to say out loud: for frontier AI, the export-control playbook is now aimed at a moving target whose safety properties are statistical, not binary.

