US signs AI evaluation deals with Google DeepMind, Microsoft, xAI
The US government signed evaluation agreements with Google DeepMind, Microsoft, and xAI on Tuesday, formalizing its first concrete steps toward pre-release vetting of frontier AI models. At the same time, Kevin Hassett says the administration is actively studying a mandatory executive order modeled on FDA drug approval. The agreements, announced by the Center for AI Standards and Innovation (CAISI) at the Department of Commerce's National Institute of Standards and Technology, are the first formal evaluation partnerships with major frontier labs under what was, until recently, a largely dormant oversight body.
The shift is stark. Six weeks ago, the administration fired its own AI safety director, Collin Burns, within days of appointing him, because nobody in the White House had checked that he previously worked at Anthropic, the lab whose Mythos findings would soon force the administration's hand. Congress had approved $55 million for NIST AI research and up to $10 million for CAISI expansion just four months earlier.
What drove the reversal is not a change of philosophy. It is a model.
Anthropic's Mythos, a cyber evaluation system built to find software vulnerabilities, found nearly 300 in Firefox alone, roughly 20 times what an earlier Anthropic model found in the same target. It also uncovered a 27-year-old bug, since patched, in OpenBSD, an open-source operating system considered security-critical. Against a Firefox benchmark, Mythos produced working exploits in 181 of 200 autonomous attempts, an exploit rate above 90 percent; the prior flagship model, Opus 4.6, scored near zero on the same test. That jump in capability, from finding vulnerabilities to autonomously weaponizing them, is what regulators can no longer ignore.
Independent experts have treated the findings as credible, if carefully sourced. "The oldest we have found so far being a now-patched 27-year-old bug in OpenBSD," Anthropic's red team reported. Thousands of high-severity vulnerabilities across every major operating system and browser are now documented and publicly disclosed, most of them still unpatched. The NSA, according to Bloomberg, is already testing Mythos internally against Microsoft software and widely used codebases. That puts Microsoft in the unusual position of being simultaneously a CAISI evaluation subject, a Mythos launch partner, and one of the software vendors whose code the NSA is probing with the same model.
CAISI says it has completed more than 40 evaluations of frontier AI models, including state-of-the-art systems that remain unreleased. The agreements signed Tuesday formalize a voluntary evaluation framework: the labs consent to be assessed, and CAISI publishes what it finds. But voluntary frameworks have a ceiling. Independent analysts note that CAISI operates on roughly one-tenth the staff and budget of its UK counterpart, the AISI, which has for years been conducting the kind of adversarial model evaluation the US is now attempting to institutionalize.
The administration is not moving from conviction. It is moving from evidence it could no longer dismiss. Hassett's EO study remains conditional, a roadmap rather than a requirement, and one that faces the same voluntary-participation problem as the agreements themselves. But the documents on both the CAISI evaluations and the Mythos findings are now public. The argument that AI models pose no national-security risk requiring oversight is harder to write in a policy memo when the NSA is using the same model to probe commercial software.
What to watch: whether the EO draft survives the voluntary-framework ceiling when labs push back, and whether the next Mythos-level capability jump makes even the current evaluation agreements look too slow for the threat window.