The Same Three Names Now Define AI's Threats, Defenses, and Audit Standards

Zico Kolter's research defined the threat. Matt Fredrikson's company sells the certified defense. The same AI labs being asked to pass safety audits cite Gray Swan on their model cards. And Kolter sits on the OpenAI board that is partly setting the audit standards. That overlap is the story made newly visible by the US Government's recent export-control directive on Anthropic's Mythos, an advanced AI model now under new federal restrictions.

A long, on-record conversation on the Latent Space podcast with Kolter and Fredrikson lays out how AI security is professionalizing around a closed academic–vendor triangle. Kolter is a CMU professor who sits on OpenAI's board of directors and its Safety & Security Committee. Fredrikson is also at CMU and is the CEO of Gray Swan, an AI security firm whose tools and citations have become near-mandatory infrastructure for frontier AI evaluations. Together, they co-authored the canonical academic paper on indirect prompt injection, the attack class in which a model reads hostile instructions hidden in untrusted content (an email, a webpage, a tool result) and follows them as if they came from the user.

That paper does more than describe a vulnerability. It names the threat model that frontier labs now design against. Its authors also run Gray Swan, the company that sells the tools it says labs use to test whether they have defended against that threat: Shade, an adversarial red-teaming product, and Cygnal, a guardrails product positioned against what researcher Simon Willison calls the "Lethal Trifecta" (access to private data, exposure to untrusted content, and the ability to communicate externally). And one of those authors sits on the board of the lab whose safety committee is increasingly functioning as a quasi-regulator for the rest of the industry.

The Mythos directive did not cause this concentration, but it did expose it. When the US Government placed Anthropic's Mythos and Fable under export controls, indirect prompt injection and jailbreaks returned to the front of the AI security conversation, and the industry looked to Gray Swan, as it has been doing. According to Kolter and Fredrikson, Anthropic's Mythos model card cites Gray Swan work directly. Wire coverage splits the story into three separate beats: board governance, an academic paper, and a vendor profile. The loop only becomes visible when all three are held at once.

Kolter and Fredrikson do not pretend the loop is comfortable. On the podcast, they argue that frontier scaling does not automatically make models safer; that traditional cybersecurity, with its perimeter-and-patch logic, is the wrong frame for AI agents; that agents introduce a new exploit class of persistent, multi-step, tool-using attacks that no static guardrail can fully cover; and that specialized red-team models can now beat humans at breaking systems. "Gray swan" is their term for the failure mode they expect next: an unlikely event that everyone can see coming, in contrast to Nassim Taleb's "black swan," which by definition cannot be foreseen. The response Gray Swan describes building is a maturing stack: Shade for adversarial red-teaming, Cygnal for guardrails, the Gray Swan Arena for public attack-and-defend competitions, and an emerging insurance and compliance layer in which a certified clean evaluation becomes a precondition for enterprise procurement and, increasingly, for regulatory permission.

The structural risk is that the discipline cannot easily police its own center. The threat-model author, the audit vendor, and the governance voter are not three different institutions with different incentives. They are the same three names, and whether that triangle produces better defenses or simply faster, more credentialed ones is now an open question for the field. The Mythos export controls did not create the loop. They made it impossible to discuss AI safety certification without naming the people who set the test, sell the remediation, and judge the scorecard.

What to watch next: whether the frontier labs being audited by Gray Swan-adjacent tooling begin disclosing recusal arrangements for Kolter on the OpenAI board when model evaluations authored by his co-authors are under review, and whether the emerging insurance layer treats a clean Gray Swan evaluation as a binding precondition or as one signal among several. The next gray swan will arrive on a timeline the field can already see. The harder question is who certifies the defenses when the people who defined the threat, built the tools, and write the audit rules all work in the same group chat.

Sources