The British Lab That Found AI Can Hack and Make Weapons Has One Secret: What Counts as Dangerous


The UK AI Safety Institute coaxed a chatbot into handing over a recipe for making anthrax at home.

That result, described in reporting by the New York Times and confirmed by AISI researchers, is not the point of this story. It is the example. What the red team found is that safety guardrails — the filters meant to stop a model from generating dangerous information — can be broken with focused effort, sometimes in hours. What the episode illustrates is the gap between what AISI knows about frontier AI capabilities and what it has chosen to tell the labs it is supposed to be regulating.

AISI has published detailed accounts of its evaluations: which models it tested, what tasks it set them, how well they performed. It open-sourced the software stack, called Inspect AI, that runs the tests. It published a tiered framework that sorts model capabilities into categories from "limited" to "critical." It has produced a 60-page frontier AI trends report and individual evaluations of every major lab's flagship model over the past two years. In all, the institute has assessed more than 30 frontier systems.

What it has not published are the specific capability scores that would tell a lab whether its model is approaching a threshold that matters.

"We test. We evaluate. We find vulnerabilities," the institute said in its principles document. "We share safety insights with the AI developer." What it does not say is what score triggers a consequence, what capability level earns a warning, or what benchmark result would tell a company that its model needs more work before deployment.

The result is a structural information asymmetry. AISI knows where each lab's model sits relative to a line it has not drawn publicly. The labs do not.

"It's the difference between a driving test where you know the passing score and one where the examiner keeps the rubric," said one person familiar with how labs experience the evaluation process, speaking without authorization to discuss internal deliberations. "You drive. You wait. You get a letter that says 'needs attention in cyber.' You do not know what score you needed."

The voluntary nature of AISI's access makes the gap sharper. OpenAI has declined to grant AISI direct access to its models for evaluation. Google, which has provided some access, has not received published scoring thresholds. Anthropic gave AISI early access to Claude Mythos Preview — the institute was the only non-American government organization to receive that access — but what it reported back to Anthropic is not public.

AISI's own public artifacts suggest the stakes are not abstract. In its evaluation of Claude Mythos Preview, the institute found the model could complete, at a 73 percent success rate, expert-level cybersecurity challenges that no AI system could previously solve. It could execute a simulated 32-step corporate network attack from start to finish in 3 out of 10 attempts, the first model to solve that particular test at all. On cyber capability tasks that early 2024 models completed 10 percent of the time, current frontier models now succeed roughly 50 percent of the time, and the time horizon for reaching 80 percent reliability has been doubling every 4.7 months since late 2024.
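
To give a sense of what a 4.7-month doubling period implies, the arithmetic can be sketched in a few lines of Python. This is an illustration only: it assumes the trend is a clean exponential that continues unchanged, which the reported figure does not guarantee, and the function name `growth_factor` is made up for this example.

```python
# Illustrative arithmetic only. Assumes steady exponential growth at the
# doubling period quoted in the article; real trends may bend or stall.

DOUBLING_MONTHS = 4.7  # doubling period reported for the time horizon


def growth_factor(months: float, doubling_months: float = DOUBLING_MONTHS) -> float:
    """Return the multiplier on the time horizon after `months` have elapsed."""
    return 2 ** (months / doubling_months)


if __name__ == "__main__":
    # Print the implied multiplier over one doubling period, one year, two years.
    for months in (4.7, 12, 24):
        print(f"{months:>5} months -> x{growth_factor(months):.1f}")
```

Under that assumption, a 4.7-month doubling period compounds to roughly a sixfold increase over 12 months.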

In its frontier AI trends report, AISI said it had found vulnerabilities in every system it tested. Self-replication success rates rose from under 5 percent to over 60 percent in two years. The institute's own researchers noted in a May paper that as AI systems become more capable and more autonomous, oversight mechanisms that worked at earlier capability levels may stop working. "The thing that keeps me up at night," said Jade Leung, AISI's chief technology officer, "is the relative speed of AI technology compared to the institutions like governments that have to respond."

AISI has pushed back on the idea that it should publish specific thresholds. The argument, made in its published principles and echoed in testimony before UK parliamentary committees, is that publishing a passing score would let labs optimize for the test rather than build genuinely safer systems. A lab that knows "score above 70 on this cyber scale and you are good to deploy" might train specifically on that benchmark rather than address the underlying capability that makes the risk real. The institute's methodology is public. Its specific scoring criteria are not.

That reasoning has real support among some evaluation researchers. Benchmarks that become targets tend to stop measuring what they originally measured. Publishing rubrics invites gaming in a field where the stakes are national security.

But the counterargument is also sharp. Labs that do not know what AISI is looking for cannot prioritize safety work that would actually reduce dangerous capabilities. They submit blind, invest in whatever they guess matters, and hope the letter that comes back says "acceptable." Some labs have told associates they would welcome more specificity — that the uncertainty makes resource allocation harder, not easier. The institute's voluntary regime also means the labs most confident in their safety posture may be most willing to submit, while the labs whose safety cases are weakest — the ones AISI most needs to evaluate — decline access entirely.

UK parliamentarians have begun asking questions. In a February session, members pressed AISI officials on whether the government was considering mandatory disclosure of evaluation results and whether thresholds would eventually be codified into regulation rather than maintained as informal guidance. AISI's written response noted that the current voluntary approach preserved "frank dialogue" with labs and that publishing scoring criteria too early in the science would "stifle legitimate safety research." The government's formal position, as stated in its AI Opportunities Action Plan, supports AISI's continued independence but has not addressed the publication question directly.

What to watch next is whether the UK moves to formalize any of this. The current Labour government has shown more appetite than its predecessor for statutory backing for AI oversight, though no legislation mandating AISI threshold publication has been introduced. In the US, the parallel AI Safety Institute — funded at roughly $10 million this year against AISI's approximately £360 million (about $480 million) — operates under the same voluntary framework and has made the same methodological transparency choices. If one government publishes thresholds and the other does not, the difference in how frontier labs allocate safety resources could become its own competitive factor.

The anthrax recipe that AISI extracted from a chatbot is not the story. The story is that the institute extracted it knowing exactly what capability threshold it represented, and then filed that knowledge where the public cannot see it — not in a classified vault, not under executive privilege, but in a published evaluation that deliberately omits the number that would give the reader context for what the result means.

AISI knows what the passing grade is. The labs do not. That gap is where the accountability problem lives.

