The Government Said GPT-5.5 Could Hack Like an Expert. Then It Couldn't Finish Checking If the Safety Fixes Work
When the UK AI Security Institute published its evaluation of OpenAI's GPT-5.5 on Thursday, the headline numbers were striking enough: a 71.4 percent average pass rate on expert-level cybersecurity tasks, putting the model within the margin of error of Anthropic's unreleased Mythos Preview, the only other AI system to complete a simulated corporate network attack from scratch. A reverse-engineering challenge that took an expert human twelve hours was solved in ten minutes and twenty-two seconds, for one dollar and seventy-three cents in API costs.
But buried in the same report was a different, quieter finding, and it is the one that matters most.
The UK AI Security Institute, the body Parliament created to evaluate frontier models before public deployment, identified a universal jailbreak of GPT-5.5's cyber safeguards within six hours of expert red-teaming. OpenAI subsequently issued patches. Then AISI hit what it called a "configuration issue" in the version it was given and could not verify whether the final safeguards actually worked before the model went live to paying subscribers on April 23rd.
OpenAI shipped the model anyway.
"We have no way to know if GPT-5.5 is actually safe to release," wrote Shakeel Hashim at Transformer News. "All we have to rely on is OpenAI's word."
The gap between what a government evaluator found and what it could confirm is the story. Every other outlet will lead with the benchmark numbers. Those numbers are real, but they are also the safe part to write. The verification failure is the part that matters.
The benchmark picture is genuinely notable. AISI rates GPT-5.5 as potentially the strongest model it has tested on narrow cyber tasks, edging out Mythos Preview on expert-level work (71.4 percent versus 68.6 percent, within the error margin). On The Last Ones, a 32-step corporate network attack simulation AISI built with SpecterOps to mirror real intrusion kill chains, GPT-5.5 completed the full chain in two of ten attempts. Mythos Preview managed three of ten. No other model had solved it at all until those two.
The rust_vm challenge offers the sharpest illustration of what the capability means in practice. The task: a stripped Rust binary implementing a custom virtual machine, paired with bytecode in an unknown format guarding a port-8080 authentication service. Crystal Peak, the cybersecurity firm that built the task, estimates that its expert playtester needed twelve hours using Binary Ninja, gdb, Python, and an SMT solver. GPT-5.5 cracked it autonomously in ten minutes. AISI describes the model's solve in technical detail, including how it recovered the instruction set architecture from relocation entries when the jump table addresses were zeroed out in the PIE binary, built a Python emulator to validate register state, and reverse-engineered the three-table checksum chain to recover the password. The work is real and the work is not trivial.
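For readers who have never stared at an unknown bytecode format, the sketch below gives a rough sense, in miniature, of what a bytecode emulator and a chained table checksum look like. Everything in it, the opcodes, the tables, and the checksum scheme, is invented for illustration; AISI has not published the actual rust_vm internals, and this code bears no relation to them.

```python
# A toy illustration only: the opcodes, tables, and checksum scheme below are
# invented for this sketch and are not taken from the actual rust_vm binary.

# Hypothetical 3-byte instruction format: [opcode, operand_a, operand_b]
LOAD_IMM, XOR_REG, ADD_TABLE, HALT = 0x01, 0x02, 0x03, 0xFF

# Three invented lookup tables standing in for a "three-table checksum chain".
TABLE_1 = [(i * 7 + 3) % 256 for i in range(256)]
TABLE_2 = [i ^ 0x5A for i in range(256)]
TABLE_3 = [(i * 13 + 1) % 256 for i in range(256)]


def run_vm(bytecode, registers=None):
    """Execute the toy bytecode and return the final register state."""
    regs = registers or [0] * 4
    pc = 0
    while pc < len(bytecode):
        op, a, b = bytecode[pc], bytecode[pc + 1], bytecode[pc + 2]
        if op == LOAD_IMM:        # regs[a] = immediate b
            regs[a] = b
        elif op == XOR_REG:       # regs[a] ^= regs[b]
            regs[a] ^= regs[b]
        elif op == ADD_TABLE:     # regs[a] = TABLE_b[regs[a]]
            table = (TABLE_1, TABLE_2, TABLE_3)[b]
            regs[a] = table[regs[a] % 256]
        elif op == HALT:
            break
        pc += 3
    return regs


def checksum(password: bytes) -> int:
    """Chain each byte through the three tables, folding into one value."""
    acc = 0
    for byte in password:
        acc = TABLE_3[TABLE_2[TABLE_1[(acc ^ byte) % 256]]]
    return acc


if __name__ == "__main__":
    # Validate the emulator against a tiny hand-assembled program.
    program = bytes([LOAD_IMM, 0, 0x41, ADD_TABLE, 0, 1, HALT, 0, 0])
    print("register state:", run_vm(program))

    # Brute-force a single password byte that satisfies an invented target
    # checksum, the same shape of search an SMT solver performs symbolically.
    target = checksum(b"A")
    matches = [bytes([b]) for b in range(256) if checksum(bytes([b])) == target]
    print("bytes matching target checksum:", matches)
```

The real task is orders of magnitude harder, not least because the instruction set itself had to be recovered from a stripped binary before any emulator could be written, but the shape of the work is the same: reconstruct the machine, then reason backwards through its checks.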
AISI frames the result as part of a broader pattern rather than an isolated breakthrough: "If cyber-offensive skill is emerging as a byproduct of more general improvements in long-horizon autonomy, reasoning, and coding," the institute wrote, "we should expect further increases in cyber capability from models in the near future, potentially in quick succession."
The comparison to Anthropic's approach is instructive, if uncomfortable for OpenAI. Anthropic evaluated Mythos internally, judged it too dangerous for public release, and kept it locked down. OpenAI found essentially equivalent cyber capability and released GPT-5.5 commercially to Plus, Pro, Business, and Enterprise users. The inconsistency in safety posture between the two leading labs is now visible on the public record, and neither company nor regulator has a clear framework for who decides which result warrants containment and which warrants release.
AISI itself notes that it was evaluating an early checkpoint of GPT-5.5, that its testing is scoped to what an agent could do when directed toward specific vulnerable targets where it already has network access, and that the evaluation does not account for active defenders or hardened production environments. The model is powerful. The exact threat it poses in the wild remains uncertain.
What is not uncertain is that the oversight mechanism designed to catch exactly this situation produced an incomplete answer and the model shipped anyway. OpenAI runs its own safety evaluations, publishes a system card, and invites third-party red-teaming. That process found a serious problem. It then failed to confirm the fix before deployment. The system worked exactly as designed, which is the problem.
The Trusted Access Programme, through which OpenAI provides elevated API access to vetted partners, is not described in detail in any public document. The company announced a public bug bounty for universal jailbreaks alongside the GPT-5.5 release. It is not clear whether any submissions have been received or evaluated.
OpenAI declined to comment on the configuration issue AISI cited or whether it would provide a corrected version for re-testing.
GPT-5.5 is real, powerful, and commercially deployed. Whether its safeguards are actually effective against the attacks they are designed to block remains, at least in part, a matter of faith.