Discovery vs. exploitation: a benchmark-first audit of Mythos' cyber claims

Anthropic claimed that Claude Mythos Preview demonstrated a leap in cyber capabilities, specifically "a striking ability to spot vulnerabilities and work out ways to exploit them." The company backed this with a $100+ million Project Glasswing initiative to "secure the world's most critical software." A June 11 audit by Timothée Chauvin, Alexander Barry, JSD, and Anson Ho for Epoch AI's Gradient Updates (an informal author-opinion sub-series, not the organization's institutional view) compiles the public evidence and finds a more textured picture: benchmark gains on cyber tasks, modest improvement in the follow-up Claude Mythos 5 release, and peer-model performance that complicates the "massive leap" narrative.

The right way to read the claim is to separate two things coverage tends to fold together. Vulnerability discovery is the act of finding a software flaw, such as locating a buffer overflow in a server's input handling. Exploit development is the separate, harder task of crafting the input or chain of steps that weaponizes that flaw. A model can score well on one without crossing the threshold on the other, and the public benchmarks currently in circulation measure the first more cleanly than the second.

The Epoch AI audit walks through what is actually visible. The Cyber-ECI, an Epoch Capabilities Index variant for cyber tasks, places Mythos Preview roughly seven months ahead of the linear trend observed since early 2025, compared with two to three months ahead for GPT-5.5. Specific benchmarks cited include CVE-Bench, ExploitBench, ExploitGym, Cybench, and UK AISI CTF challenges. On CVE-Bench, Mythos Preview substantially outperformed prior models at developing exploits that achieve code execution without prior knowledge of the vulnerability.
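To make "months ahead of the linear trend" concrete, here is a minimal sketch of one way such a figure could be computed: fit a straight line to earlier models' scores over time, find the date at which that line would reach the new model's score, and subtract the actual release date. The function name, the score scale, and every data point below are invented for illustration. They are not Cyber-ECI values, and the audit's actual methodology may differ.

```python
from datetime import date


def months_ahead_of_trend(history, release, score):
    """Return how many months early `score` arrived relative to a linear trend.

    history: list of (date, score) pairs for earlier models (hypothetical data).
    release: release date of the model being assessed.
    score:   that model's score on the same scale.
    """
    t0 = history[0][0]

    def months(d):
        # Approximate month count since the first observation.
        return (d.year - t0.year) * 12 + (d.month - t0.month) + (d.day - t0.day) / 30.0

    xs = [months(d) for d, _ in history]
    ys = [s for _, s in history]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    # Ordinary least-squares fit: score = intercept + slope * month.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x

    # Month at which the fitted trend would have reached `score`.
    expected_month = (score - intercept) / slope
    return expected_month - months(release)


# Invented example: a trend of one point per month, and a model scoring 122
# at month 15, a level the trend would only reach at month 22.
hypothetical_history = [
    (date(2025, 1, 1), 100.0),
    (date(2025, 7, 1), 106.0),
    (date(2026, 1, 1), 112.0),
]
lead = months_ahead_of_trend(hypothetical_history, date(2026, 4, 1), 122.0)
print(round(lead, 1))  # 7.0
```

A figure like this is sensitive to which earlier models define the trend and to how steep the fitted line is, which is one reason to read "seven months ahead" next to the peer comparison rather than on its own.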

Claude Mythos 5, the follow-up release, improved on cyber benchmarks only modestly, a fact the audit treats as a data point rather than a verdict. Skeptics cited in the piece point out that GPT-5.5 landed roughly on par with Mythos Preview across a range of cyber benchmarks, yet its launch did not produce a documented cyber catastrophe. That observation undercuts the leap-forward framing more than the raw numbers do.

The claim-versus-evidence question gets sharper when you fix the capability under audit. If the claim is that Mythos finds vulnerabilities other models miss, the supporting evidence has to name the models, the benchmarks, and the score deltas. If the claim is that Mythos produces working exploits at a rate or quality no prior model has matched, the public evidence thins out. Most public cyber benchmarks test the discovery side. Exploit development, especially against hardened, real-world targets, is tested less often and reported less consistently, which leaves a gap between a published score and a real attack.

The audit anchors its findings in both benchmark scores and qualitative evidence from companies that participated in Project Glasswing. Mozilla considered Mythos Preview as good as elite security researchers, though it didn't unearth entirely new vulnerability classes. Palo Alto Networks said frontier models accomplished "the equivalent of a full year's worth of penetration testing effort" in under three weeks. Cloudflare and Palo Alto Networks both noted how Mythos Preview could chain low-severity bugs into high-severity exploits. AWS claimed Mythos Preview was better than previous models and helped identify additional hardening opportunities.

But the audit also surfaces contrary evidence. The curl code library—one of the world's most heavily audited codebases, which already used multiple AI scanners—provides a revealing test case. Mythos Preview found one low-severity vulnerability alongside four false positives. The curl lead maintainer's verdict: "I see no evidence that this setup finds issues to any particular higher or more advanced degree than the other tools have done before Mythos." The startup AISLE claims even small open models can recognize several of the vulnerabilities Anthropic showcased from Mythos Preview.
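The curl figures can be turned into a rough precision number, which is the quantity a "lower false positive rate" claim is ultimately about. The sketch below assumes the one low-severity vulnerability and the four false positives were the only findings reported in that run, as the text above describes. The function name is illustrative.

```python
def precision(true_positives, false_positives):
    """Share of reported findings that turned out to be real."""
    reported = true_positives + false_positives
    if reported == 0:
        raise ValueError("no findings reported")
    return true_positives / reported


# curl run as described above: 1 confirmed vulnerability, 4 false positives.
curl_precision = precision(true_positives=1, false_positives=4)
print(f"{curl_precision:.0%}")  # 20%
```

Five findings on a single, unusually well-audited codebase are far too small a sample to generalize from. The arithmetic only shows why a maintainer triaging those reports might come away unimpressed.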

That context reframes what the numbers mean. Vulnerability discovery may have been competitive before Mythos Preview; the practical advantages of Mythos Preview in this domain appear more concentrated in lower false positive rates, better severity assessment, and superior exploit development that helps verify whether a detected weakness is real.

On exploit development specifically, the audit's case strengthens. Anthropic's own analysis found Mythos Preview "much better at developing exploits that allow arbitrary code execution" than prior models, achieving this "even with minimal information about the vulnerabilities." The Cyber-ECI, unsaturated benchmarks like ExploitBench and ExploitGym, and UK AISI evaluations all point to Mythos Preview as a large improvement in exploit development—distinct from the murkier picture on discovery.

The reusable frame has five steps. First, take the company's claim at its strongest literal form, not the version softened by a press cycle. Second, separate the capability type: is the claim about discovery, exploitation, end-to-end attack, or something narrower? Third, identify the benchmark or evaluation that maps to that capability, and read the score on its own terms, including what the task actually measures. Fourth, find the closest peer comparator with a public score on the same benchmark; a model that looks novel in isolation often looks incremental against a matched peer. Fifth, ask whether any independent, non-vendor evidence shows the model producing the claimed outcome against a realistic target, and treat the absence of that evidence as informative rather than as confirmation.
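For readers who want to apply the frame repeatedly, the five steps can be written down as a small checklist structure. This is only a sketch of one possible encoding. The class name, field names, and prompt wording are invented here and do not come from the audit.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ClaimAudit:
    claim: str                                   # step 1: strongest literal form of the claim
    capability: str                              # step 2: discovery, exploitation, end-to-end, ...
    benchmark: Optional[str] = None              # step 3: evaluation that maps to the capability
    peer_comparator: Optional[str] = None        # step 4: closest peer scored on the same benchmark
    independent_evidence: Optional[str] = None   # step 5: non-vendor result on a realistic target

    def open_questions(self) -> List[str]:
        """List the steps that still lack an answer."""
        gaps = []
        if self.benchmark is None:
            gaps.append("step 3: no benchmark mapped to the claimed capability")
        if self.peer_comparator is None:
            gaps.append("step 4: no peer model scored on the same benchmark")
        if self.independent_evidence is None:
            gaps.append("step 5: no independent evidence against a realistic target")
        return gaps


# Usage: start from the claim and capability, then fill in the rest while reading.
audit = ClaimAudit(
    claim="a striking ability to spot vulnerabilities and work out ways to exploit them",
    capability="exploit development",
)
for gap in audit.open_questions():
    print(gap)
```

Keeping unanswered steps as explicit gaps, rather than defaulting them to "supported," mirrors the fifth step: missing evidence is recorded as information, not as confirmation.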

Apply that frame to Mythos and the public evidence currently supports a narrow, defensible reading. Mythos Preview and Mythos 5 are two distinct data points, with the second showing modest cyber-benchmark improvement over the first. GPT-5.5 sits close to Mythos Preview on the same benchmarks, which constrains how large the capability gap can credibly be described. The remaining open question, and the one most likely to determine whether Anthropic's safety framing ages well, is whether public exploit evidence will eventually catch up to the company's claim. Until it does, the cyber capability looks real: incremental in discovery and significant in exploit development. The distance between a benchmark score and a working exploit is where readers should look next.
