When Microsoft open-sourced RAMPART on May 20, the company framed it as a way to get a clean pass-or-fail signal for AI agent safety, one that drops into continuous integration the same way a regular integration test does. Read the code and the docs and the pass-or-fail story gets more interesting. The default verdict is binary, the widely quoted 80 percent figure is an opt-in example, and the docs contradict each other about what the threshold actually does.
RAMPART is a pytest-native framework layered on PyRIT, Microsoft's existing adversarial-tooling library for AI systems. Its job is to run an agent through adversarial and benign scenarios and report whether the agent behaved safely. The project README describes the result as a Python bool: result.safe. If the agent behaved safely, the assertion passes. If not, the failure message is the human-readable summary. The glossary lists the possible verdicts as SAFE, UNSAFE, UNDETERMINED, and ERROR. There is no aggregate percentage at the top of the report by default.
The 80 percent number comes from the pytest integration guide, and it is doing different work than the blog suggests. The guide documents a @pytest.mark.trial(n=, threshold=) marker for running the same scenario multiple times. The default for threshold is 1.0, meaning 100 percent of trials must be SAFE. The doc then offers @pytest.mark.trial(n=10, threshold=0.8) as a worked example, explaining that the test "should pass at least 80 percent of the time." That is a configurable, per-test knob. It is not the framework's verdict on what counts as a passing agent.
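The arithmetic behind that knob is small enough to write down. The sketch below is not RAMPART code: min_safe_trials and group_passes are hypothetical helper names, and the assumption that only SAFE verdicts count toward the share is taken from the guide's wording, not from the implementation.

```python
import math


def min_safe_trials(n: int, threshold: float = 1.0) -> int:
    """Smallest number of SAFE trials that satisfies the threshold share.

    Hypothetical helper for illustration; not part of RAMPART.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    # round() guards against float noise such as 0.7 * 10 == 7.000000000000001
    return math.ceil(round(n * threshold, 9))


def group_passes(safe_count: int, n: int, threshold: float = 1.0) -> bool:
    """Assumed reading: the group passes when enough trials came back SAFE."""
    return safe_count >= min_safe_trials(n, threshold)


print(min_safe_trials(10))        # 10: the default demands every trial be SAFE
print(min_safe_trials(10, 0.8))   # 8: the worked example tolerates two misses
```

Under this reading, 0.8 is a tolerance the test author chooses for one test. The default leaves no room for a single non-SAFE trial.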
Help Net Security's May 21 coverage is closer to the code. It paraphrases Microsoft as saying RAMPART supports "running the same test multiple times and setting a pass threshold," without claiming 80 percent is a Microsoft position. CyberScoop's same-day piece focuses on CI gating and incident response, and does not mention 80 percent at all. The headline number, in other words, is a worked example, not a Microsoft claim.
Then there is the doc bug. The same pytest integration page defines trial semantics two ways. One section says any UNSAFE result in any trial causes the group to fail. Another says threshold=0.8 "requires at least 80 percent of trials to be SAFE." Those two rules cannot both hold once the threshold drops below 1.0. Take n=10 with eight SAFE trials and two UNSAFE ones: the first rule fails the group, the second passes it. If a single UNSAFE always fails the group, the tolerance for two UNSAFE out of ten can never be used. A reader of the docs alone cannot tell which behavior the framework will actually pick.
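The conflict is easy to reproduce on paper. The sketch below encodes each documented rule as a plain function and applies both to the same ten-trial outcome. The function names are hypothetical and neither is RAMPART's implementation; they only restate the two sentences from the guide.

```python
from typing import Sequence


def any_unsafe_fails(verdicts: Sequence[str]) -> bool:
    """Rule from one section: a single UNSAFE trial fails the group."""
    return "UNSAFE" not in verdicts


def safe_share_meets(verdicts: Sequence[str], threshold: float) -> bool:
    """Rule from the other section: enough trials must be SAFE."""
    safe = sum(1 for v in verdicts if v == "SAFE")
    # Small epsilon so float noise in threshold * n does not flip the result.
    return safe >= threshold * len(verdicts) - 1e-9


# Eight SAFE and two UNSAFE trials, the case threshold=0.8 is meant to allow.
outcome = ["SAFE"] * 8 + ["UNSAFE"] * 2

print(any_unsafe_fails(outcome))        # False: the group fails
print(safe_share_meets(outcome, 0.8))   # True: the group passes
```

The same run fails under one sentence of the guide and passes under the other, which is why the page alone cannot settle what threshold=0.8 does.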
What the codebase actually implements is closer to the "any UNSAFE fails" reading, at least under the default configuration. The default threshold is 1.0, so every trial must come back SAFE. Trials are statistical repetitions, and lowering the threshold is how a developer dials the tolerance down for noisy or stochastic agents. The 0.8 example is the right shape for a noisy LLM-based agent that might flip a behavior once in ten runs. It is not, despite the blog's framing, a built-in "80 percent safe" verdict that the framework hands back.
Why does this matter beyond the bug? Agent-safety scorecards are about to become a procurement signal. Buyers will look for a number, and a number is what they will get, whether the vendor picked 80, 95, or 100. A useful scorecard tells the reader three things up front: what scenarios ran, what counts as a pass under the default configuration, and what the developer is allowed to tune. RAMPART publishes all three if you know where to look, but the May 20 announcement does not surface them. The marketing summary skipped the seam between an opt-in trial marker and the framework's binary verdict, and the doc inconsistency makes that seam hard to read even for someone motivated to look.
Ram Shankar Siva Kumar, who leads the Microsoft AI Red Team and is the on-record source for both Help Net Security and CyberScoop, told reporters the team generates around 100 variants per vulnerability internally. That is the kind of detail that belongs in the announcement. It is the actual scale at which the team itself runs the tool, and it gives a buyer a number with a denominator.
The constructive read is that Microsoft open-sourced the right tool at the right time. Pytest-native integration, polarity-free evaluators, opt-in trial repetition, and a binary default verdict are all sensible choices for a public framework. The critique is the gap between what the announcement says and what the docs and code say. For anyone citing RAMPART, three habits help. Pin the commit or fetch date when you cite the docs, because main moves. Read the pytest integration guide end to end before quoting 80 percent, and notice that the example and the trial-semantics section are saying different things. Ask any vendor's safety scorecard the same three questions: what ran, what is the default pass, and what is tunable. The answers are usually in the repo. They are rarely in the press release.