An autonomous network defender that maximizes a security score while torching the network's uptime, false-positive, and change-management budgets is not a defender. It is a liability. That is the operational risk a new arXiv preprint on the CAGE Challenge 4 benchmark sets out to quantify, and the lever it pulls is a simple one: stop letting the AI chase a reward and start making it honor an explicit contract.
The benchmark, CAGE Challenge 4, simulates a Security Operations Centre (the team that triages and contains network incidents) under steady attack. Defenders are scored on Mean Time to Recover, the false-positive rate of their automated responses, and the disruption their firewall changes cause. The paper, by Jose Luis Lima de Jesus Silva, frames the core problem plainly: multi-agent reinforcement learning, where several AI defenders share a policy learned by trial and error in simulation, can lift a security reward while remaining non-deployable in a real operations centre.
The headline number is the cleanest evidence. Unconstrained learning methods (agents trained only to maximize the security reward) breached the 50-minute downtime budget in 100% of test episodes, running 311 to 430 minutes of recovery time per incident. The fix proposed in the preprint is to treat safety not as a penalty tucked inside the reward function but as a contract surface. A constrained variant, C-MAPPO-GAT, wraps a standard multi-agent algorithm in explicit budgets for downtime, false positives, and change management, and adds a screen that rejects any action that would exceed them. The result, per the paper: budget violations fall from 100% to 0.3%, and mean downtime cost drops from 355.4 to 15.5.
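To make the "screen" idea concrete, here is a minimal Python sketch of a budget-aware action filter. It is not the paper's implementation: the `Cost` and `Budgets` types, the `screen_action` function, the action names, the predicted costs, and the false-positive and change budgets are all hypothetical, chosen only for illustration. Only the 50-minute downtime budget comes from the reported setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cost:
    """Operational cost of an action, or the running total spent so far."""
    downtime_min: float = 0.0
    false_positives: float = 0.0
    changes: float = 0.0

    def __add__(self, other: "Cost") -> "Cost":
        return Cost(
            self.downtime_min + other.downtime_min,
            self.false_positives + other.false_positives,
            self.changes + other.changes,
        )


@dataclass(frozen=True)
class Budgets:
    """Explicit per-episode limits (the 'contract')."""
    downtime_min: float
    false_positives: float
    changes: float


def within(spent: Cost, budgets: Budgets) -> bool:
    """True only if every budget is still respected."""
    return (
        spent.downtime_min <= budgets.downtime_min
        and spent.false_positives <= budgets.false_positives
        and spent.changes <= budgets.changes
    )


def screen_action(ranked_actions, predicted_cost, spent, budgets, fallback="monitor"):
    """Return the policy's most-preferred action that keeps the cumulative
    spend inside every budget; otherwise return the fallback action."""
    for action in ranked_actions:
        if within(spent + predicted_cost[action], budgets):
            return action
    return fallback


# Hypothetical episode state: 45 of 50 downtime minutes already spent.
budgets = Budgets(downtime_min=50, false_positives=5, changes=3)
spent = Cost(downtime_min=45, false_positives=1, changes=1)
predicted_cost = {
    "isolate_host": Cost(downtime_min=10, changes=1),
    "block_traffic": Cost(downtime_min=3, false_positives=1, changes=1),
    "monitor": Cost(),
}
ranked = ["isolate_host", "block_traffic", "monitor"]  # policy's preference order
chosen = screen_action(ranked, predicted_cost, spent, budgets)
print(chosen)
```

In this toy episode the policy's first choice would push downtime to 55 minutes, so the screen passes over it and selects the next action that fits. The point of the design is that the budget check sits outside the learned policy, so the limits hold regardless of what the reward function taught the agent to prefer.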
A second instantiation, ACD³-GAT, layers in tail-risk awareness, a model of the opponent's likely next move, and a graph-based counterfactual risk propagation pass that estimates how an action would look under plausible attacker adaptations. Its reported result is 13.8% budget violation at a mean downtime cost of 48.2. The paper is explicit about what that number means: ACD³-GAT sits on the safety-contract frontier, not at the most conservative compliance point. A risk-averse operator can choose the tighter option; an operator willing to trade more downtime for the higher security payoff can dial ACD³-GAT up. That dial is the point.
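The dial can be pictured as a lookup over reported operating points. The sketch below is illustrative only: the `FrontierPoint` type and `pick_operating_point` function are hypothetical, and the selection rule assumes, following the article's framing rather than any stated formula, that a higher tolerated violation rate buys a higher security payoff. The violation rates and downtime costs are the figures reported for the two constrained variants.

```python
from typing import NamedTuple, Optional, Sequence


class FrontierPoint(NamedTuple):
    name: str
    violation_rate: float       # fraction of episodes breaching a budget
    mean_downtime_cost: float   # as reported in the preprint


# Figures quoted in the article for the two constrained variants.
REPORTED = [
    FrontierPoint("C-MAPPO-GAT", 0.003, 15.5),
    FrontierPoint("ACD³-GAT", 0.138, 48.2),
]


def pick_operating_point(
    points: Sequence[FrontierPoint], max_violation_rate: float
) -> Optional[FrontierPoint]:
    """Return the least conservative point that still meets the operator's
    tolerance, or None if no point is compliant.

    Assumption: along the frontier, more tolerated risk means more security
    payoff, so the compliant point with the highest violation rate is preferred.
    """
    compliant = [p for p in points if p.violation_rate <= max_violation_rate]
    if not compliant:
        return None
    return max(compliant, key=lambda p: p.violation_rate)


risk_averse = pick_operating_point(REPORTED, max_violation_rate=0.01)
risk_tolerant = pick_operating_point(REPORTED, max_violation_rate=0.15)
print(risk_averse.name, risk_tolerant.name)
```

An operator who tolerates at most 1% violations lands on the tighter variant; one who accepts up to 15% lands on ACD³-GAT. A tolerance below 0.3% returns `None`, meaning neither reported configuration satisfies that contract.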
The caveats belong on the page. CAGE Challenge 4 is a simulator, not a live operations centre, and the reported numbers come from a single-author preprint rather than a peer-reviewed venue. Three seeds and a stress-test suite are not the same as a production deployment under alert-fatigue conditions, and no third-party security operations vendor has independently validated the result. The paper's own framing matters here: it is arguing that reward-only multi-agent learning is non-deployable as written, and proposing a path that might be, not claiming it already is.
What travels beyond network security is the architecture. The safety-contract idea (explicit operational budgets, counterfactual action screening, and a position on the safety frontier that the operator can dial) is not specific to network defense. It is a template for any multi-agent system that will eventually be trusted to spend real-world budgets: grid balancing, warehouse robotics, clinical workflow automation. The 100% violation number is the cautionary hook. The 0.3% number is the constructive one. The work between them is what makes an autonomous agent worth putting on call.