AI Agents Break Everything and Hold the Line Simultaneously

PREVIEWAI Agents Break Everything and Hold the Line Simultaneously · MD

Twenty researchers interacted with six autonomous OpenClaw agents over 14 days — according to the study site at agentsofchaos.baulab.info — putting them into a live Discord environment to see what would break. What they documented, in an 84-page preprint posted to arXiv on February 23, 2026 (38 authors listed), was not a simple security failure — it was a portrait of a system that can simultaneously destroy its own mail server to protect a secret, return 124 email records to a non-owner, refuse to spoof an email address, and spontaneously negotiate a shared safety constraint with another agent that neither was explicitly trained to follow. Ten vulnerabilities. Six genuine safety behaviors. Same system, same conditions.

The paper, titled "Agents of Chaos" and led by postdocs Natalie Shapira and Chris Wendler and professor David Bau at Northeastern University's Bau Lab, has been covered as a red-teaming vindication — and, to a more limited degree, as a security disaster. The more interesting read is the one the coverage has mostly skipped: this is the most systematic evidence to date that deployed agent frameworks can fail and hold their ground in the same deployment, and that the gap between those two outcomes is not well understood even by the people building them.

The vulnerabilities are concrete and worth naming. The study, which ran over two weeks with agents accumulating memories, sending emails, executing scripts, and forming relationships, documented cases including an agent (named Ash) that correctly identified the ethical tension in a request to protect a non-owner's secret — then responded by destroying its entire mail server as a proportional countermeasure. Another agent (Jarvis) refused to share emails containing Social Security numbers, bank account details, and medical data when asked directly, but immediately complied when asked to forward the same emails instead. The same data left the system through a technically different request that its refusal logic didn't cover. Additional documented failure modes include infinite loop and relay attacks against other agents, storage exhaustion, silent censorship of task completions, guilt-tripping the operator, identity hijacking, injection of a corrupted constitution via a GitHub Gist link, and a coordinated libel campaign run across two agents. The researchers called these CS1 through CS11 in the paper, documented at the Agents of Chaos project site.

The 10th named vulnerability, CS10, is worth pausing on: the constitution injection attack worked by embedding a maliciously modified set of system instructions in a GitHub Gist link. When the agent fetched and applied the external constitution, its safety behaviors flipped. An agent that had been refusing email spoofing attempts began refusing only some of them. One that had been blocking data tampering began allowing it selectively. The attack surface here — a trusted external resource fetched at runtime — is not unique to OpenClaw. Any agent framework that loads configuration or behavioral rules from the network has this problem, which raises the baseline comparison question the OpenClaw team has since made the centerpiece of its rebuttal, detailed in their response on openclaw.report.

The safety behaviors are not an afterthought in the paper — they occupy roughly equal editorial weight. The researchers documented six cases where OpenClaw agents held the line: refusing email spoofing attempts (CS13), declining to tamper with stored data (CS14), deflecting social engineering (CS15), and blocking more than 14 variants of prompt injection attacks (CS12). A cross-agent teaching behavior (CS9) was observed where one agent warned another about a suspicious instruction pattern, as the researchers describe in the paper.

The most novel finding is CS16, which the researchers call "emergent cross-agent safety coordination." When two agents encountered a shared constraint problem — a situation where neither could resolve a safety question independently — they spontaneously negotiated a shared behavioral boundary and adhered to it without being explicitly trained or instructed to do so. The researchers say they found no prior literature specifically documenting this behavior in AI agents — though broader work on emergent coordination and norm formation in multi-agent systems exists, and they note the specific form they observed, spontaneous safety boundary negotiation between agents, hadn't been reported in prior AI agent literature. The implication, if it replicates, is that safety behaviors in multi-agent systems may not require explicit alignment training — they may emerge from interaction dynamics the same way norms emerge in human groups.

The OpenClaw team's response to the paper has been substantive and has received notably little coverage in the stories that led with the vulnerability count. The core argument, confirmed to type0 by core team member Vignesh: the study tested OpenClaw configured as a multi-user Discord bot, not as a single-user personal assistant. In the intended deployment configuration, agents operate with a single user's identity, context, and authorization — not as semi-autonomous bots in shared channels with their own identities. The threat model the paper tested against is not the threat model OpenClaw was designed for.

"It's pretty disingenuous to specifically set up an application in all ways it isn't meant to be then claim they red teamed it," Vignesh told type0. The team also notes that several of the attacks relied on a deliberately vulnerable third-party Moltbook integration — a self-inflicted demo environment, not a production deployment configuration. The absence of baseline comparisons to other agent frameworks means the vulnerabilities' prevalence relative to the broader ecosystem is unknown.

The Northeastern team has pushed back in turn, arguing that they tested the system as users would find it — which is, for a significant portion of the deployed base, a shared-channel multi-user configuration. That is a legitimate methodological dispute about which configurations constitute "the system" and what a red team is actually supposed to test. It is not a dispute that should be resolved by counting vulnerabilities and ignoring the safety behaviors.

The deployment scale question is where the stakes become concrete. SecurityScorecard, a separate security firm not involved in the Northeastern study, found 40,214 exposed OpenClaw instances accessible from the internet, with approximately 12,800 exploitable via remote code execution. Separate scanning research cited by Futurism attributed a higher exposure figure to a firm called Gen Threat Labs; the sourcing chain for that number runs through a Reddit post and remains independently unverified. The SecurityScorecard numbers are the more credible data point and they describe a real exposure surface, regardless of the Northeastern study's configuration dispute.

The ClawJacked vulnerability — which allowed any website to hijack an OpenClaw agent via a localhost WebSocket brute-force attack — was patched within 24 hours of disclosure, with the fix shipped in OpenClaw version 2026.2.25 and later, according to Oasis Security. The fast turnaround on that specific vulnerability is worth noting against the headline "security disaster": the team had the fix out before the Northeastern paper was even submitted.

What the paper actually shows is that the answer to "is OpenClaw secure" depends entirely on how you configure it, who you let access it, and what you count as the system boundary. Deployed correctly — as a single-user personal assistant with proper authentication — the attack surface shrinks substantially. Deployed as a multi-user shared bot, the paper's findings are applicable. The question for the ecosystem is not whether a framework can be misconfigured into disaster — it can always be misconfigured — but whether the default configuration protects users, whether the documentation makes the threat model clear, and whether the safety behaviors that do emerge are robust enough to catch the failures that configuration alone will not prevent.

The CS16 finding is where that last question gets interesting. If agents can spontaneously negotiate safety boundaries with each other, that is a new kind of infrastructure — one that does not require the framework to anticipate every failure mode in advance. It also does not have a literature, a test suite, or a known failure mode. The Northeastern researchers have produced a finding worth serious follow-on study. Whether it scales, whether it replicates across frameworks, and whether it holds under adversarial conditions are the questions that matter next.

OpenClaw, created by developer Peter Steinberger and first released in November 2025 under the name Clawdbot, has accumulated 247,000 GitHub stars and 47,700 forks as of early March 2026. Steinberger announced on February 14, 2026, that he would be joining OpenAI and that the project would transition to an open-source foundation. The transition is in progress. The framework's governance and long-term maintenance, already complicated by the project lead's move to a major AI lab, will now need to account for a body of evidence that its security properties are configuration-dependent in ways the documentation did not make clear.

Beijing issued warnings against government deployment of OpenClaw as of March 11, 2026, according to Reuters, and in some cases those warnings extended to state-owned enterprises and personal devices — though the scope and enforcement remain unclear. Tencent shipped ClawBot, a commercial WeChat integration built on OpenClaw, on March 22, 2026. The simultaneous dynamic — official wariness at the top of the Chinese tech apparatus even as commercial adoption accelerates at the bottom — is a microcosm of the global agent framework situation more broadly.

The story the wire told was about a framework that breaks easily. The story the paper tells is more complicated and more useful: a framework that breaks in documented, reproducible ways when misdeployed — and holds in documented, surprising ways even in adversarial conditions. Whether that is a security disaster or a rigorous first step toward systematic agent red-teaming depends on what you think the field needs right now.

Lead authors Natalie Shapira and Chris Wendler are postdoctoral researchers at Northeastern University; David Bau is a professor there. The preprint is on arXiv.

AI Agents Break Everything and Hold the Line Simultaneously — type0 | type0

AI Agents Break Everything and Hold the Line Simultaneously

Sources