A new peer-reviewed RAND assessment finds that large language model agents, the same kind of software systems that book travel and write code, can now perform the opening moves of biological tool use on their own. They can select a piece of lab equipment, pick the right software, and follow a protocol far enough to do real work. The seven agents the study tested did not finish the job. They did not have to. The notable finding is that the first step, the one that used to require a graduate-level bench course or a wet-lab internship, is the one that just got easier, and the field has not yet built a test suite to catch the shift.
The report, titled "Can LLM Agents Select and Engage with Biological Tools? An Initial Biosecurity Assessment" and published on June 25, 2026, is the kind of structured evaluation that biosecurity researchers have been asking the industry for. RAND, the US policy research organization best known for Cold War strategy work, tested seven agents on a curated set of biological tools: wet-lab instruments, lab software, and step-by-step protocols. The agents varied in how far they got. None of them demonstrated the kind of end-to-end capability that would worry a virologist. Several of them cleared the entry bar, and that is what the authors flagged.
The framing matters. "Biological tools" in the report covers the routine apparatus of a working life-science lab: centrifuges, plate readers, PCR machines, freezer inventories, lab information management systems, and the protocols that tell a technician what order to do things in. It does not mean pathogens. The worry the authors raise is not that an agent can synthesize a threat agent from a prompt. The worry is that an agent can do the first twenty percent of a real lab workflow, the part that has historically screened out people who do not have the training to keep going.
That distinction is what makes the 78-page report useful for defenders. The agents succeeded at the steps any competent technician would handle in week one of a new job: choosing the right instrument for a given task, navigating the software that controls it, and reading a protocol to the point of running the first reaction. They were less reliable at the parts that require a working scientist, including troubleshooting a failed run, interpreting unexpected results, and adjusting a protocol when the obvious move does not work. The expertise floor dropped at the bottom of the staircase, not the top.
For policymakers, evaluators, and the labs that run safety programs, that asymmetry is the actionable finding. It suggests a new category of test is needed: not "can the agent build a threat," which is the wrong question, but "can the agent get past the first checkpoint of a real workflow without help." The authors recommend targeted, structured testing of agent design capabilities, which is a deliberately narrow ask. They are not calling for a moratorium on agent development or for new export controls. They are calling for measurements, the kind that let a defender see the capability move before the misuse does.
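One way to picture the measurement being asked for is a stage-by-stage scorecard rather than a single pass/fail grade. The sketch below is illustrative only and is not the report's methodology: the stage names, the 0.8 threshold, the function names, and the trial records are all hypothetical placeholders, and the data is made up to show the shape of the calculation.

```python
from collections import defaultdict

# Hypothetical workflow stages, ordered from entry-level to expert-level.
# These names are assumptions for illustration, not taken from the report.
STAGES = [
    "select_tool",
    "operate_software",
    "start_protocol",
    "troubleshoot",
    "interpret_results",
]
ENTRY_STAGES = STAGES[:3]


def stage_pass_rates(trials):
    """Return {agent: {stage: pass_rate}} from recorded trial outcomes.

    Each trial is a dict: {"agent": str, "stage": str, "passed": bool}.
    """
    # counts[agent][stage] = [number passed, number attempted]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for trial in trials:
        cell = counts[trial["agent"]][trial["stage"]]
        cell[0] += int(trial["passed"])
        cell[1] += 1
    return {
        agent: {s: by_stage[s][0] / by_stage[s][1] for s in STAGES if s in by_stage}
        for agent, by_stage in counts.items()
    }


def clears_entry_bar(agent_rates, threshold=0.8):
    """True if every entry-level stage meets the assumed pass threshold."""
    return all(agent_rates.get(s, 0.0) >= threshold for s in ENTRY_STAGES)


# Made-up placeholder outcomes, not results from the report.
trials = [
    {"agent": "agent_a", "stage": "select_tool", "passed": True},
    {"agent": "agent_a", "stage": "select_tool", "passed": True},
    {"agent": "agent_a", "stage": "operate_software", "passed": True},
    {"agent": "agent_a", "stage": "start_protocol", "passed": True},
    {"agent": "agent_a", "stage": "troubleshoot", "passed": False},
    {"agent": "agent_b", "stage": "select_tool", "passed": True},
    {"agent": "agent_b", "stage": "select_tool", "passed": False},
    {"agent": "agent_b", "stage": "operate_software", "passed": True},
    {"agent": "agent_b", "stage": "start_protocol", "passed": True},
]

rates = stage_pass_rates(trials)
for agent, agent_rates in sorted(rates.items()):
    print(agent, agent_rates, "clears entry bar:", clears_entry_bar(agent_rates))
```

Keeping the entry-level stages separate from the later ones is the point of the design: an agent that fails troubleshooting can still clear the first checkpoint, and a single end-to-end score would hide exactly that.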
The risk this kind of study tries to head off is that the entry-level part of biology gets quietly automated away. A graduate student used to spend a year learning which knob to turn and which software to open. An agent now spends a few minutes. If the testing infrastructure does not catch up, the bottleneck for early-stage misuse shifts from "does the actor know enough to start" to "does the actor have enough money to start." That is a real shift, and one that a careful evaluation can expose before it becomes a problem.
There are honest limits to the finding. The report covers seven agents at one moment in time, on tasks the authors chose, with success measured against benchmarks the field is still working out. Capability moves fast. A test that flags the entry-level risk today will not flag the mid-tier risk next year without an update. The authors are careful to say so, and the report is structured to be repeatable rather than definitive. It is a baseline, not a verdict.
The next thing to watch is whether the agent developers, the model evaluators, and the biosecurity community adopt the structured testing the report recommends. If a shared red-team norm emerges, comparable to the standards that grew up around image generators and cyber tools, the expertise-floor problem becomes measurable and contestable. If it does not, the gap between what an agent can do at step one and what a defender is ready to catch at step one widens in silence.