BioMedAgent, built on GPT-4o-mini, posts 77% on its own benchmark, beating other GPT-4o-mini agents and topping ChatGPT-4o by 30 points
A team at the Chinese Academy of Sciences has published a paper in Nature Biomedical Engineering describing a multi-agent AI framework that chains bioinformatics tools together using natural language commands — and claims it outperforms GPT-4o on biomedical analysis tasks while running on a smaller, cheaper model. The work is real, peer-reviewed, and open source. The headline number needs context.
The system is called BioMedAgent. It is built on GPT-4o-mini — OpenAI's compact and inexpensive model, not a frontier system — and orchestrates multiple specialized agents to plan, write, and execute bioinformatics workflows. Think of it as a command layer: a user types "run a differential expression analysis on this RNA-seq dataset" and the system decomposes that into tool calls, executes them, and remembers which tool chains worked for similar tasks in the past.
That last part is what the paper calls "self-evolving." The phrase carries weight, but the mechanism is more prosaic than the term suggests. BioMedAgent stores successful tool sequences in a memory store and retrieves them when a new task resembles a previous one, a pattern the BioMedAgent GitHub repository calls Memory Retrieval. It also tries new tool combinations during exploration and keeps what works, which the authors call Interactive Exploration. Neither mechanism changes the underlying model weights. The system gets better at chaining tools, not at thinking.
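The store-and-retrieve idea can be sketched in a few lines of Python. This is an illustration only, not BioMedAgent's implementation: the class name `ToolChainMemory`, the tool names, the word-overlap (Jaccard) similarity, and the 0.5 threshold are all assumptions made for the example, and the real system may match tasks in a very different way.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of words."""
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Word-set overlap, from 0.0 (disjoint) to 1.0 (identical)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


@dataclass
class ToolChainMemory:
    """Hypothetical store of tool sequences that succeeded on past tasks."""

    threshold: float = 0.5  # assumed cutoff for "resembles a previous task"
    entries: List[Tuple[str, List[str]]] = field(default_factory=list)

    def remember(self, task: str, tools: List[str]) -> None:
        """Record a tool sequence that completed the given task."""
        self.entries.append((task, list(tools)))

    def retrieve(self, task: str) -> Optional[List[str]]:
        """Return the tool chain of the most similar past task, if close enough."""
        query = tokenize(task)
        best_score, best_tools = 0.0, None
        for past_task, tools in self.entries:
            score = jaccard(query, tokenize(past_task))
            if score > best_score:
                best_score, best_tools = score, tools
        return best_tools if best_score >= self.threshold else None


memory = ToolChainMemory()
memory.remember(
    "run a differential expression analysis on this RNA-seq dataset",
    ["load_counts", "normalize", "fit_model", "plot_volcano"],  # made-up tool names
)

# A similar request reuses the stored chain; an unrelated one finds nothing.
reused = memory.retrieve("run a differential expression analysis on my RNA-seq data")
unseen = memory.retrieve("cluster single-cell profiles")
print(reused)
print(unseen)
```

Nothing in the sketch touches a model: the only thing that changes over time is the list of remembered tool chains, which is the sense in which the system improves at chaining tools rather than at reasoning.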
The Nature Biomedical Engineering paper (DOI 10.1038/s41551-026-01634-6) was published March 30, 2026, with 22 authors led by Dechao Bu of the Institute of Computing Technology and Runsheng Chen of the Institute of Biophysics, both part of the Chinese Academy of Sciences.
On a benchmark the authors built themselves — called BioMed-AQA, comprising 327 open questions across five categories (omics analysis, precision medicine, machine learning, statistics, and data visualization) — BioMedAgent achieved a 77 percent success rate. It outperformed two unnamed OpenAI web agents and a local agent, each also running on GPT-4o-mini, and it outperformed ChatGPT-4o, which achieved 47 percent on the same test.
That 30-percentage-point gap is the paper's most cited result. It is also the result most in need of scrutiny. The authors built both BioMedAgent and the BioMed-AQA benchmark they used to evaluate it. Self-evaluation is standard practice in machine learning research — benchmarks have to come from somewhere — but it means the 77 percent figure reflects performance on a test designed to measure exactly what the system was built to do. Independent evaluation would put that number in sharper relief.
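For a sense of scale, the reported percentages can be turned into rough question counts. The snippet below is back-of-envelope arithmetic, assuming both rates were computed over all 327 questions; the counts are estimates derived from rounded percentages, not figures taken from the paper.

```python
TOTAL_QUESTIONS = 327      # size of BioMed-AQA as reported
BIOMEDAGENT_RATE = 0.77    # reported success rate
CHATGPT_4O_RATE = 0.47     # reported success rate


def approx_solved(rate: float, total: int) -> int:
    """Estimate solved questions from a rounded success rate."""
    return round(rate * total)


biomedagent_solved = approx_solved(BIOMEDAGENT_RATE, TOTAL_QUESTIONS)
chatgpt_solved = approx_solved(CHATGPT_4O_RATE, TOTAL_QUESTIONS)

# Absolute gap in percentage points versus the relative improvement.
gap_points = round((BIOMEDAGENT_RATE - CHATGPT_4O_RATE) * 100)
relative = BIOMEDAGENT_RATE / CHATGPT_4O_RATE

print(f"BioMedAgent: about {biomedagent_solved} of {TOTAL_QUESTIONS}")
print(f"ChatGPT-4o:  about {chatgpt_solved} of {TOTAL_QUESTIONS}")
print(f"Gap: {gap_points} percentage points, {relative:.2f}x relative")
```

Under that assumption the gap works out to roughly a hundred questions, which is why the design of the benchmark matters as much as the headline percentage.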
One independent benchmark exists. FutureHouse released BixBench in early 2025: 53 analytical scenarios with nearly 300 open-answer questions covering real-world bioinformatics tasks. On BixBench, frontier models achieved roughly 17 percent accuracy in the open-answer setting. The comparison is imperfect, because BixBench and BioMed-AQA measure different things with different methodologies, but 17 percent against 77 percent is a reminder that the benchmark problem in biomedical AI is not solved.
What BioMedAgent demonstrably does is chain tools. It can call bioinformatics libraries like Bioconductor, execute Python scripts, query genomic databases, and generate visualizations, all through natural language. The paper shows example workflows where a user describes a goal and the system builds a multi-step pipeline to achieve it. For a biomedical researcher who knows their science but not their command line, that is a genuine capability improvement over writing scripts by hand or using graphical tools with limited flexibility.
The 77 percent figure was reported on the project's GitHub page and evaluation website at biomed.drai.cn. The full Nature paper is paywalled, but the publication, author list, and DOI were confirmed via Crossref.
BioMedAgent's code and datasets are open. The BioMed-AQA benchmark is on Hugging Face and Zenodo (DOI 10.5281/zenodo.17430550). That openness is real — anyone can run the benchmark, probe the memory system, or try to break the tool chains.
The paper's central contribution is not a new foundation model or a novel architecture. It is a framework for making GPT-4o-mini useful for biomedical data work by giving it tools and a memory. Whether 77 percent is the right number depends entirely on who is holding the ruler.