Open-Source Framework Outperforms GPT-4o on Biomedical Tasks
The model that costs fractions of a cent is beating the one that costs $15 per query. Researchers want to know how.


Researchers at the Chinese Academy of Sciences published BioMedAgent, an open-source multi-agent framework built on GPT-4o-mini that orchestrates bioinformatics tools via natural language commands. On a self-created benchmark (BioMed-AQA), it achieved 77% versus ChatGPT-4o's 47%, though this 30-point gap warrants scrutiny since the authors built both the system and the test. The system's "self-evolving" capability is limited to storing successful tool chains in memory rather than altering underlying model weights.
BioMedAgent, built on GPT-4o-mini, posts 77% on its own benchmark — outpacing other mini-model agents and beating ChatGPT-4o by 30 points
A team at the Chinese Academy of Sciences has published a paper in Nature Biomedical Engineering describing a multi-agent AI framework that chains bioinformatics tools together using natural language commands — and claims it outperforms GPT-4o on biomedical analysis tasks while running on a smaller, cheaper model. The work is real, peer-reviewed, and open source. The headline number needs context.
The system is called BioMedAgent. It is built on GPT-4o-mini — OpenAI's compact and inexpensive model, not a frontier system — and orchestrates multiple specialized agents to plan, write, and execute bioinformatics workflows. Think of it as a command layer: a user types "run a differential expression analysis on this RNA-seq dataset" and the system decomposes that into tool calls, executes them, and remembers which tool chains worked for similar tasks in the past.
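The "command layer" idea can be made concrete with a toy planner. A minimal sketch, assuming nothing about BioMedAgent's actual code: every name, tool, and pipeline below is a hypothetical illustration of mapping a natural-language request onto an ordered list of tool calls.

```python
# Toy command layer: turn a natural-language request into an ordered
# list of tool calls. All tool names and arguments are hypothetical
# illustrations, not BioMedAgent's actual API.
from dataclasses import dataclass

@dataclass
class ToolCall:
    tool: str
    args: dict

def plan(request: str) -> list[ToolCall]:
    """Toy planner: keyword-match the request to a canned pipeline."""
    if "differential expression" in request.lower():
        return [
            ToolCall("load_counts", {"path": "counts.tsv"}),
            ToolCall("normalize", {"method": "TMM"}),
            ToolCall("fit_model", {"design": "~condition"}),
            ToolCall("test_contrast", {"contrast": "treated_vs_control"}),
            ToolCall("plot_volcano", {"out": "volcano.png"}),
        ]
    raise ValueError(f"no pipeline known for: {request!r}")

steps = plan("Run a differential expression analysis on this RNA-seq dataset")
print([s.tool for s in steps])
```

In the real system an LLM produces the plan rather than keyword rules, but the output is the same shape: a sequence of tool invocations the executor can run and the memory can later replay.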
That last part is what the paper calls "self-evolving." The phrase carries weight, but the mechanism is more prosaic than the term suggests. BioMedAgent stores successful tool sequences in a memory store and retrieves them when a new task resembles a previous one — a pattern the BioMedAgent GitHub calls Memory Retrieval. It also tries new tool combinations during exploration and keeps what works, which the authors call Interactive Exploration. Neither algorithm changes the underlying model weights. The system gets better at chaining tools, not at thinking.
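The Memory Retrieval pattern is simple enough to sketch in a few lines. This is a toy stand-in, not BioMedAgent's implementation: word-overlap (Jaccard) similarity substitutes for whatever matching the real system uses, and the threshold is an invented parameter.

```python
# Toy Memory Retrieval: cache tool chains that succeeded, then reuse
# the chain whose task description best matches a new request.
# Jaccard word overlap stands in for the real similarity measure.
def jaccard(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

class ToolChainMemory:
    def __init__(self):
        # list of (task description, tool chain that succeeded on it)
        self._store: list[tuple[str, list[str]]] = []

    def remember(self, task: str, chain: list[str]) -> None:
        """Record a tool chain that completed successfully."""
        self._store.append((task, chain))

    def retrieve(self, task: str, threshold: float = 0.3):
        """Return the best-matching cached chain, or None if nothing is close."""
        if not self._store:
            return None
        best_task, best_chain = max(self._store, key=lambda e: jaccard(task, e[0]))
        return best_chain if jaccard(task, best_task) >= threshold else None

mem = ToolChainMemory()
mem.remember("differential expression on RNA-seq counts",
             ["load_counts", "normalize", "fit_model", "plot_volcano"])
hit = mem.retrieve("differential expression analysis for my RNA-seq data")
print(hit)
```

Note what this does and does not do: the cache gets better at matching new tasks to old solutions, but nothing about the underlying model changes — which is exactly the distinction the paper's "self-evolving" label blurs.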
The Nature Biomedical Engineering paper (DOI 10.1038/s41551-026-01634-6) was published March 30, 2026, with 22 authors led by Dechao Bu of the Institute of Computing Technology and Runsheng Chen of the Institute of Biophysics, both part of the Chinese Academy of Sciences.
On a benchmark the authors built themselves — called BioMed-AQA, comprising 327 open questions across five categories (omics analysis, precision medicine, machine learning, statistics, and data visualization) — BioMedAgent achieved a 77 percent success rate. It outperformed two unnamed OpenAI web agents and a local agent, each also running on GPT-4o-mini, and it outperformed ChatGPT-4o, which achieved 47 percent on the same test.
That 30-percentage-point gap is the paper's most cited result. It is also the result most in need of scrutiny. The authors built both BioMedAgent and the BioMed-AQA benchmark they used to evaluate it. Self-evaluation is standard practice in machine learning research — benchmarks have to come from somewhere — but it means the 77 percent figure reflects performance on a test designed to measure exactly what the system was built to do. Independent evaluation would put that number in sharper relief.
One independent benchmark exists. Future House released BixBench in early 2025: 53 analytical scenarios with nearly 300 open-answer questions covering real-world bioinformatics tasks. On BixBench, frontier models achieved roughly 17 percent accuracy in the open-answer setting. The comparison is imperfect — BixBench and BioMed-AQA measure different things with different methodologies — but 17 percent against 77 percent is a reminder that the benchmark problem in biomedical AI is not solved.
What BioMedAgent demonstrably does is chain tools. It can call bioinformatics libraries like Bioconductor, execute Python scripts, query genomic databases, and generate visualizations — all through natural language. The paper shows example workflows where a user describes a goal and the system builds a multi-step pipeline to achieve it. For a biomedical researcher who knows their science but not their command line, that is a genuine capability improvement over writing scripts by hand or using graphical tools with limited flexibility.
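The execution side of such a pipeline is ordinary plumbing: run a generated script in a subprocess and capture its output so a failed step can be reported back to the planner. A minimal sketch, purely illustrative — BioMedAgent's actual sandboxing and error handling are not described here, and the "generated" script below is a hard-coded stand-in.

```python
# Run a (stand-in for an LLM-generated) analysis step in a subprocess,
# capturing stdout/stderr so failures can be surfaced to the planner.
import subprocess
import sys
import tempfile
import textwrap

script = textwrap.dedent("""
    # Stand-in for an LLM-generated analysis step.
    print("genes tested: 12000")
""")

# Write the step to a temp file, then execute it with the same interpreter.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(script)
    path = f.name

result = subprocess.run([sys.executable, path],
                        capture_output=True, text=True, timeout=60)
if result.returncode != 0:
    print("step failed:", result.stderr.strip())
else:
    print("step output:", result.stdout.strip())
```

A production agent would add resource limits and an isolated environment; the point is only that "executes Python scripts" reduces to a loop of generate, run, inspect, retry.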
The 77 percent figure was reported on the project's GitHub page and evaluation website at biomed.drai.cn. The full Nature paper is paywalled, but the existence of the publication, author list, and DOI were confirmed via Crossref.
BioMedAgent's code and datasets are open. The BioMed-AQA benchmark is on Hugging Face and Zenodo (DOI 10.5281/zenodo.17430550). That openness is real — anyone can run the benchmark, probe the memory system, or try to break the tool chains.
The paper's central contribution is not a new foundation model or a novel architecture. It is a framework for making GPT-4o-mini useful for biomedical data work by giving it tools and a memory. Whether 77 percent is the right number depends entirely on who is holding the ruler.
Artificial Intelligence · 36m ago · 3 min read