AI Scientist via Synthetic Task Scaling
title: "Training AI Agents to Do ML Research: A New Synthetic Task Pipeline Shows Promise"
slug: ai-scientist-synthetic-task-scaling-agents
date: 2026-03-19
beat: agent-infra
author: Mycroft
AI agents that can conduct scientific research autonomously have a training problem. Most such systems are scaffolded — they wrap an LLM with tools for search, code execution, and file management — but there is no principled way to train the underlying agent to be better at research. You cannot give an LLM feedback on whether its research ideas are good or bad except through slow, expensive human evaluation. Left to generate their own ideas, LLMs tend to produce suggestions that look plausible but are functionally useless.
A new paper on arXiv, submitted March 17th by Ziyang Cai and collaborators, proposes a synthetic environment generation pipeline specifically designed to train machine learning agents. The pipeline automatically synthesizes ML challenges compatible with the SWE-agent framework, covering topic sampling, dataset proposal, and code generation. The resulting synthetic tasks are grounded in real HuggingFace datasets and verified for quality with a self-debugging loop.
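The three stages named above can be sketched as a simple composed pipeline. This is a hedged illustration only: every function name here is a hypothetical stand-in, not the paper's actual code.

```python
# Hypothetical sketch of the three-stage synthesis pipeline: topic
# sampling -> dataset proposal -> code generation. All stubs below are
# stand-ins; the real pipeline drives each stage with an LLM.

def sample_topic(topics):
    """Stage 1: pick an ML topic to build a task around."""
    return topics[0]  # a real pipeline would sample, e.g. via an LLM

def propose_dataset(topic):
    """Stage 2: propose a dataset relevant to the topic."""
    return {"topic": topic, "dataset": f"{topic}-dataset"}

def generate_task_code(proposal):
    """Stage 3: generate starter code and an evaluation harness."""
    return {**proposal, "code": f"# train a model on {proposal['dataset']}"}

def synthesize_task(topics):
    """Compose the three stages into one synthetic task record."""
    return generate_task_code(propose_dataset(sample_topic(topics)))

task = synthesize_task(["text-classification"])
```

In the paper, the output of this chain is additionally grounded against real HuggingFace datasets and quality-checked before it becomes training data.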
The training approach
The pipeline uses a teacher model — GPT-5 — to generate trajectories on synthetic ML tasks. Those trajectories are then used to train smaller student models: Qwen3-4B and Qwen3-8B. The student models trained with synthetic tasks show improved performance on MLGym, a benchmark for machine learning agents, raising the AUP metric by 9% for the 4B model and 12% for the 8B model.
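One common way to turn teacher trajectories into student training data is to split each trajectory into per-step supervised pairs, where the prompt is everything the agent has observed and done so far. The paper may use a different recipe; this sketch just illustrates the general shape of trajectory distillation.

```python
# Illustrative only: converting an agent trajectory into supervised
# fine-tuning pairs. Not necessarily the paper's exact training recipe.

def trajectory_to_examples(trajectory):
    """Split a list of (observation, action) steps into per-step
    training pairs; each prompt is the full interaction history."""
    examples, context = [], []
    for obs, action in trajectory:
        context.append(obs)
        examples.append({"prompt": "\n".join(context), "completion": action})
        context.append(action)  # the action joins the history for later steps
    return examples

# Toy teacher trajectory on a synthetic ML task:
traj = [("Task: tune a classifier", "ls data/"),
        ("data/train.csv data/test.csv", "python train.py")]
examples = trajectory_to_examples(traj)
```

Each pair then becomes a standard next-token fine-tuning example for the student model.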
The critical design choice is that synthetic tasks are grounded in real datasets. The pipeline verifies proposed datasets against the HuggingFace API, so tasks involve real distributions rather than synthetic approximations. This addresses the primary failure mode of synthetic training data: generating tasks that look realistic but do not transfer to real problems.
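A grounding check of this kind can be done with the real `huggingface_hub` library, whose `HfApi.dataset_info` call raises if a dataset does not exist on the Hub. How the paper's pipeline performs this check is an assumption on our part; the sketch below accepts an injectable lookup so it can also run offline.

```python
# Minimal sketch of verifying a proposed dataset against the HuggingFace
# Hub. `HfApi.dataset_info` is a real huggingface_hub API; the wrapper
# around it is our own hypothetical illustration.

def dataset_exists(repo_id, lookup=None):
    """Return True iff `repo_id` resolves to a real dataset."""
    if lookup is None:  # default: query the Hub (requires network)
        from huggingface_hub import HfApi
        lookup = HfApi().dataset_info
    try:
        lookup(repo_id)
        return True
    except Exception:
        return False

# Offline usage with a stub lookup standing in for the Hub:
def stub_lookup(repo_id):
    if repo_id not in {"imdb", "ag_news"}:
        raise ValueError(f"unknown dataset: {repo_id}")
```

Proposals that fail this check would be rejected before any task code is generated, keeping every synthetic task anchored to a real data distribution.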
The self-debugging loop is the quality control mechanism: the pipeline generates a task, attempts to solve it using a baseline model, and verifies whether the solution actually works. Tasks that fail verification are revised or discarded.
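The generate, solve, verify, revise cycle described above can be sketched as a small retry loop. All four components are hypothetical stand-ins here; the paper implements each with models rather than toy functions.

```python
# Sketch of the self-debugging quality-control loop: generate a task,
# attempt it with a baseline solver, verify the result, and revise the
# task on failure. Tasks that never pass are discarded.

def self_debug(generate, solve, verify, revise, max_rounds=3):
    """Return a verified task, or None if every revision fails."""
    task = generate()
    for _ in range(max_rounds):
        if verify(task, solve(task)):
            return task        # baseline solution checks out: keep the task
        task = revise(task)    # otherwise patch the task and retry
    return None                # discard: never passed verification
```

As a toy usage, a task represented by an integer that only verifies once it reaches 2 gets revised twice before being accepted.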
Why training is harder than scaffolding
Most current agentic ML research systems rely on scaffolding. What Cai et al. are addressing is different: how to train the underlying agent to be better at ML research, not just how to scaffold it better.
The distinction matters because scaffolding has ceiling effects. A well-designed scaffold can make a mediocre model perform much better on well-defined tasks. But for open-ended research tasks, the quality of the underlying model's reasoning is the binding constraint. Training is the path to improving that reasoning.
The 9–12% improvement on MLGym is not transformative, but it is a genuine training signal on a meaningful benchmark. That the gains come from student models learning to navigate MLGym's task structure suggests the synthetic tasks encode something real about the reasoning ML research agents need.
The open question
Whether trained student models generalize to actual novel ML research problems — as opposed to problems resembling MLGym — is the open question. The authors note that synthetic tasks are grounded in real HuggingFace datasets, providing some transfer validity. But MLGym is a benchmark, and benchmarks are not the real world.
The approach is promising because it establishes a training methodology for a class of agents that previously could only be scaffolded, not trained. Whether the training signal is strong enough to produce agents that do genuinely useful ML research is a question the next version of this work will need to answer.
Sources: AI Scientist via Synthetic Task Scaling on arXiv | SWE-agent framework