Federated learning, the privacy-preserving technique that lets hospitals, banks, and phone makers train shared machine learning models without ever pooling their raw data, has a reputation problem: every paper seems to claim a new winning recipe, and almost none of them reproduce. NVIDIA's latest preprint suggests one way to break that pattern. Instead of trusting a single benchmark run, the team let an AI coding agent propose and rewrite federated learning algorithms, then forced every candidate through five independent evaluations with different random seeds to see which gains were actually real.
The system, called Auto-FL-Research (AFR), is not a free-roaming AI scientist. It operates inside a set of task profiles that pin down what the agent is allowed to change: the server's aggregation rule, the schedule by which clients update the model, local training objectives, and a registry of approved model variants. Everything outside that mutation surface, including compute budgets, the communication contract between server and clients, and the final evaluation harness, stays fixed. That constraint is the point. Without it, an agent can produce impressive benchmark numbers simply by reshaping the test rather than the algorithm.
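To make the idea of a bounded mutation surface concrete, here is a minimal Python sketch of how such a guard could work. It is not AFR's implementation: the `TaskProfile` class, the `out_of_surface` helper, and every file path below are hypothetical names invented for illustration. The only point is that a candidate touching anything outside the allowed list, such as the evaluation harness, can be rejected before it is ever scored.

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import List, Tuple


@dataclass(frozen=True)
class TaskProfile:
    """Hypothetical task profile: the file patterns a candidate may edit."""
    name: str
    mutable_globs: Tuple[str, ...]


def out_of_surface(profile: TaskProfile, edited_files: List[str]) -> List[str]:
    """Return the edited files that fall outside the allowed mutation surface."""
    return [
        path
        for path in edited_files
        if not any(fnmatch(path, pattern) for pattern in profile.mutable_globs)
    ]


# All paths here are made up; they mirror the four categories the article
# lists (aggregation rule, update schedule, local objective, model registry).
profile = TaskProfile(
    name="example-cross-silo",
    mutable_globs=(
        "server/aggregator.py",
        "client/schedule.py",
        "client/local_objective.py",
        "models/registry/*.py",
    ),
)

allowed_edit = out_of_surface(
    profile, ["server/aggregator.py", "models/registry/small_cnn.py"]
)
harness_edit = out_of_surface(
    profile, ["server/aggregator.py", "eval/harness.py"]
)

print(allowed_edit)  # [] -> every edit stayed inside the surface
print(harness_edit)  # ['eval/harness.py'] -> candidate should be rejected
```

A check like this is deliberately dumb. It does not judge whether an edit is good, only whether it stayed inside the lines, which is what keeps the benchmark itself from becoming something the agent can tune.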
AFR records everything it touches. For each candidate in a campaign, it keeps the score, the runtime, the files the agent edited, the artifacts the run produced, and whether the run failed outright. The result is an auditable log of how the agent searched, not just what it found.
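A per-candidate record of that kind could look something like the Python sketch below. The field names and the JSON-lines layout are assumptions made for illustration, not AFR's actual schema, and the two example records are invented. What matters is that failed runs are logged alongside successful ones, so the search history can be replayed later.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class CandidateRecord:
    """Hypothetical audit entry for one candidate recipe."""
    candidate_id: str
    score: Optional[float]  # None when the run failed before scoring
    runtime_seconds: float
    edited_files: List[str] = field(default_factory=list)
    artifacts: List[str] = field(default_factory=list)
    failed: bool = False


def to_log_line(record: CandidateRecord) -> str:
    """Serialize one record as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


# Invented example records: one successful run and one outright failure.
records = [
    CandidateRecord(
        candidate_id="cand-001",
        score=0.84,
        runtime_seconds=312.5,
        edited_files=["server/aggregator.py"],
        artifacts=["cand-001/metrics.json"],
    ),
    CandidateRecord(
        candidate_id="cand-002",
        score=None,
        runtime_seconds=12.0,
        edited_files=["client/local_objective.py"],
        failed=True,
    ),
]

log_lines = [to_log_line(r) for r in records]

# Reading the log back: which candidates failed outright?
parsed = [json.loads(line) for line in log_lines]
failed_ids = [entry["candidate_id"] for entry in parsed if entry["failed"]]
print(failed_ids)  # ['cand-002']
```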
The team's evaluation sits on two well-known federated benchmarks. FLamby supplies five healthcare cross-silo tasks, where each "client" is a different hospital or research consortium that never sees the others' data. LEAF provides five standard federated datasets plus a synthetic task, grouped into client profiles so the agent trains across simulated user populations. After the coding agent finished proposing recipes, the researchers re-ran the top candidates five times each, using a different random seed for each run. The results were mixed: improvements held up on four of the five FLamby tasks and five of the six LEAF profiles, and faded or reversed on the rest. The preprint treats those mixed outcomes as part of the contribution, not a footnote.
This is where AFR departs from the typical agentic-search paper. Most releases in this space still report a single headline number per task and call it progress. AFR's five-seed repeat is the cheap, unglamorous step that separates a real algorithmic mechanism from a tuning artifact or a lucky initialization. A reader who internalizes that habit and asks "Did they repeat it, and with what seeds?" can discount a meaningful share of the AI-research announcements they will see this year.
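The logic of a repeated-seed check fits in a few lines of Python. The sketch below is illustrative only: the scores are made-up numbers, and the acceptance rule (the mean gain must exceed the combined seed-to-seed spread) is a crude stand-in chosen for this example, not the criterion the preprint uses. It shows how a candidate that looks like a clear winner on its first seed can turn out to be noise once all five runs are averaged.

```python
import statistics
from typing import List, Tuple


def summarize(scores: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def gain_holds(baseline: List[float], candidate: List[float]) -> bool:
    """Illustrative rule: mean gain must exceed the combined seed spread."""
    b_mean, b_std = summarize(baseline)
    c_mean, c_std = summarize(candidate)
    return (c_mean - b_mean) > (b_std + c_std)


# Made-up accuracies, one per random seed (five seeds each).
baseline = [0.80, 0.81, 0.79, 0.80, 0.80]
steady_candidate = [0.84, 0.85, 0.83, 0.84, 0.85]
lucky_candidate = [0.86, 0.79, 0.80, 0.81, 0.78]  # first seed looks great

# A single-run comparison would crown the lucky candidate.
single_run_winner = lucky_candidate[0] > steady_candidate[0]

steady_ok = gain_holds(baseline, steady_candidate)
lucky_ok = gain_holds(baseline, lucky_candidate)

print(single_run_winner)  # True
print(steady_ok)          # True
print(lucky_ok)           # False
```

A more careful analysis would use a proper statistical test or confidence intervals, but even this rough rule changes which candidate gets reported.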
Two caveats matter before treating AFR as a verdict on AI-driven federated learning research. First, this is an arXiv preprint, not peer-reviewed work; the code is open in the NVFlare repository, but the methodology has not yet been vetted by an independent program committee. Second, the agent's mutation surface is bounded by its task profiles. Constraining what the agent can change is what makes the evaluation honest, but it also means AFR cannot discover the kind of architectural or systems changes a human FL operator might actually need in production. The benchmark gains are real within that surface and undemonstrated outside it.
The open question, and the one worth watching, is whether other groups adopt the same repeat-evaluation discipline. AFR's most transferable contribution is not any specific recipe it found; it is the audit trail and the five-seed reflex. If that habit spreads, the next wave of agentic ML papers will be harder to fake, and the ones that survive will be easier to trust.