AIBuildAI finishes 181st of 3,219 in Kaggle contest, earning gold-medal-level results.
An AI agent built a model that ranked higher than 94 percent of the human teams entered in a Kaggle competition — so well that a human achieving the same result would have earned a gold medal. By the time anyone published a paper about it, it had already been dethroned.
The agent, called AIBuildAI, was developed by researchers at UC San Diego. Given a task description and a dataset, it designs, trains, and submits a model entirely on its own. The result — rank 181st out of 3,219 teams on the TGS Salt Identification Challenge — demonstrated that full pipeline automation could work on a real competition dataset, not just a synthetic benchmark.
The leaderboard has since moved. On the public MLE-Bench leaderboard, a newer system called Famou-Agent 2.0 overtook AIBuildAI's 63.1 percent medal rate and sits at 64.4 percent as of today, measured across 21 realistic machine learning competitions. Several other agents cluster between 61 and 62.7 percent, all using similar hierarchical sub-agent architectures. What specifically gives Famou-Agent 2.0 an edge is not publicly documented: neither the AIBuildAI paper nor the MLE-Bench site explains the architectural or methodological difference, and Famou-Agent's own provenance is unclear from available sources.
AIBuildAI's own v2.0, announced April 27, claims 70.7 percent — a significant jump with no independent verification. The same caveat applies to the dethronement framing: Famou-Agent 2.0's 64.4 percent is confirmed on the live leaderboard, but like AIBuildAI's original result, it has not been independently audited.
The metric is straightforward: how often does an automatically generated model earn a medal across a suite of real competitions? A gold medal typically requires ranking in the top 10 percent, silver the top 20 percent, bronze the top 40 percent. AIBuildAI's 63.1 percent medal rate means it would have medaled on roughly 13 of the 21 competitions.
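In code, the metric reduces to a function of leaderboard rank. A minimal sketch, using the thresholds described above (actual Kaggle medal rules vary with field size):

```python
def medal_tier(rank: int, n_teams: int) -> str | None:
    """Return the medal a given leaderboard rank would earn, or None."""
    percentile = rank / n_teams
    if percentile <= 0.10:
        return "gold"
    if percentile <= 0.20:
        return "silver"
    if percentile <= 0.40:
        return "bronze"
    return None

def medal_rate(results: list[tuple[int, int]]) -> float:
    """Fraction of (rank, n_teams) results that earn any medal."""
    return sum(medal_tier(r, n) is not None for r, n in results) / len(results)

# AIBuildAI's TGS Salt finish: 181st of 3,219 is the top 5.6 percent,
# comfortably inside the gold band under these thresholds.
print(medal_tier(181, 3219))  # -> gold
```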
How it solved TGS Salt is worth understanding. AIBuildAI produced a two-model ensemble: a SegFormer, a vision model with a transformer backbone commonly used for parsing spatial structure in images, blended with a U-Net, a convolutional network that is a standard choice for image segmentation and good at tracing object boundaries. Each encoder was fine-tuned on the competition data. On the validation set, the blend scored 0.855 on mean average precision, with precision averaged over a sweep of intersection-over-union overlap thresholds, a standard scoring scheme for segmentation competitions. The team published the code, the trained weights, and the inference script on GitHub; anyone can run it.
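As a concrete sketch of the blend step: the most common way to ensemble two segmentation models is to average their per-pixel probabilities and threshold the result. The published pipeline may weight or threshold differently, so the equal weighting and 0.5 cutoff below are assumptions; the `iou` helper shows the overlap measure the 0.855 score is built on.

```python
import numpy as np

def ensemble_mask(prob_a: np.ndarray, prob_b: np.ndarray,
                  weight: float = 0.5, threshold: float = 0.5) -> np.ndarray:
    """Blend two per-pixel probability maps, then threshold to a binary mask.

    `weight` and `threshold` are assumed defaults, not values from the paper.
    """
    blended = weight * prob_a + (1.0 - weight) * prob_b
    return (blended > threshold).astype(np.uint8)

def iou(pred: np.ndarray, target: np.ndarray) -> float:
    """Intersection over union between two binary masks."""
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # both masks empty counts as a perfect match
    return float(np.logical_and(pred, target).sum() / union)

# TGS Salt tiles are 101x101; random maps stand in for model outputs here.
p_segformer = np.random.rand(101, 101)  # SegFormer probabilities (post-sigmoid)
p_unet = np.random.rand(101, 101)       # U-Net probabilities (post-sigmoid)
mask = ensemble_mask(p_segformer, p_unet)
```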
The practical implication is that the gap between having a dataset and having a trained model is compressing. The system requires a Linux machine, an API key, and a task description. The researchers have published definitions for protein classification, heart disease prediction, and other tasks.
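None of the available sources show the task-definition format itself, so the sketch below is hypothetical throughout; it only illustrates the kind of information such a definition has to carry, with every field name invented for the purpose.

```python
# Hypothetical task definition for the heart disease example above.
# No field name here comes from the AIBuildAI repo; this illustrates
# the inputs the agent is described as needing.
task = {
    "name": "heart-disease-prediction",
    "description": "Predict the presence of heart disease "
                   "from tabular clinical features.",
    "dataset": {
        "train": "data/heart/train.csv",   # hypothetical paths
        "test": "data/heart/test.csv",
        "target_column": "disease",
    },
    "metric": "auroc",                     # the score the agent optimizes
    "submission": "data/heart/sample_submission.csv",
}
```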
This is not yet a threat to most production ML work. The tasks where it succeeds are structured competitions — well-defined metrics, clean datasets, known answer formats. The open-ended, data-cleaning-heavy labor that fills most ML engineers' days is a different problem. But the trajectory matters. AutoML compressed the gap from feature engineering to deployed model over roughly a decade. The agent-loop approach may compress it further.
What rapid leaderboard churn means for practitioners is an open question. If the state of the art can move before a paper clears review, the conventional wisdom of waiting for peer validation before acting on a benchmark result becomes harder to sustain. For AutoML vendors, companies that sell automated machine learning pipelines to enterprises, the implications are particularly uncomfortable. Their core pitch is that clients can trust the system to pick the right model for their data. If the benchmark that justifies those choices is obsolete before the ink on the evaluation report is dry, that trust is hard to earn.
The TGS Salt result is a data point, not a destination. The leaderboard it appears on has already changed twice.
AIBuildAI is open source at github.com/aibuildai/AI-Build-AI. The paper is on arXiv at arXiv:2604.14455. The MLE-Bench leaderboard is at mlebench.com.