Trinity paper: a small evolved coordinator routes work across bigger LLMs
A Japan-based AI lab trained a roughly 600-million-parameter coordinator using evolutionary search instead of the standard gradient-based training method (backpropagation).
The race to make language models more capable has mostly meant making them larger. A new arXiv preprint, 2512.04695, argues that a different lever matters: not raw parameter count, but the way a small coordinator delegates work to bigger models. The system, called Trinity, comes from Sakana AI, a Japan-based AI research lab. The paper describes it as a compact LLM coordinator of roughly 0.6 billion parameters plus a ~10K-parameter head, trained with an evolutionary strategy rather than standard gradient descent. The authors claim it beats the individual models it routes work to across coding, math, reasoning, and domain-knowledge tasks.
That framing matters because coordination is the part of multi-model systems that usually gets hand-waved. Ensemble and routing work, including FrugalGPT-style cascades and RouteLLM-style query routers, has shown that mixing models can cut cost without losing quality. Trinity pushes the same idea further: instead of picking one model from a menu per query, the coordinator decides, on every turn, which model to call and which role that model should play.
The mechanism is the interesting part. At each turn of a multi-turn interaction, the coordinator selects an LLM from its routed set and assigns it one of three roles: Thinker, Worker, or Verifier. The Thinker plans; the Worker executes; the Verifier checks the result. Skill acquisition, in other words, is offloaded to the routed models themselves rather than baked into the coordinator. The coordinator is a router and a role assigner, not a doer. Its ~0.6B parameters and small head are used to read the hidden state of the conversation and decide who does what next, while the heavy lifting stays with the larger LLMs downstream, according to the abstract on the arXiv listing.
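The turn loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the model names, the random linear head, the `toy_encode` stand-in for the coordinator's hidden state, the `toy_call_model` stub, and the fixed turn limit are all assumptions made for the example.

```python
import random

ROLES = ("Thinker", "Worker", "Verifier")
MODELS = ("model_a", "model_b", "model_c")  # hypothetical placeholder names


def make_head(hidden_dim, n_outputs, seed=0):
    """Random linear head, one weight row per output logit (illustrative only)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(hidden_dim)] for _ in range(n_outputs)]


def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)


def assign(head, hidden):
    """Map a hidden-state vector to one (model, role) pair."""
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in head]
    model_scores = scores[: len(MODELS)]   # first block of logits scores the models
    role_scores = scores[len(MODELS):]     # remaining logits score the three roles
    return MODELS[argmax(model_scores)], ROLES[argmax(role_scores)]


def toy_encode(transcript, hidden_dim=8):
    """Stand-in for the coordinator's hidden state: a deterministic pseudo-random vector."""
    rng = random.Random(len(transcript) * 1000 + sum(len(t) for t in transcript))
    return [rng.uniform(-1.0, 1.0) for _ in range(hidden_dim)]


def toy_call_model(model, role, transcript):
    """Stub for a call to a routed LLM; a real system would query the model here."""
    return f"[{role} via {model}] turn {len(transcript)}"


def run_episode(head, encode, call_model, task, max_turns=4):
    """Each turn: read the conversation state, pick (model, role), append the reply."""
    transcript = [task]
    trace = []
    for _ in range(max_turns):  # fixed turn budget is an assumption of this sketch
        model, role = assign(head, encode(transcript))
        transcript.append(call_model(model, role, transcript))
        trace.append((model, role))
    return trace, transcript


if __name__ == "__main__":
    head = make_head(hidden_dim=8, n_outputs=len(MODELS) + len(ROLES), seed=1)
    trace, _ = run_episode(head, toy_encode, toy_call_model, "Sort a list of integers")
    for turn, (model, role) in enumerate(trace, start=1):
        print(turn, model, role)
```

The point of the sketch is the division of labor: the head only chooses who acts and in what capacity, and everything that resembles problem solving happens inside `call_model`.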
Training is the other unusual choice. The coordinator is optimized with CMA-ES, the Covariance Matrix Adaptation Evolution Strategy, a population-based black-box optimizer more common in robotics and classical control than in LLM training. The authors' argument is that the coordinator's policy is small enough, and the reward signal sparse enough, that evolutionary search over a low-dimensional policy is a reasonable match for the problem. The paper attributes Trinity's performance to two design choices, the coordinator's hidden-state representation and the role-and-turn assignment scheme, rather than to any single larger model. It frames this as a design claim, not a measured cost claim.
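To make the black-box idea concrete, here is a deliberately simplified evolution strategy in Python. It is not CMA-ES: it keeps the sample, rank, and update loop that CMA-ES shares with other evolution strategies, but replaces covariance and step-size adaptation with a fixed isotropic Gaussian and a hand-set decay. The `toy_fitness` function, `TARGET` vector, and all hyperparameters are assumptions for illustration; in a Trinity-like setting the fitness would be a task reward obtained by running the coordinator, which is exactly why no gradient is needed.

```python
import random

TARGET = [1.0, -2.0, 0.5]  # hypothetical optimum for the toy problem


def toy_fitness(params):
    """Black-box score: higher is better. Only the returned number is used."""
    return -sum((p - t) ** 2 for p, t in zip(params, TARGET))


def evolve(fitness, dim, pop_size=16, elite=4, sigma=0.5, generations=40, seed=0):
    """Simplified (mu, lambda) evolution strategy with an isotropic Gaussian."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    for _ in range(generations):
        # Sample candidate parameter vectors around the current mean.
        population = [
            [m + sigma * rng.gauss(0.0, 1.0) for m in mean] for _ in range(pop_size)
        ]
        # Rank by fitness alone; no gradients are computed anywhere.
        population.sort(key=fitness, reverse=True)
        parents = population[:elite]
        # Move the mean to the average of the best candidates.
        mean = [sum(p[i] for p in parents) / elite for i in range(dim)]
        # Crude step-size decay; CMA-ES adapts this (and a full covariance) instead.
        sigma *= 0.95
    return mean


if __name__ == "__main__":
    best = evolve(toy_fitness, dim=len(TARGET))
    print("fitness at start:", toy_fitness([0.0] * len(TARGET)))
    print("fitness after search:", toy_fitness(best))
```

Because the search touches only fitness values, the same loop works when the score comes from a sparse pass/fail signal, which is the property the authors lean on.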
The headline number is 86.2% on LiveCodeBench, a frequently updated benchmark of competitive programming problems. Trinity also reports state-of-the-art results on other coding, math, reasoning, and domain-knowledge suites, and claims robust generalization to out-of-distribution (OOD) tasks, that is, problems drawn from distributions the routed models were not specifically tuned for. Taken at face value, that is a meaningful result: a small router beating the components it routes, and beating prior multi-LLM coordination methods, on standard tests. Both the 86.2% figure and the OOD claims are author-reported in the arXiv abstract.
Several caveats apply. LiveCodeBench is a narrow slice of coding ability, not a measure of production deployment reliability. The 0.6B-plus-10K figure is a footprint, not an inference-cost number: end-to-end cost still depends on which larger LLMs Trinity is calling and how often. The evolved-coordinator result also depends on the routed set, meaning the reported number is really a property of the ensemble plus the coordinator, not of the coordinator alone. And, as with any arXiv preprint, the 86.2% figure and the SOTA framing are author-reported and not peer reviewed.
The deeper question is architectural. If a 0.6B-parameter router, evolved rather than gradient-trained, can beat the components it delegates to, that is evidence that coordination is a real design knob, and potentially a cheaper one than scaling the underlying models. It also suggests a different kind of moat for AI labs: not just who can train the largest base model, but who can build the best small brain for an ensemble of them. Whether that holds up under peer review, on benchmarks outside coding, and at production scale is the open question. Watch for the peer-reviewed version of the paper, for replication on non-coding benchmarks, and for any explicit cost-and-throughput numbers comparing Trinity to a single large model on the same hardware. The mechanism is interesting; the deployment case is still to be made. The source of record is the arXiv listing for 2512.04695.