A 28-day automated search for better neural-network ensembles explored only 4.8% of its own design space. The cause was alphabetical.

For 28 days straight, a single NVIDIA RTX 4090 ran an automated search for better neural-network ensembles, generating 4,463 candidate architectures across 197 batches and evaluating 1,021 of them. The point was to sweep a wide swath of the design space for 4-expert Mixture-of-Experts classifiers, each a small ensemble of specialised neural networks whose outputs are combined by a learned gating network. By the time the dust settled, the most useful finding was not about the ensembles at all. It was about the search itself.

The authors of the preprint, posted to arXiv as 2606.23739, built a deterministic generator that picks four base architecture families from LEMUR, an open catalogue of neural-network architectures, and assembles them into a 4-expert Mixture-of-Experts ensemble governed by a convolutional gating network that learns how to combine the experts. The generator and full campaign artefacts ship as NNGPT, an open-source Python pipeline for automated neural-architecture assembly, with an optional LangGraph-based multi-agent orchestration mode (Finetuner, Generator, Evaluator, Predictor). The theoretical search space, the set of every possible combination of four architecture families from LEMUR, contains 23,751 entries. The generator enumerated just 4.8% of them, and every combination it reached shared the same first family: AirNet, a convolutional neural-network family that sorts first in the LEMUR catalogue.
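The size of that search space can be sanity-checked with a few lines of arithmetic. The sketch below assumes the space is simply "choose 4 of n families, order ignored"; the family count it recovers is inferred from the quoted total and is not a figure stated in the text.

```python
import math

TOTAL = 23_751          # search-space size quoted in the article
EXPLORED_SHARE = 0.048  # fraction of the space the generator enumerated

# Pool size n for which C(n, 4) equals the quoted total (inferred, not quoted).
n_families = next(n for n in range(4, 200) if math.comb(n, 4) == TOTAL)

# 4-combinations that contain one fixed family: pick the other 3 from n - 1.
first_family_block = math.comb(n_families - 1, 3)

# Approximate number of combinations the campaign enumerated.
explored = round(EXPLORED_SHARE * TOTAL)

print(n_families, first_family_block, explored)
```

Under these assumptions the pool holds 29 families, 3,276 combinations contain any one given family, and 4.8% of the space is about 1,140 combinations, few enough to fit entirely inside a single family's block.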

The reason is mundane and almost funny. The pipeline enumerated candidates alphabetically with Python's itertools.combinations, which yields tuples in lexicographic order of its input: every combination containing the first family comes out before any combination without it. Because AirNet sorts first and the campaign covered only 4.8% of the space, the enumeration never left the AirNet block. Families further down the alphabet, including FractalNet and MNASNet, appeared only alongside AirNet, never in a combination without it. The 'systematic' search was systematic only inside a single slice of the design space, which is exactly the kind of coverage bias automated sweeps are meant to remove, not introduce.
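The ordering effect is easy to reproduce on a toy list. In this minimal sketch the seven-family list is illustrative only (ResNet and VGG are placeholders, not taken from the paper), and the `budget` cap is a hypothetical stand-in for whatever limited the real campaign.

```python
from itertools import combinations, islice

# Toy family list; the real LEMUR catalogue is larger.
families = sorted([
    "ShuffleNet", "MobileNetV3", "AirNet", "MNASNet",
    "FractalNet", "ResNet", "VGG",
])

# combinations() emits tuples in lexicographic order of its input.
all_combos = list(combinations(families, 4))

# A search that only consumes a prefix of that stream (hypothetical budget).
budget = 10
head = list(islice(combinations(families, 4), budget))

# Index of the first combination that does NOT contain the first family.
first_without = next(
    i for i, combo in enumerate(all_combos) if families[0] not in combo
)

print(f"{len(all_combos)} combinations in total")
print(f"first one without {families[0]} is at index {first_without}")
print(all(families[0] in combo for combo in head))
```

Any budget smaller than `first_without` produces a pool in which the alphabetically first family appears in every candidate, regardless of how good or bad that family is.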

The authors flag this themselves in the abstract and propose a fix: replace alphabetical enumeration with stratified random sampling so each family is independently represented in the candidate pool. The corrected generator ships with the open-source NNGPT release. That makes the story unusually clean: the bias was self-diagnosed, the falsifier is a one-line change to the sampler, and the patched tool is already in the same repository. No independent reproduction is needed for the mechanism claim, because the authors demonstrate it in the same paper.
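One way to read "stratified random sampling" is to treat each family as a stratum and draw a fixed number of random combinations that contain it. The sketch below follows that reading; the function name `stratified_sample`, its parameters, and the toy family list are all hypothetical, and this is not the NNGPT implementation.

```python
import math
import random

def stratified_sample(families, per_family, k=4, seed=0):
    """Draw up to `per_family` distinct k-combinations containing each family."""
    if len(families) < k:
        raise ValueError("need at least k families")
    rng = random.Random(seed)  # seeded so the pool is reproducible
    pool = set()
    for anchor in families:
        others = [f for f in families if f != anchor]
        # Never ask for more combinations than exist for this anchor.
        limit = min(per_family, math.comb(len(others), k - 1))
        drawn = set()
        while len(drawn) < limit:
            partners = rng.sample(others, k - 1)
            drawn.add(tuple(sorted([anchor, *partners])))
        pool |= drawn  # duplicates across strata collapse here
    return sorted(pool)

# Toy family list; ResNet and VGG are placeholders, not from the paper.
families = ["AirNet", "FractalNet", "MNASNet", "MobileNetV3",
            "ResNet", "ShuffleNet", "VGG"]
pool = stratified_sample(families, per_family=3)
for family in families:
    print(family, sum(family in combo for combo in pool))
```

Each family is guaranteed to appear in at least `per_family` candidates, so no family's presence in the pool depends on where its name falls in the alphabet. Because duplicates across strata are merged, the pool can be smaller than `per_family * len(families)`.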

What the campaign did find inside its narrow scope was a real but modest result: a ShuffleNet plus MobileNetV3 ensemble reached a mean accuracy of 0.632 on the benchmark, the strongest of the 1,021 evaluated configurations. The authors are explicit that this number lives inside the AirNet-anchored slice and should not be read as a search-wide winner. For a methodology story, that disclaimer is the substance: the strongest ensemble accuracy is downstream of a search whose coverage the team has now openly audited.

The takeaway for anyone running an automated neural-architecture sweep is not that alphabetical iteration is malicious or even unusual. It is built into Python's standard library. The point is that a generator whose output looks exhaustive can quietly be exhaustive only over a corner of its own space. The patch is small. The lesson is that 'systematic' deserves the same scrutiny as any other scientific adjective, and that the most informative output of a 28-day campaign can be the audit of the campaign itself.
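A coverage audit of this kind can be very small. The sketch below is one possible check, not something described in the paper: the helper `family_coverage` and the hand-written candidate list are hypothetical, and ResNet is a placeholder family included only to show an unreached stratum.

```python
from collections import Counter

def family_coverage(candidates, families):
    """Return, per family, the fraction of candidates that include it."""
    if not candidates:
        return {f: 0.0 for f in families}
    counts = Counter(f for combo in candidates for f in set(combo))
    return {f: counts[f] / len(candidates) for f in families}

families = ["AirNet", "FractalNet", "MNASNet", "MobileNetV3",
            "ResNet", "ShuffleNet"]

# Hand-written candidate pool standing in for a generator's output.
candidates = [
    ("AirNet", "FractalNet", "MNASNet", "MobileNetV3"),
    ("AirNet", "FractalNet", "MNASNet", "ShuffleNet"),
    ("AirNet", "FractalNet", "MobileNetV3", "ShuffleNet"),
    ("AirNet", "MNASNet", "MobileNetV3", "ShuffleNet"),
]

coverage = family_coverage(candidates, families)
always = [f for f, share in coverage.items() if share == 1.0]
never = [f for f, share in coverage.items() if share == 0.0]

print("in every candidate:", always)
print("in no candidate:", never)
```

A family that shows up in every candidate, or in none, is a signal that the pool reflects the enumeration order more than the design space.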
