Groups, Not Agents, Should Be the Atomic Unit of AI Topology

The way most multi-agent AI systems get built has a structural problem baked in from the start. Agents are treated as individual nodes. The topology emerges one edge at a time — add an agent, assign it a role, predict which existing agents it should communicate with, repeat. The group structures that actually make complex reasoning work — a decomposer paired with a solver paired with a verifier, the kind of division you'd sketch on a whiteboard before any code gets written — aren't modeled explicitly. They're expected to emerge from a sequence of local edge predictions that never quite have the right view of the whole.
A preprint from researchers at Griffith University's TrustAGI Lab, Hangzhou Dianzi University (HDU), and RMIT University published March 20 argues this is the wrong unit of construction. Their GoAgent system treats collaborative groups — not individual agents — as the atomic units of multi-agent topology generation. It's a conceptually tidy shift: instead of building a graph agent by agent and hoping the right clusters materialize, GoAgent enumerates task-relevant groups upfront using an LLM, then uses a learned autoregressive model to select and wire those groups into a final communication topology.
The research comes out of a lab that has been doing this work methodically. GoAgent is the third paper in a coherent series from essentially the same team. ARG-Designer — an autoregressive graph generation approach that GoAgent directly builds on — was an AAAI 2026 oral. OFA-MAS, which tackled cross-domain generalization with a mixture-of-experts architecture, appeared at WWW 2026. GoAgent, as a preprint, is one step behind both in terms of review status, but the lineage is clear.
How it works
The pipeline works like this: an LLM (the paper uses GPT-4) enumerates K=16 candidate groups per task domain. Each group has a name, an area of expertise, a set of roles, and a predefined internal topology — a "Code Debugging Group" might contain a Code Reviewer, a Syntax Checker, and a Logic Validator arranged in a sequential pipeline. Once the candidate pool is built, a learned autoregressive model selects groups and predicts the inter-group communication edges, building the final collaboration graph step by step.
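As a rough sketch of what "groups as atomic units" means in data-structure terms — the class names, fields, and helper below are illustrative, not the paper's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Group:
    """A candidate collaborative group (illustrative fields, not the paper's schema)."""
    name: str
    expertise: str
    roles: list[str]
    # Intra-group edges are fixed at enumeration time, e.g. a sequential pipeline.
    internal_edges: list[tuple[str, str]] = field(default_factory=list)

def sequential_pipeline(roles: list[str]) -> list[tuple[str, str]]:
    """Wire roles into the fixed chain a group is enumerated with."""
    return list(zip(roles, roles[1:]))

roles = ["Code Reviewer", "Syntax Checker", "Logic Validator"]
debug_group = Group(
    name="Code Debugging Group",
    expertise="debugging",
    roles=roles,
    internal_edges=sequential_pipeline(roles),
)

# The learned generator then only decides, step by step, which of the K=16
# candidates to include and which inter-group edges to add -- the intra-group
# wiring above is never revisited during generation.
selected_groups = [debug_group]   # groups chosen so far
inter_group_edges: list[tuple[str, str]] = []  # edges between selected groups
```

The point of the sketch is the division of labor: everything inside `internal_edges` is decided once at enumeration time, and the autoregressive model's action space is limited to `selected_groups` and `inter_group_edges`.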
The critical detail is that intra-group structure is fixed at enumeration time. The autoregressive generator only has to decide which groups to include and how they connect to each other — it doesn't relearn the internal topology of a debugging group every time one gets selected. That reduces the search space considerably, and it's one reason the authors claim GoAgent can be trained on as few as 40 to 60 queries per dataset.
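A back-of-envelope calculation shows why the search-space reduction is substantial. The group sizes below are assumptions for illustration (the paper specifies K=16 candidate groups, but per-group agent counts vary):

```python
def directed_edge_slots(n: int) -> int:
    """Number of possible directed edges among n nodes (no self-loops)."""
    return n * (n - 1)

# Assume 16 candidate groups of roughly 3 agents each.
num_groups = 16
agents_per_group = 3
num_agents = num_groups * agents_per_group  # 48 agents total

node_level = directed_edge_slots(num_agents)   # every agent pair is a decision
group_level = directed_edge_slots(num_groups)  # only group pairs are decisions

print(node_level, group_level)  # → 2256 240
```

Under these assumptions, a node-centric generator faces roughly an order of magnitude more edge decisions than a group-centric one, before even counting the combinatorics of which nodes to include — which is consistent with the authors' claim that far fewer training queries suffice.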
The other component worth noting is the Conditional Information Bottleneck (CIB). As topology generation proceeds autoregressively, the historical state accumulates noise — spurious co-occurrences, irrelevant context from earlier steps. CIB compresses inter-group communication features conditioned on the task query, filtering out historical noise while preserving task-relevant signals. It's applied at both group selection and edge prediction steps.
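The general shape of a variational bottleneck term can be sketched as follows. This is a generic information-bottleneck objective with a toy linear encoder, not GoAgent's architecture — shapes, the encoder, and the `beta` weight are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kl(mu: np.ndarray, logvar: np.ndarray) -> float:
    """KL( N(mu, exp(logvar)) || N(0, I) ): the compression penalty."""
    return 0.5 * float(np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar))

def cib_sketch(h_comm: np.ndarray, q_task: np.ndarray, beta: float = 0.1):
    """Toy conditional-bottleneck step: compress accumulated communication
    features h_comm into a code z, conditioned on the task query q_task.
    The linear 'encoder' weights here are random placeholders."""
    x = np.concatenate([h_comm, q_task])       # condition on the query
    W_mu = rng.standard_normal((8, x.size)) * 0.1
    W_lv = rng.standard_normal((8, x.size)) * 0.1
    mu, logvar = W_mu @ x, W_lv @ x
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(8)  # reparameterized code
    # A real system would feed z into the group-selection / edge-prediction
    # heads; the KL term is what prunes query-irrelevant history.
    return z, beta * gaussian_kl(mu, logvar)

z, penalty = cib_sketch(rng.standard_normal(16), rng.standard_normal(4))
```

The intuition the sketch captures: the KL penalty pressures the code `z` to discard whatever in the accumulated history doesn't help the task-conditioned prediction, which is the "filtering out historical noise" role the paper assigns to CIB at both the group-selection and edge-prediction steps.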
The numbers, and what they're relative to
The paper claims 93.84% average accuracy across six benchmarks — MMLU for general reasoning, GSM8K, MultiArith, SVAMP, and AQuA for math, and HumanEval for code generation. Token consumption is roughly 17% lower than the prior node-centric autoregressive baseline.
That 17% figure is relative to ARG-Designer, not to a fixed-topology baseline or some absolute measure. Compared to template-based approaches like G-Designer, which start with a dense graph and prune edges, GoAgent's efficiency argument is different — template methods are bounded by their initial agent sets and connectivity patterns, which GoAgent sidesteps entirely. The 1.96% accuracy improvement on MMLU and 2.47% improvement on HumanEval are both over ARG-Designer specifically.
The team behind it
Shirui Pan, the senior author, is an ARC Future Fellow and Professor at Griffith's School of Information and Communication Technology, co-director of the TrustAGI Lab. His research background is in graph machine learning — h-index 90, 60,000-plus citations, named a 2026 Australian Leading Researcher in AI. The pivot toward multi-agent system topology is visible across his recent work.
Yixin Liu, a co-author, is an ARC DECRA Fellow at Griffith and finished her PhD at Monash in 2024. She has three MAS topology papers across the last year — EIB-Learner at EMNLP 2025, ARG-Designer at AAAI 2026, and OFA-MAS at WWW 2026. GoAgent had not yet appeared on her personal page as of publication, which tracks given it was submitted March 20.
What's not yet here
No code repository. No peer review. The experimental benchmarks are the same six that ARG-Designer used — there's a reasonable research continuity reason for that, but it also means comparisons are self-contained within this group's evaluation framework. The 93.84% aggregate figure needs to be traced to individual benchmark numbers; the paper has the full experimental table, but a topline average can obscure variation across tasks, so the per-benchmark breakdown is worth checking before drawing conclusions.
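A toy illustration of why the aggregate alone isn't enough — both profiles below are hypothetical, chosen only to show that identical means can hide very different per-benchmark behavior:

```python
# Two hypothetical six-benchmark accuracy profiles with the same topline mean.
steady = [90.1, 89.9, 90.0, 90.2, 89.8, 90.0]
uneven = [97.0, 96.5, 95.5, 84.0, 83.5, 83.5]

def mean(xs):
    return sum(xs) / len(xs)

def spread(xs):
    return max(xs) - min(xs)

print(mean(steady), mean(uneven))    # same average (90.0)...
print(spread(steady), spread(uneven))  # ...very different spread
```

A system matching the "steady" profile and one matching the "uneven" profile would report the same average while having very different practical reliability — which is why the per-benchmark table matters more than the 93.84% headline.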
The group-centric framing also relies on good upfront group enumeration from the LLM. If the enumerated groups don't cover the relevant subtasks for a given domain — or if the K=16 pool is poorly calibrated — the autoregressive generator can only work with what it's given. The paper notes that groups are domain-specific and prompted with "domain-specific instructions," but how that degrades on novel task types isn't evaluated.
Why it matters for builders
The practical question is whether this generalizes to real deployment contexts that look nothing like the six-benchmark suite. Agent infrastructure research that shows gains on MMLU, GSM8K, and HumanEval has a decent track record of not surviving contact with production workloads — the benchmarks are tractable and well-understood, which is exactly why they're used, and exactly why the transfer question is always live.
What's worth taking seriously here is the structural argument, separate from the numbers. Node-centric topology generation has a known failure mode: group-level coordination requirements — the kind where role specialization matters at the cluster level — don't emerge cleanly from local edge predictions. If that's the actual bottleneck in practice, the group-centric approach addresses it at the right level. Whether it translates to the irregular, real-world task distributions that practitioners actually face is the open question, and there's no code yet to test it against.

