Fair cooperative AI teams have a structural flaw. When a group of agents maximizes egalitarian welfare — lifting the worst-off member as high as possible — every fair agent deliberately surrenders some surplus to help the lowest performer. A single selfish agent can sit back, take the help, and pocket the difference. The textbook fix is to take allocation out of the agents' hands entirely and hand it to a central need-based allocator. That solves the exploitation problem. It also means the team is no longer really decentralized.
A single-author preprint posted to arXiv on 4 June 2026 by Can Savcı argues this framing has been hiding the real question. The paper, "Learning to Contest: Decentralized Robust Fairness in Cooperative MARL via Cross-Attention," contends that the free-rider problem in cooperative multi-agent reinforcement learning (MARL) is not an immutable constraint on decentralized cooperation. It is a design problem with a mappable solution space — once you allow the right kind of contention, agents can defend egalitarian welfare on their own, with explicit and named limits.
The dilemma: fair teams are exploitable, central allocators are the workaround
Cooperative MARL teams trained to maximize the welfare of their worst-off member routinely learn fair, division-of-labor policies. That looks like a virtue. The paper's framing turns it into a vulnerability: the same redistribution that lifts the lowest performer is exactly what a selfish teammate can harvest for free. The numbers, as Savcı reports them, are stark. A fair team of N agents that does nothing to defend itself is roughly N times more exploitable than a team that learns to contest — best-response exploitability ρ scales with team size for the unprotected team, while the proposed method holds ρ in a range the paper reports as approximately 1.2 to 1.5 across the contention settings tested. All numerical claims in this piece are attributed to the Savcı preprint; the paper is v1, not peer-reviewed, and has no independent replication at the time of writing.
The conventional response is a central allocator that decides who gets what based on need. That is, in effect, a referee. It works, but it shifts the problem rather than solving it: the agents themselves never learn to defend their own fairness norm, and the system is only as robust as the allocator.
The pivot: graded contention, not all-or-nothing
The paper's core theoretical move is Proposition 1, which depends on an assumption the author calls "graded contention." Under graded contention, contesting has a cost: a contested resource delivers 1 − c to the contestant and wastes c. The proposition states that for any c < 1, a worst-off cooperator that contests a free-rider strictly improves on yielding. Put differently, the leverage exists. Decentralized agents do not have to take free-riding lying down; they have to be willing to pay for the contest, and as long as the contest is not winner-take-all (that is, c < 1), the worst-off member is strictly better off pushing back.
This is the conceptual reframing. The classic impossibility story says decentralized robust fairness is hard because the free-rider captures the entire surplus the fair agents gave up. Graded contention says that story was an artifact of assuming all-or-nothing contention. Once c is less than 1, a worst-off cooperator can claw back enough of the surplus to be strictly better off contesting than yielding. The question becomes: can agents realize that leverage without a central allocator watching?
The assumption that carries the result is also its main point of fragility. The author is explicit that Proposition 1 fails under winner-take-all. Robustness is not a blanket guarantee; it is a claim about a specific class of environments, the ones with measurable contest cost. The scope section later in this piece maps that boundary.
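The inequality behind Proposition 1 can be illustrated with a toy comparison. This is a minimal sketch, not the paper's model: it assumes that a worst-off cooperator who yields receives nothing from the contested resource (the constant `YIELD_PAYOFF` is a hypothetical stand-in), while one who contests receives 1 − c as described above. The function names are illustrative only.

```python
def contest_payoff(c: float) -> float:
    """Share of a contested resource the contestant receives: 1 - c, with c wasted."""
    if not 0.0 <= c <= 1.0:
        raise ValueError("contest cost c must lie in [0, 1]")
    return 1.0 - c

# Assumption for this sketch only: yielding leaves the worst-off
# cooperator with nothing from the contested resource.
YIELD_PAYOFF = 0.0

def contesting_strictly_better(c: float) -> bool:
    """True when contesting beats yielding under the toy payoffs above."""
    return contest_payoff(c) > YIELD_PAYOFF

for c in (0.0, 0.25, 0.5, 0.9, 1.0):
    print(f"c={c:.2f}  contest={contest_payoff(c):.2f}  "
          f"yield={YIELD_PAYOFF:.2f}  "
          f"contest_better={contesting_strictly_better(c)}")
```

Under these toy payoffs, contesting wins for every c below 1 and the advantage disappears exactly at c = 1, which is the winner-take-all boundary the author flags.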
The mechanism: CAN, a cross-attention policy that counts free-riders
To realize the leverage Proposition 1 says exists, Savcı introduces CAN, a permutation-equivariant cross-attention policy. Agents observe each other's behavior, and a cross-attention module aggregates those observations to infer, in effect, how many free-riders are present. The policy then responds proportionally: turn-taking when no one is defecting, contesting just enough when some are. Training is against an adversarial league built via Policy-Space Response Oracles (PSRO), so the policy is shaped by encounters with opponents that try to exploit it.
Two design choices are worth flagging. First, CAN is permutation-equivariant: the policy's behavior does not depend on which specific teammate sits in which observation slot, so reordering teammates changes nothing beyond the matching order of the outputs. That matters because it is what lets the same architecture be applied to teams of different sizes without modification. Second, the response is proportional, not categorical. CAN does not flip between "fair" and "contesting"; it modulates how much it contests based on its estimate of how much free-riding is happening.
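To make the permutation property concrete, here is a minimal sketch of a shared cross-attention aggregation in plain Python. It is not the CAN architecture from the paper: the dimensions, the random weights, and the names `cross_attend` and `joint_outputs` are all hypothetical, and there is no training, no free-rider estimate, and no policy head. It only shows why a module of this shape does not care about teammate order and accepts teams of different sizes.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(own_obs, teammate_obs, Wq, Wk, Wv):
    """One agent attends over its teammates' observed behavior.

    The query comes from the agent's own observation; keys and values come
    from the teammates. The result is a weighted average, so teammate order
    does not matter and any positive number of teammates is accepted.
    """
    q = matvec(Wq, own_obs)
    keys = [matvec(Wk, o) for o in teammate_obs]
    values = [matvec(Wv, o) for o in teammate_obs]
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

def joint_outputs(all_obs, Wq, Wk, Wv):
    """Apply the same shared module to every agent in the team."""
    return [cross_attend(o, all_obs[:i] + all_obs[i + 1:], Wq, Wk, Wv)
            for i, o in enumerate(all_obs)]

def close(u, v, tol=1e-9):
    """Element-wise comparison that tolerates floating-point reordering."""
    return all(math.isclose(a, b, abs_tol=tol) for a, b in zip(u, v))

rng = random.Random(0)
OBS_DIM, HID_DIM = 3, 4  # hypothetical sizes, chosen for illustration

def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Untrained, random projection weights shared by all agents.
Wq, Wk, Wv = (rand_matrix(HID_DIM, OBS_DIM) for _ in range(3))

team = [[rng.uniform(0.0, 1.0) for _ in range(OBS_DIM)] for _ in range(4)]
outs = joint_outputs(team, Wq, Wk, Wv)
outs_rev = joint_outputs(team[::-1], Wq, Wk, Wv)

# Reversing the team should simply reverse the per-agent outputs.
equivariant = all(close(outs_rev[i], outs[len(team) - 1 - i])
                  for i in range(len(team)))
print("permutation-equivariant on this example:", equivariant)

# The same weights run unchanged on a larger team.
big_team = team + [[rng.uniform(0.0, 1.0) for _ in range(OBS_DIM)]
                   for _ in range(2)]
big_outs = joint_outputs(big_team, Wq, Wk, Wv)
print("agents in larger team:", len(big_outs))
```

The sketch runs on a larger team without any change to the weights, which is the architectural sense in which such a module scales. Whether a trained policy still behaves well at the new size is a separate, empirical question.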
The evidence: proximity to a centralized oracle, and a structural baseline failure
Savcı reports two headline empirical results. The first is exploitability: best-response ρ holds in the 1.2–1.5 range rather than scaling with team size. The second is efficiency — the fraction of the centralized oracle's welfare the team actually achieves. At zero contention (D = 0), CAN wastes almost nothing and reaches efficiency close to 1.0. As contention increases (D ≥ 1), efficiency drops into the 0.83–0.96 range the paper reports. The cost is named, not hidden: defending fairness costs some welfare at high contention, and the centralized oracle is still the upper bound on raw efficiency.
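The efficiency figure is a ratio, and a small sketch shows how it could be computed. This assumes egalitarian welfare is the minimum per-agent return and that efficiency is the team's welfare divided by the oracle's; the paper may define either quantity differently. The return values are made up for illustration and are not results from the preprint.

```python
def egalitarian_welfare(returns):
    """Welfare of the worst-off agent (assumed here to be the minimum return)."""
    if not returns:
        raise ValueError("need at least one agent return")
    return min(returns)

def efficiency(team_returns, oracle_returns):
    """Fraction of the oracle's egalitarian welfare that the team achieves."""
    oracle_welfare = egalitarian_welfare(oracle_returns)
    if oracle_welfare <= 0:
        raise ValueError("oracle welfare must be positive for a meaningful ratio")
    return egalitarian_welfare(team_returns) / oracle_welfare

# Hypothetical per-agent returns, not taken from the paper.
oracle_returns = [10.0, 10.0, 10.0, 10.0]
decentralized_returns = [9.0, 9.5, 10.0, 9.2]

print("oracle welfare:", egalitarian_welfare(oracle_returns))
print("team welfare:  ", egalitarian_welfare(decentralized_returns))
print("efficiency:    ", efficiency(decentralized_returns, oracle_returns))
```

Because the measure tracks only the worst-off agent in this sketch, raising any other agent's return leaves efficiency unchanged, which is what makes an egalitarian objective sensitive to a single exploited teammate.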
The baseline comparison in the paper is the more interesting story. The fair-MARL learners Savcı compares against do not all fail in the same way. GGF and FEN, two established fair cooperative learners, yield to free-riders and remain exploitable. SOTO, by contrast, contests everything and stays robust, but it pays for that robustness by wasting welfare even when no one is free-riding. Each baseline is exposed on a complementary axis. CAN is reported as both efficient at D = 0 and robust under exploitation. The comparison is structural, not a cherry-pick: the baselines fail in different ways, and the proposed method is presented as the first to cover both axes without a central allocator.
The scope: where this holds, and where it does not
The paper's own framing is "clear scope, not blanket generality." Three scope conditions belong in any honest read of the result.
First, Proposition 1 fails under winner-take-all. If c = 1, a contested resource delivers nothing to the contestant, and the worst-off cooperator is no longer strictly better off contesting. The reframing is conditional on graded contention. Environments where a contest is total rather than partial fall outside the result.
Second, the empirical robustness is reported as strong in proportion to contest leverage. The paper's headline testbed is a multi-server game, where a free-rider's presence reduces a contested resource's yield and there is real cost to contesting. As that leverage weakens — for instance, in games where free-riding does not waste much — the reported margin shrinks. The paper does not claim uniform robustness across all cooperative games; it claims robustness where the leverage assumption is met.
Third, the method's zero-shot transfer to larger teams degrades at high contention. The cross-attention architecture is permutation-equivariant, which is what makes transfer architecturally possible. The paper reports that this transfer is not free: when CAN is trained on a team of a given size and evaluated on a larger one under high contention, performance drops. The architecture scales; the learned policy does not automatically scale with it.
These are not caveats to bury. They are the shape of the result. A reader who treats "decentralized robust fairness" as solved will misread the paper. A reader who treats it as "solved in proportion to contest leverage, with named transfer limits, and broken under winner-take-all" is reading the paper the author wrote.
The cost: efficiency at D ≥ 1
The efficiency figures deserve a second look as a cost. CAN wastes almost nothing when no one is free-riding (efficiency ≈ 1.0 at D = 0). At D ≥ 1, efficiency drops to the 0.83–0.96 range the paper reports. That is the price of robustness in this design. The centralized oracle is still strictly more efficient at high contention, because the oracle can allocate without paying the contest cost. CAN trades some of that efficiency for being able to defend itself without an allocator. The trade is the point, not a flaw to apologize for.
What changes for cooperative AI
The standard story about free-riding in cooperative AI has two endings. Either the team learns a fair policy and accepts exploitation, or a central allocator steps in and removes the agents' autonomy. The Savcı preprint argues for a third ending, in which the team learns a policy that infers how many free-riders are present and contests them proportionally, paying a measured efficiency cost at high contention and keeping autonomy otherwise. The result is bounded by the contest-leverage assumption, by the winner-take-all boundary, and by the paper's own reported limits on zero-shot transfer to larger teams. Within those bounds, the free-rider problem in cooperative MARL looks less like an impossibility theorem and more like a coordination problem with a mappable design space.
The numbers, the architecture, and the baselines all come from a single v1 preprint by Can Savcı, arXiv:2606.06162, submitted 4 June 2026. The paper is not peer-reviewed, has no independent replication, and no third-party coverage was located at the time of writing. Treat the empirical claims as the author's reported results, not as established facts about cooperative multi-agent learning.