For years, the limiting factor in AI training has been compute — GPU time, electricity, data center racks. A paper from researchers at Peking University, Shanghai Artificial Intelligence Laboratory, Nankai University, and the University of Wisconsin-Madison makes a case that the bottleneck is about to move.
EasyRL, accepted to the Findings of ACL 2026, describes a reinforcement learning pipeline — reinforcement learning being a trial-and-error method in which models improve by optimizing for rewards — that matches or beats state-of-the-art methods using just 10 percent of the labeled data those methods require. The rest of the labels are generated automatically. If that claim holds up outside the paper, it means labs pouring resources into exhaustively labeled datasets may be solving the wrong problem.
The mechanism is a three-stage pipeline the authors call Knowledge Transfer, Divide-and-Conquer Pseudo Labeling, and Difficulty-Progressive Self-Training. It starts with a small set of easy labeled examples — the kind that are cheap to annotate because the answers are unambiguous. A warm-up model trained on those examples then labels harder problems automatically, but only low-uncertainty outputs, where the model is confident across multiple attempts, get accepted as training data. High-uncertainty cases get discarded. The cycle repeats, pushing the model toward harder territory with each iteration.
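The confidence filter at the center of that loop can be sketched in a few lines. This is a hypothetical reading, not the authors' code: the function names, the attempt count, and the agreement threshold are assumptions, with multi-attempt answer agreement standing in for whatever uncertainty measure the paper actually uses.

```python
from collections import Counter


def pseudo_label(problems, sample_answer, n_attempts=8, min_agreement=0.75):
    """Sketch of confidence-filtered pseudo-labeling (illustrative only).

    The warm-up model answers each unlabeled problem several times.
    A problem is kept as training data only when one answer dominates
    (low uncertainty); problems with scattered answers are discarded.
    """
    accepted, discarded = [], []
    for problem in problems:
        answers = [sample_answer(problem) for _ in range(n_attempts)]
        top_answer, count = Counter(answers).most_common(1)[0]
        if count / n_attempts >= min_agreement:
            accepted.append((problem, top_answer))  # low-uncertainty: keep
        else:
            discarded.append(problem)               # high-uncertainty: drop
    return accepted, discarded
```

Run over an easy problem the model answers consistently and a hard one where its answers scatter, only the first survives the filter — which is the whole point of anchoring later iterations on confident outputs.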
The inspiration comes from an unexpected place: Lev Vygotsky's Zone of Proximal Development, a theory of how children learn, developed in the 1930s and best known from its 1978 English publication. The paper cites it directly, arguing that the easy-to-hard progression in EasyRL mirrors how humans acquire skills — building foundations on tractable cases before moving to harder problems. A separate framework called AERO, published on arXiv in February 2026, arrived at the same cognitive-science framing independently, which suggests the ZPD framing is becoming a genuine research thread rather than a post-hoc justification.
The pressure to cut labeling costs is real even if independent validation is not. Reinforcement learning pipelines that rely on unsupervised methods — generating their own reward signals without human labels — are susceptible to two documented failure modes. The first is reward hacking: the model finds a shortcut that looks correct according to its automated judges but does not generalize. The second is model collapse: when models train on data they or similar models generated, rare patterns start disappearing from their outputs. A 2024 paper in Nature showed model collapse is not a theoretical risk. It is an empirical result.
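The collapse dynamic is easy to reproduce in miniature. The toy below is not from the Nature paper; it reduces "training on your own outputs" to resampling a corpus with replacement, generation after generation, and tracks how many distinct items survive. Rare items are the first to vanish.

```python
import random


def resample_generations(data, generations=5, seed=0):
    """Toy model-collapse illustration (an analogy, not a training run).

    Each 'generation' is built by sampling with replacement from the
    previous generation's output. Rare items are unlikely to be drawn,
    so the count of distinct items can only shrink over iterations.
    """
    rng = random.Random(seed)
    current = list(data)
    diversity = [len(set(current))]  # distinct items per generation
    for _ in range(generations):
        current = [rng.choice(current) for _ in range(len(current))]
        diversity.append(len(set(current)))
    return diversity


# A corpus dominated by one common item, with a long tail of rare ones.
corpus = ["common"] * 90 + [f"rare_{i}" for i in range(10)]
```

Because every generation draws only from the previous one, no new item can ever appear; the diversity curve is monotonically non-increasing, which is the mechanism behind rare patterns disappearing.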
The standard fix for both problems is human annotation — expensive, slow, but reliable. EasyRL tries to split the difference by using a small initial labeled set as an anchor and then automating the rest, filtering for confidence at each step. The paper claims this approach avoids both failure modes while using a fraction of the labeled data.
Findings of ACL is a genuine peer-reviewed venue, which is more than most arXiv preprints can claim. The code is on GitHub, so the method can be tested. But every new training framework produces strong benchmark results in its own paper, and independent replication has not yet materialized for EasyRL. Whether the 10 percent claim holds on different hardware, different data, or a different workload is the question that matters — and nobody has answered it yet.
What the paper does establish is that the question is worth asking. As RL pipelines get more sophisticated and data efficiency improves, the competitive question shifts from "how much data can you afford to label?" to "who knows which easy examples are worth labeling in the first place?" Annotation is not commodity work when it matters. Identifying genuinely instructive training cases — the kind that transfer to harder problems — requires judgment that does not scale linearly with headcount. If data-efficient RL becomes standard, the labs and annotation services with the deepest understanding of what good training data looks like for specific domains are the ones with leverage.
EasyRL is an academic paper from institutions that are serious but not among the handful of labs that typically set the field's direction. Whether it ends up in a production training pipeline somewhere depends on whether someone with a real RL workload runs it honestly and reports back. Watch for that report. It will be more informative than the paper itself.