The standard story of AI training has been about data: more text, more code, more examples. Researchers building the current generation of reasoning and agentic models are running into a different ceiling, one that more raw data cannot fix. As models get better, the practice problems in any fixed corpus stop teaching them anything new. What now decides how much a model improves is the supply of problems calibrated to what the model can actually learn next.
A new preprint introducing a framework called PROPEL makes that constraint concrete. PROPEL trains a small auxiliary model, an "activation probe," to predict how a candidate solver will fare on a generated task before anyone runs the solver, then uses that probe to steer a task generator toward a target difficulty. The headline result: on coding tasks for a Qwen2.5-3B-Instruct solver, the share of generated problems landing near the target solve rate rose from 10.1% with a baseline generator to 20.0% with PROPEL. The framing matters. A higher share of tasks at the target solve rate is not the same as a better agent, and the paper does not claim that PROPEL-generated tasks, on their own, produce a stronger final model.
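To make the metric tangible, here is a minimal sketch of how a "share near target solve rate" figure could be computed from per-task pass rates. The function name, the 0.5 target, the 0.1 tolerance band, and the example pass rates are all illustrative assumptions; the preprint's exact definition of "near" is not given in the material summarized here.

```python
def share_near_target(pass_rates, target=0.5, tolerance=0.1):
    """Fraction of tasks whose pass rate lies within `tolerance` of `target`.

    `target` and `tolerance` are assumed values for illustration only.
    """
    if not pass_rates:
        raise ValueError("need at least one task")
    hits = sum(1 for p in pass_rates if abs(p - target) <= tolerance)
    return hits / len(pass_rates)


# Hypothetical per-task pass rates (solved attempts / total attempts).
# Most tasks are trivially solved (1.0) or never solved (0.0).
example_rates = [0.0, 0.0, 1.0, 0.5, 0.9, 0.1, 1.0, 0.0, 0.45, 1.0]
share = share_near_target(example_rates)
print(f"share near target: {share:.1%}")
```

In this toy batch only two of ten tasks sit in the band, which is the kind of low hit rate a poorly calibrated generator would produce.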
The bottleneck PROPEL targets will sound familiar to anyone who has watched a capable student grind through a stack of textbook problems that are either trivial or impossible. Fixed task distributions saturate as models improve. Naive synthetic generation fills the gap with tasks that are too easy, too hard, or simply ill-posed. The paper's authors argue that the productive level is a moving target: the set of tasks a given model is just barely able to learn, which they call the learnable frontier. Hold a generator steady and the frontier drifts out of reach; move it without calibration and the tasks stop being useful.
PROPEL's trick is to amortize that calibration. The researchers train the probe once, on a labeled corpus of generated tasks paired with solver outcomes, teaching it to predict a target solver's pass rate from the activations of a frozen reference copy of the task generator. During generator training, the probe stands in for the solver. The generator learns to maximize the share of its outputs that the probe rates near a chosen pass-rate target, typically the level where a model is challenged but still able to make progress.
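The idea of a learned stand-in for the solver can be sketched in a few lines. The example below is not the paper's implementation: it assumes a logistic-regression probe, uses random synthetic vectors in place of real generator activations, and invents a 0.5 target with a 0.1 tolerance for the reward. All names (`train_probe`, `probe_reward`, `TRUE_WEIGHTS`) are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features):
    """Probe's predicted pass rate for one task's feature vector."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def train_probe(examples, dim, lr=0.5, epochs=500):
    """Fit a logistic-regression probe to (features, pass_rate) pairs
    with full-batch gradient descent on the cross-entropy loss."""
    weights, bias = [0.0] * dim, 0.0
    n = len(examples)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for features, pass_rate in examples:
            err = predict(weights, bias, features) - pass_rate
            for i, x in enumerate(features):
                grad_w[i] += err * x
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def probe_reward(weights, bias, features, target=0.5, tolerance=0.1):
    """1.0 if the probe predicts a pass rate near the target, else 0.0.
    No solver is run; the probe's prediction is the only signal."""
    near = abs(predict(weights, bias, features) - target) <= tolerance
    return 1.0 if near else 0.0


# Synthetic stand-in data: random vectors play the role of activations,
# and a hidden linear rule plays the role of the solver's true pass rate.
random.seed(0)
DIM = 3
TRUE_WEIGHTS = [2.0, -1.5, 0.5]  # hypothetical ground truth


def make_example():
    features = [random.gauss(0.0, 1.0) for _ in range(DIM)]
    z = sum(w * x for w, x in zip(TRUE_WEIGHTS, features))
    return features, sigmoid(z)


train_set = [make_example() for _ in range(200)]
held_out = [make_example() for _ in range(50)]

weights, bias = train_probe(train_set, DIM)
mean_abs_error = sum(
    abs(predict(weights, bias, f) - y) for f, y in held_out
) / len(held_out)
print(f"held-out mean absolute error: {mean_abs_error:.4f}")
```

The labeled corpus is paid for once, when solver outcomes are collected; after that, a generator-training loop would call something like `probe_reward` on each sampled task and average the results, rather than launching solver runs for every candidate.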
The reported results span three domains: math, code, and software engineering. On software engineering tasks drawn from repositories unseen during probe and generator training, the same share-near-target comparison for a Qwen3.5-27B solver moved from 9.8% to 19.6%, according to the preprint. In both the coding and software engineering settings, PROPEL roughly doubled the share of generated tasks that land where the solver is learning rather than coasting or failing.
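The "roughly doubled" reading can be checked directly from the two reported pairs of figures. The short calculation below uses only the percentages quoted in this article; the dictionary keys are labels chosen here for readability.

```python
# Share of generated tasks near the target solve rate, in percent:
# (baseline generator, PROPEL), as quoted from the preprint.
reported = {
    "coding, Qwen2.5-3B-Instruct": (10.1, 20.0),
    "software engineering, Qwen3.5-27B": (9.8, 19.6),
}

ratios = {}
for setting, (baseline, propel) in reported.items():
    ratios[setting] = propel / baseline
    gain_points = propel - baseline
    print(f"{setting}: {ratios[setting]:.2f}x, +{gain_points:.1f} points")
```

Both ratios come out at or just under 2, so "roughly doubled" is a fair summary, though in absolute terms about four in five generated tasks still miss the target band in each setting.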
The usual limits of a non-peer-reviewed arXiv posting also apply. Author affiliations and the full details of the evaluation domains were not captured in the abstract excerpt available at the time of writing. PROPEL's claim of single-forward-pass generator evaluation is the paper's own framing, and independent validation of the framework's impact outside the authors' setup is not yet on the record.
What the work does establish is the shape of the next problem in post-training. Once data volume stops being the bottleneck, the question shifts from how much practice a model can read to how well-engineered that practice is. PROPEL is one early demonstration that the second question can be attacked with learned proxies rather than brute-force solver runs, and that the supply of "just hard enough" problems can be tuned with the same kind of feedback loop that trains the solvers themselves.