Every enterprise running a fleet of AI agents is paying humans to do the learning the agents are supposed to automate. A new arXiv preprint from researchers at Ant Group, HKUST, and Tsinghua argues that the bottleneck is not model size, training data, or optimizer cleverness. It is a three-layer infrastructure gap that keeps deployed agents frozen and forces companies into a slow, manual loop of inspect, edit, retrain, redeploy.
The paper, titled "Next-Generation Agentic Reinforcement Learning Systems Enable Self-Evolving Agents" and posted 1 July 2026 with a v2 update the next day, focuses on the production reality most enterprise AI vendors do not put in their pitch decks. Coding assistants, customer-support chatbots, and research assistants built on large language models ship with frozen weights, frozen system prompts, frozen tool menus, and frozen "harnesses" (the scaffolding that lets the model call other software and remember context). Anything that changes after deployment requires a human to collect curated data, fine-tune the model offline, adjust the agent's logic, and ship a new version.
That manual loop is expensive and slow. It is also, the authors argue, the wrong shape for the next stage of agent capability. Recent academic and industry work on what researchers call "self-evolving agents" suggests the largest gains will come from agents that learn continuously from their own work. The paper notes that early prototypes, including "OpenClaw", a system the authors built for individual users, already hint at the pattern. The barrier at enterprise scale is not the learning algorithm. It is the missing system around it.
The paper names three missing layers.
First, there is no standardized way to record what an agent actually did. An enterprise agent might call a code interpreter, query a database, hand off to another model, and write back to a user, all in a single task. Each step carries a different learning signal (a reward, a failure, a tool error, a successful tool call), but today's logs treat the whole trajectory as a black box. Without a protocol that captures the right signal at the right granularity, downstream training has nothing to learn from.
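To make the idea of step-level recording concrete, here is a minimal sketch in Python of what one trajectory record could look like. It is not the paper's protocol: the class names (`Step`, `Trajectory`), the `StepKind` and `Signal` values, and the field layout are all hypothetical, chosen only to mirror the step types and signals listed above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class StepKind(Enum):
    # Hypothetical step types, mirroring the examples in the text.
    TOOL_CALL = "tool_call"
    MODEL_HANDOFF = "model_handoff"
    USER_REPLY = "user_reply"


class Signal(Enum):
    # Hypothetical signal labels, one per learning signal named in the text.
    REWARD = "reward"
    FAILURE = "failure"
    TOOL_ERROR = "tool_error"
    TOOL_SUCCESS = "tool_success"


@dataclass
class Step:
    kind: StepKind
    name: str                       # e.g. "code_interpreter" or "sql_query"
    signal: Signal
    reward: Optional[float] = None  # only set when a numeric reward exists


@dataclass
class Trajectory:
    task_id: str
    steps: List[Step] = field(default_factory=list)

    def record(self, step: Step) -> None:
        """Append one step so the signal stays attached to the action."""
        self.steps.append(step)

    def signal_counts(self) -> Dict[str, int]:
        """Count how often each signal occurred within this task."""
        counts: Dict[str, int] = {}
        for step in self.steps:
            counts[step.signal.value] = counts.get(step.signal.value, 0) + 1
        return counts


# Usage: one task that touches a code interpreter, a database, and the user.
traj = Trajectory(task_id="task-001")
traj.record(Step(StepKind.TOOL_CALL, "code_interpreter", Signal.TOOL_SUCCESS))
traj.record(Step(StepKind.TOOL_CALL, "sql_query", Signal.TOOL_ERROR))
traj.record(Step(StepKind.USER_REPLY, "final_answer", Signal.REWARD, reward=1.0))
```

The point of the sketch is that each signal is stored next to the step that produced it, so a later training job could tell a failed database query apart from a successful final answer instead of seeing a single pass or fail for the whole task.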
Second, there is no enterprise-grade proxy that turns that messy daily work into clean, governed training data. Production logs are full of personally identifiable information, secrets, partial failures, and one-off hacks. A usable learning substrate needs filtering, governance, deduplication, and replay. That middle layer is what MLOps vendors have built for model training, but the paper argues agents need a different version of it: one that operates at task-trajectory scale, not batch scale.
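A rough sketch of the filtering and deduplication part of such a proxy might look like the following. Everything here is an illustrative assumption rather than anything the paper specifies: the regular expressions are deliberately simplistic, the `completed`, `text`, and `task_id` fields are a made-up log format, and governance and replay are left out entirely.

```python
import hashlib
import re
from typing import Dict, Iterable, List

# Simplistic, illustrative patterns. Real PII and secret detection needs far more.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SECRET_RE = re.compile(r"\b(?:sk|key|token)-[A-Za-z0-9]{8,}\b")


def redact(text: str) -> str:
    """Replace email addresses and secret-looking tokens with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return SECRET_RE.sub("[SECRET]", text)


def clean_trajectories(raw: Iterable[Dict]) -> List[Dict]:
    """Drop incomplete tasks, redact sensitive strings, and remove duplicates."""
    seen = set()
    cleaned = []
    for traj in raw:
        if not traj.get("completed", False):
            continue  # skip partial failures
        text = redact(traj["text"])
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # identical content after redaction
        seen.add(digest)
        cleaned.append({"task_id": traj["task_id"], "text": text})
    return cleaned


# Usage with a hypothetical log format.
raw_logs = [
    {"task_id": "a", "text": "mail bob@example.com with sk-abcdef123456", "completed": True},
    {"task_id": "b", "text": "mail bob@example.com with sk-abcdef123456", "completed": True},
    {"task_id": "c", "text": "half-finished run", "completed": False},
]
result = clean_trajectories(raw_logs)
```

In this sketch deduplication runs after redaction, so two trajectories that differ only in the redacted strings collapse into one record. Whether that is desirable depends on the governance policy, which this example does not model.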
Third, and most importantly, there is no control plane that decides what to update. An improving agent can change its memory, its skills library, its system prompt, its tool roster, or its underlying model weights. Each has a different blast radius. Updating a system prompt is reversible in seconds; updating a model's weights is a multi-day fine-tune with its own safety review. The paper argues the missing layer is software that watches production statistics and chooses, automatically, which lever to pull and when.
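One way to picture that decision is a policy that prefers the smallest-blast-radius lever whose trigger has fired. The sketch below is an assumption-heavy illustration, not the paper's design: the metric names, the thresholds, and the ordering of the middle three levers are all invented for the example. The article only states that a system prompt is cheap to change and model weights are expensive.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Lever(IntEnum):
    # Ordered by assumed blast radius, smallest first (ordering is hypothetical).
    MEMORY = 1
    SKILLS = 2
    SYSTEM_PROMPT = 3
    TOOL_ROSTER = 4
    MODEL_WEIGHTS = 5


@dataclass
class ProductionStats:
    # Hypothetical production metrics, each a rate between 0 and 1.
    repeated_question_rate: float = 0.0
    missing_skill_rate: float = 0.0
    instruction_violation_rate: float = 0.0
    tool_error_rate: float = 0.0
    task_failure_rate: float = 0.0


# (lever, metric name, threshold). The thresholds are made up for illustration.
RULES = [
    (Lever.MEMORY, "repeated_question_rate", 0.10),
    (Lever.SKILLS, "missing_skill_rate", 0.10),
    (Lever.SYSTEM_PROMPT, "instruction_violation_rate", 0.05),
    (Lever.TOOL_ROSTER, "tool_error_rate", 0.05),
    (Lever.MODEL_WEIGHTS, "task_failure_rate", 0.20),
]


def choose_lever(stats: ProductionStats) -> Optional[Lever]:
    """Return the cheapest lever whose metric exceeds its threshold, if any."""
    triggered = [
        lever for lever, metric, threshold in RULES
        if getattr(stats, metric) > threshold
    ]
    return min(triggered) if triggered else None


# Usage: tool errors and task failures are both elevated.
stats = ProductionStats(tool_error_rate=0.10, task_failure_rate=0.30)
decision = choose_lever(stats)
```

Here the policy returns the tool-roster lever, because it sits lower in the assumed blast-radius ordering than a weight update. A real control plane would also need rollback, approval gates, and a way to escalate when the cheap lever does not help, none of which this sketch attempts.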
The authors sketch what that looks like with AReaL2.0, a scoped instantiation built on the existing AReaL open-source framework from Tsinghua's IIIS lab and Ant Group's AReaL team. AReaL2.0 routes live agent calls through an online reinforcement-learning service so interaction traces can train future model updates without a separate offline pipeline. Recent AReaL releases have added features such as token-masking configurations (the "KPop" and "IcePop" presets) and integration with NVIDIA's TensorRT-LLM Scaffoldings for separating execution, reward, and trajectory collection. Those are infrastructure moves, not model moves.
The paper is candid that AReaL2.0 is "one narrow version" of the broader vision. It addresses only the policy-weight update path, not the full control plane. Independent benchmarks, customer deployments, and safety or governance audits of AReaL2.0 do not yet exist, so any production-adoption claim would be premature. The same goes for the more aggressive reading, common in curator commentary on social media, that the paper proves "self-improving agents" are arriving. The honest read is narrower: enterprise AI is bottlenecked by infrastructure, and the next competitive moat will belong to the team that builds the control plane that decides what an agent is allowed to change about itself, and when.