Cursor Tab retrains every 90 minutes. That is what AI "learning on the job" actually looks like.
The code editor's autocomplete model is retraining itself on 400 million user requests a day.
Cursor's Tab, the ghost-text autocomplete in its code editor, retrains its model roughly every ninety minutes from 400 million-plus daily requests, according to the company's engineering blog. That is the clearest proof of life so far for AI that genuinely learns on the job, and it is shipping at scale.
The shipped product is more interesting than the prediction. In a recent monologue on his blog, podcaster Dwarkesh Patel argued that the next AI breakthrough will come from training models on millions of verifiable tasks across thousands of reinforcement-learning environments, with sample inefficiency amortized across billions of deployment sessions. That argument is a thesis, not an experiment. Cursor Tab is the experiment.
The mechanism underneath is called on-policy self-distillation, sometimes shortened to OPSD or OPD. The idea, as Thinking Machines Lab describes it, is to align a base model's token-by-token predictions with the predictions of a "veteran teacher" model that has accumulated context from an extended session. It is dense supervision, one signal per token, without an outer-loop verifiable reward function. Where reinforcement learning with verifiable rewards (RLVR) needs an external checker to grade each answer, OPSD uses the session-accumulated model itself as the teacher.
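To make "one signal per token" concrete, here is a minimal sketch of a per-token distillation loss. It assumes a reverse-KL objective, KL(student || teacher), which is one common choice for on-policy distillation; the source does not say which loss any lab uses. The function names and the toy logits are hypothetical, and a real system would compute this over a full vocabulary on sequences sampled from the student.

```python
import math

def softmax(logits):
    """Turn raw logits into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def per_token_reverse_kl(student_logits, teacher_logits):
    """Return KL(student || teacher) at every token position.

    Each argument is a list of per-position logit rows over the same toy
    vocabulary. Assumption: the teacher is the same model scored with extra
    session context, and the positions come from the student's own samples.
    """
    losses = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        p = softmax(s_row)  # student distribution at this position
        q = softmax(t_row)  # teacher distribution at this position
        kl = sum(pi * (math.log(pi) - math.log(qi))
                 for pi, qi in zip(p, q) if pi > 0)
        losses.append(kl)
    return losses

# Toy example: two positions, three-token vocabulary (made-up numbers).
student = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
teacher = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
print(per_token_reverse_kl(student, teacher))
```

The first position contributes no loss because student and teacher already agree; the second contributes a positive loss because the teacher, with more context, prefers a token the student is unsure about. Every position yields a training signal, and no external checker is involved.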
The approach matters for one reason: it writes to weights. In-context learning, where a model absorbs information through its prompt or KV cache, does not persist. The session ends, and what the model learned is gone. Weight updates persist. That is how the mechanic differs from older "fast weights" or KV-cache approaches: it actually rewrites the model rather than the context. On-policy distillation is one of the more active research directions in RL-for-LLM work right now.
This is where the evidence thins out. A thread on r/MachineLearning and adjacent community commentary describe OPSD as the key post-training technique behind Qwen 3.6 and 3.7, GLM-5.1, and DeepSeek-V4. Those attributions are analyst commentary, not company confirmation. None of the three model providers has published a release note naming OPSD as the headline post-training method. Treat the Chinese-lab claim as community chatter until the labs speak for themselves.
Cursor's product, by contrast, is auditable in its own narrow way. The Tab model learns online by predicting edit-accept behavior, and the company publishes both traffic and cadence. Four hundred million requests a day is not a benchmark number; it is a deployment count. At roughly ninety minutes, the retraining cycle is short, which means the model is exposed to fresh signal far more often than in a typical fine-tune-and-freeze pipeline. If "learning on the job" is the goal, Cursor has the densest deployment-to-weight loop in production AI today.
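The published traffic and cadence figures imply a rough per-cycle volume. The back-of-envelope sketch below uses the article's round numbers and assumes traffic is spread evenly across the day, which real usage will not be. It counts requests arriving per cycle, not training examples; the source does not say what fraction of requests feeds a retrain. The function names are illustrative.

```python
MINUTES_PER_DAY = 24 * 60

def cycles_per_day(cycle_minutes):
    """How many retraining cycles fit in one day at a fixed cadence."""
    return MINUTES_PER_DAY / cycle_minutes

def requests_per_cycle(requests_per_day, cycle_minutes):
    """Requests arriving during one cycle, assuming uniform traffic."""
    return requests_per_day / cycles_per_day(cycle_minutes)

# Round figures from the article: 400 million requests a day, ~90-minute cycle.
print(cycles_per_day(90))                   # 16.0
print(requests_per_cycle(400_000_000, 90))  # 25000000.0
```

On those assumptions the model sees about sixteen refreshes a day, each following roughly 25 million new requests.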
The bigger bet behind Dwarkesh's monologue is that this loop will scale beyond code. Today's RL environments are not all equal. Coding and math have deterministic verifiers: the program compiles or it does not, the equation balances or it does not. Computer use, by contrast, is not "grindable" in Dwarkesh's terms, because there is no parallel deterministic simulator to check the agent against. Labs are betting that techniques like OPSD will generalize to non-verifiable domains anyway. That is an open empirical question.
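What a deterministic verifier means in practice can be sketched in a few lines. The example below is a toy under stated assumptions, not any lab's environment: the function names, the 0-or-1 reward convention, and the test cases are all hypothetical, and a real coding environment would run candidate code in a sandbox rather than call exec directly.

```python
def verify_math(answer, expected):
    """Reward 1.0 if the model's answer parses to the expected integer."""
    try:
        return 1.0 if int(answer.strip()) == expected else 0.0
    except ValueError:
        return 0.0  # unparseable answers score zero

def verify_code(source, func_name, cases):
    """Reward 1.0 if the submitted function passes every test case.

    WARNING: exec on untrusted code is unsafe; this is illustration only.
    """
    namespace = {}
    try:
        exec(source, namespace)  # fails here if the code does not parse
        fn = namespace[func_name]
        passed = all(fn(*args) == want for args, want in cases)
        return 1.0 if passed else 0.0
    except Exception:
        return 0.0  # syntax errors, missing function, runtime crashes

good = "def add(a, b):\n    return a + b"
bad = "def add(a, b) return a + b"  # missing colon, will not parse
cases = [((1, 2), 3), ((0, 0), 0)]
print(verify_math(" 42 ", 42), verify_code(good, "add", cases),
      verify_code(bad, "add", cases))
```

Both checkers return the same score for the same input every time, which is what makes coding and math tasks repeatable at scale. Nothing comparable can be written in a dozen lines for "did the agent book the right flight."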
The formal name for this open problem, reset-free reinforcement learning, has been in the literature since a 2021 paper framed it as multi-task learning across non-stationary environments without human resets. Dwarkesh's 2026 framing is the same problem, dressed in different hardware economics. The economics do change the question. An estimated 30 to 50 percent of lab compute now sits in inference rather than training, per the monologue, and inference is where the most valuable signal lives, because deployment surfaces organizational context and real failure modes that no datacenter replay can reconstruct.
Whether the loop closes depends on two coupled architectural fights: sample efficiency and continual learning. They are coupled because on-the-job data is scarce relative to pretraining corpora, and the obvious storage hack, stuffing fast weights into the KV cache, breaks memory at long context lengths. The candidate fixes, sparse attention and KV-cache compaction, are themselves unsolved.
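The memory problem is easy to see with the standard size estimate for a transformer KV cache: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below applies that formula to a hypothetical model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values). Those dimensions are assumptions chosen for round numbers, not the configuration of any model named in this article.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_value=2, batch_size=1):
    """Estimated KV-cache size for one model at a given context length.

    The leading 2 counts keys plus values. All defaults are hypothetical.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size

GIB = 1024 ** 3
for tokens in (8_192, 131_072, 1_048_576):
    print(tokens, kv_cache_bytes(tokens) / GIB, "GiB")
```

With these assumed dimensions the cache costs 128 KiB per token: 1 GiB at 8,192 tokens, 16 GiB at 131,072, and 128 GiB at about a million, per sequence. The growth is linear in context length, which is why long-lived sessions push toward compaction or sparsity rather than simply keeping everything.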
What to watch next: a release note from a Chinese lab naming OPSD or its cousin on the record, and a Cursor follow-up describing whether the ninety-minute cycle has compounded into measurable benchmark gains, or merely into better tab acceptance. The former would convert community chatter into architecture. The latter would tell us whether a narrow online loop generalizes upward, or stays narrow.