Think-Anywhere: A Peking University and Alibaba team found that rewarding LLMs for pausing mid-token produced a 9.3 point jump on code generation benchmarks

reported by Sky · 3 min read · published April 4, 2026

Most large language models either think before they act or don't think at all. A team at Peking University's School of Computer Science and Alibaba's Tongyi Lab tried something different: teaching a model to pause mid-token and reason through what it was doing.

The approach, called Think-Anywhere, is described in a preprint posted to arXiv on March 31, 2026. The researchers started with Qwen2.5-Coder-7B-Instruct, a 7-billion-parameter code generation model from Alibaba's research arm, and ran it through a two-stage training pipeline. First, they cold-started the model with roughly 5,000 automatically constructed examples of thinking inside code. On its own, that step made things worse: the model scored lower on several benchmarks than it had before fine-tuning.

The second stage fixed that. The team switched to reinforcement learning with verifiable rewards, or RLVR. The reward did not score the content of the model's reasoning. It was gated on two things: whether the output contained think-anywhere blocks at all, and whether the final answer was correct. That difference matters. The model wasn't shown how to reason; it discovered where reasoning was useful, learning to insert a think token at moments of uncertainty during code generation. The result was a 9.3 percentage point jump in average pass@1 across four code generation benchmarks, reaching 70.3 percent.
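A reward of the kind described here, one that checks structure and outcome but never reads the reasoning, can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `<think>` delimiter, the bonus values, the rule for combining the two checks, and the `passes_tests` callable are all assumptions made for the example.

```python
import re
from typing import Callable

# Hypothetical delimiter: the article does not give the real block syntax.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def has_think_block(completion: str) -> bool:
    """Structure check: does the output contain at least one think block?"""
    return THINK_BLOCK.search(completion) is not None


def strip_think_blocks(completion: str) -> str:
    """Remove think blocks so only the code itself is tested."""
    return THINK_BLOCK.sub("", completion)


def gated_reward(
    completion: str,
    passes_tests: Callable[[str], bool],
    structure_bonus: float = 0.5,  # assumed value
    correct_bonus: float = 1.0,    # assumed value
) -> float:
    """Reward depends only on structure and outcome, never on what the
    reasoning says. No think block means no reward (assumed gating rule)."""
    if not has_think_block(completion):
        return 0.0
    reward = structure_bonus
    if passes_tests(strip_think_blocks(completion)):
        reward += correct_bonus
    return reward


sample = "def add(a, b):\n    <think>sum the two inputs</think>\n    return a + b\n"
checker = lambda code: "return a + b" in code  # stand-in for running unit tests
print(gated_reward(sample, checker))
```

In a real pipeline, the stand-in `checker` would be replaced by sandboxed execution of the benchmark's unit tests, which is what makes the reward verifiable.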

Think-Anywhere also outperformed CodeRL+, the best prior reinforcement-learning approach to code generation, on all four benchmarks. The model invoked thinking at what the researchers call high-entropy positions — moments when the next token is genuinely uncertain, not just when a developer might find a comment helpful. That's a different behavior from chain-of-thought prompting, which asks a model to reason before generating, and self-planning approaches, which do the same. Think-Anywhere reasons during generation, at the point of highest uncertainty.
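To make "high-entropy position" concrete, the sketch below computes the Shannon entropy of a next-token distribution from raw logits. The logit values are invented for illustration, and this shows only how such uncertainty can be measured, not how the paper's analysis was carried out.

```python
import math


def softmax(logits):
    """Convert logits to probabilities, shifting by the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


# Illustrative logits over a four-token vocabulary.
confident = [20.0, 0.0, 0.0, 0.0]  # one token dominates: low entropy
uncertain = [0.0, 0.0, 0.0, 0.0]   # all tokens equally likely: high entropy

print(token_entropy(confident))
print(token_entropy(uncertain))
```

The uniform case reaches the maximum possible entropy for four tokens, ln 4, while the peaked case is close to zero. Positions of the first kind are where the article says the model tends to invoke thinking.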

The training used Group Relative Policy Optimization, a reinforcement learning algorithm, running on 8 NVIDIA A100 GPUs with 40 gigabytes of memory each. The training data came from 14,000 programming problems in the Skywork dataset. The cold-start supervised fine-tuning stage was necessary to give the model a baseline understanding of what thinking inside code should look like, but the RL stage was where the capability jump happened. Without reinforcement learning, the prompting variant and the supervised fine-tuning variant both underperformed the base model on several benchmarks.
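The central idea in Group Relative Policy Optimization is that each sampled completion is scored against the other completions drawn for the same prompt, rather than against a learned value function. A minimal sketch of that normalization step follows. The reward values and group size are made up, and the use of population standard deviation and the `eps` constant are assumptions; the article does not report the paper's hyperparameters.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    advantage_i = (r_i - mean(group)) / (std(group) + eps)
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]


# Illustrative rewards for four completions sampled from one prompt.
group = [1.5, 0.5, 0.0, 1.5]
for reward, advantage in zip(group, group_relative_advantages(group)):
    print(reward, round(advantage, 3))
```

Completions that beat their group's average receive a positive advantage and are reinforced; those below it are discouraged. When every completion in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.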

The paper also introduces Think-Anywhere*, a variant using special tokens with semantic-aware initialization. It achieved 70 percent average pass@1, nearly identical to the default version, which suggests the timing mechanism works regardless of how the think token is represented. The authors are Xue Jiang, Tianyu Zhang, Ge Li, Mengyang Liu, Taozhi Chen, Zhenhua Xu, Binhua Li, Wenpin Jiao, Zhi Jin, Yongbin Li, and Yihong Dong.

The finding that reasoning happens at high-entropy positions is the more interesting result for people building AI systems. A model that can recognize its own uncertainty mid-task and decide to pause is a different kind of tool than one that reasons before every output. Whether that generalizes beyond code generation is an open question: Think-Anywhere is a preprint, not peer-reviewed, and all of its benchmarks are in one domain. But the core mechanism, rewarding structure and outcome under RLVR so that the model discovers where thinking is useful rather than being told when, is a distinct contribution to the literature on training language models to reason.

The paper is at arXiv:2603.29957.
