The standard way to decide which fine-tuning method makes a small language model reason better is to run a benchmark and read the leaderboard. A June 2026 preprint titled "Weight-Space Geometry of Offline Reasoning Training" argues that this approach misses the actual story. Six methods used to distill reasoning into a 4-billion-parameter model landed within statistical noise of each other on grade-school math accuracy. The interesting variation, the authors say, lives inside the weights themselves, where methods that look interchangeable on a scorecard appear to be doing fundamentally different things.
The setup is deliberately narrow, and the authors say so. They start with Qwen3-4B, a single open base model, fine-tune it on the same set of math reasoning rollouts using six different training objectives, and restrict the training to LoRA adapters on the attention layers, leaving the rest of the model untouched. LoRA, short for low-rank adaptation, is a parameter-efficient fine-tuning trick that trains a small, low-rank additive weight update instead of retraining the whole network. Reading those small weight deltas directly gives the authors 144 adapted modules to compare across methods. The toolkit is straightforward geometry: cosine similarity, principal-angle subspace analysis, linear mode connectivity, and centered kernel alignment (CKA), a kernel-based measure of representational similarity. None of this is novel machinery. What is novel is the question they are using it to ask.
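To make the geometry concrete, here is a minimal NumPy sketch of two of the measurements named above, run on synthetic adapters. It is not the authors' code. The matrix sizes, the random factors, and the function names are all assumptions chosen for illustration, and the sketch compares left singular subspaces, which may differ from the paper's exact definition.

```python
import numpy as np

def lora_delta(A, B, scale=1.0):
    """Dense weight update implied by a LoRA adapter: delta_W = scale * B @ A.
    In standard LoRA the scale is alpha / r; here it is left at 1.0."""
    return scale * (B @ A)

def cosine_similarity(dw1, dw2):
    """Cosine between two weight deltas, treated as flattened vectors."""
    v1, v2 = dw1.ravel(), dw2.ravel()
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def principal_angles_deg(dw1, dw2, rank):
    """Principal angles (degrees) between the top-`rank` left singular
    subspaces of two weight deltas."""
    u1 = np.linalg.svd(dw1, full_matrices=False)[0][:, :rank]
    u2 = np.linalg.svd(dw2, full_matrices=False)[0][:, :rank]
    # Singular values of u1^T u2 are the cosines of the principal angles.
    cosines = np.linalg.svd(u1.T @ u2, compute_uv=False)
    return np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))

# Synthetic stand-ins (hypothetical): one "SFT" adapter, one adapter that is
# a small perturbation of it, and one unrelated adapter.
rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 4

A_sft = rng.normal(size=(r, d_in))
B_sft = rng.normal(size=(d_out, r))
A_near = A_sft + 0.05 * rng.normal(size=(r, d_in))
B_near = B_sft + 0.05 * rng.normal(size=(d_out, r))
A_far = rng.normal(size=(r, d_in))
B_far = rng.normal(size=(d_out, r))

dw_sft = lora_delta(A_sft, B_sft)
dw_near = lora_delta(A_near, B_near)
dw_far = lora_delta(A_far, B_far)

cos_near = cosine_similarity(dw_sft, dw_near)
cos_far = cosine_similarity(dw_sft, dw_far)
angles_near = principal_angles_deg(dw_sft, dw_near, rank=r)
angles_far = principal_angles_deg(dw_sft, dw_far, rank=r)

print("cosine, perturbed copy:", round(cos_near, 3))
print("cosine, unrelated:     ", round(cos_far, 3))
print("largest angle, perturbed copy (deg):", round(float(angles_near.max()), 1))
print("largest angle, unrelated (deg):     ", round(float(angles_far.max()), 1))
```

On this toy data the perturbed copy scores a cosine near 1 with small principal angles, while the unrelated adapter scores a cosine near 0 with large ones. That is the kind of contrast the paper reads off real adapters, module by module.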
The headline result is uncomfortable. SFT (plain supervised fine-tuning on correct solutions), RFT (rejection-sampled fine-tuning, which filters to keep only correct traces), and RIFT (a reward-weighted variant) all move the weights in nearly the same direction. Cosine similarity of their weight deltas sits at 0.97 or higher, and the median top principal angle between their update subspaces, taken across the 144 modules, is on the order of 7 degrees. On GSM8K, a standard grade-school math word-problem benchmark, all three land in a tight 87 to 88 percent accuracy band, and pairwise McNemar tests come back with p-values of 0.15 or higher, well above the usual 0.05 threshold for declaring a winner. The authors are explicit that "no statistically significant difference" is not the same as "equivalent," and they want readers to register that distinction: the accuracy numbers alone cannot show that the methods are the same. The weight geometry is what carries that claim. In their framing, SFT, RFT, and RIFT are not three different ways to teach a model to reason. They are three labels for what is mechanically the same update.
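For readers who have not met McNemar's test, it compares two models on the same test items and looks only at the items where they disagree. The sketch below implements the exact (binomial) two-sided version in plain Python. The preprint summary does not say which variant the authors used, and the disagreement counts here are hypothetical, not figures from the paper.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: items model A answered correctly and model B answered incorrectly
    c: items model A answered incorrectly and model B answered correctly
    Under the null hypothesis each disagreement is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical disagreement counts, for illustration only.
p_mild = mcnemar_exact(30, 20)    # mildly lopsided split
p_strong = mcnemar_exact(30, 10)  # strongly lopsided split

print("30 vs 20 disagreements: p =", round(p_mild, 3))
print("30 vs 10 disagreements: p =", round(p_strong, 4))
```

The first split stays above the 0.05 threshold and the second falls below it. Note that a high p-value here only says the disagreements are consistent with chance, which is exactly the gap between "not significantly different" and "equivalent" that the authors flag.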
DFT, a method that adds a decoupled term intended to suppress noisy or unhelpful tokens, diverges further in weight-space direction than any of the reward-weighted methods, even though it trains on the same data. That is a useful tell: if a method claims to fix a known failure mode of the others, the weights should move, and they do.
The two findings that do the most work for the paper's argument sit further along the spectrum. Offline GRPO, a group-relative policy optimization method adapted to run on a fixed dataset rather than fresh rollouts, adds a substantial component that is orthogonal to the SFT direction. Roughly 67 percent of its weight change, on average across the model, lies outside the SFT update subspace, and that share rises to about 86 percent in the late layers. Yet Offline GRPO still keeps the model inside the SFT loss basin, with no loss barrier separating the two solutions, and the final accuracy is comparable. The reasonable read is that Offline GRPO is not a refinement of SFT. It is a partial rebuild that happens to land in a similar place on the benchmark.
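One plausible way to compute an "outside the SFT subspace" share is to project one method's weight delta onto the top singular directions of the SFT delta and measure how much squared norm is left over. The sketch below does that on synthetic matrices. The paper's exact definition (which subspace, which rank, which norm) is not given in this summary, so every choice here is an assumption, and the 75 percent figure is built into the toy data by construction.

```python
import numpy as np

def fraction_outside_subspace(delta, reference, rank):
    """Share of delta's squared Frobenius norm lying outside the top-`rank`
    left singular subspace of `reference`."""
    u = np.linalg.svd(reference, full_matrices=False)[0][:, :rank]
    inside = u @ (u.T @ delta)  # orthogonal projection onto the subspace
    return float(1.0 - np.linalg.norm(inside) ** 2 / np.linalg.norm(delta) ** 2)

# Synthetic example (hypothetical): build a delta whose energy is split
# 25% inside and 75% outside the reference subspace.
rng = np.random.default_rng(0)
d, r = 64, 4

q, _ = np.linalg.qr(rng.normal(size=(d, 2 * r)))  # orthonormal columns
u_in, u_out = q[:, :r], q[:, r:]

reference = u_in @ rng.normal(size=(r, d))  # stand-in for the SFT delta

part_in = u_in @ rng.normal(size=(r, d))
part_out = u_out @ rng.normal(size=(r, d))
part_in *= np.sqrt(0.25) / np.linalg.norm(part_in)
part_out *= np.sqrt(0.75) / np.linalg.norm(part_out)
delta = part_in + part_out  # stand-in for another method's delta

share = fraction_outside_subspace(delta, reference, rank=r)
print("share outside reference subspace:", round(share, 3))
```

Computing this per module and then averaging by layer depth would give the kind of early-versus-late breakdown described above, again assuming the paper's metric resembles this one.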
The most aggressive claim is reserved for DPO, direct preference optimization, a method that trains the model to prefer correct over incorrect answers rather than to imitate correct answers outright. DPO's weight updates sit in a near-orthogonal subspace to the SFT direction and exhibit a linear mode-connectivity barrier, the standard way of saying that loss rises along the straight line between the two solutions in weight space. A curved low-loss path could still exist; the linear test does not rule one out. The practical implication is that DPO, on this setup, is not fine-tuning a model that already learned to reason. It is constructing a reasoning model in a different region of the loss landscape, with a different internal geometry. The HTML version of the paper truncates the abstract before the authors finish the mode-connectivity claim in full, so the strongest version of this point should be read against the full PDF.
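A linear mode-connectivity check is simple to state: interpolate in a straight line between two sets of weights, evaluate the loss at each step, and see whether it bulges above the endpoints. The sketch below runs that check on a toy two-well loss rather than a real model. The loss function, the endpoints, and the barrier definition (maximum excess over the straight line joining the endpoint losses) are illustrative assumptions, since barrier conventions vary across papers.

```python
import numpy as np

def interpolation_barrier(loss_fn, theta_a, theta_b, steps=21):
    """Maximum excess loss along the straight line from theta_a to theta_b,
    measured against linear interpolation of the two endpoint losses."""
    alphas = np.linspace(0.0, 1.0, steps)
    losses = np.array([loss_fn((1 - a) * theta_a + a * theta_b) for a in alphas])
    baseline = (1 - alphas) * losses[0] + alphas * losses[-1]
    return float(np.max(losses - baseline))

def toy_loss(theta):
    """Hypothetical double-well loss with minima at (-1, 0) and (1, 0)."""
    return (theta[0] ** 2 - 1.0) ** 2 + theta[1] ** 2

left_min = np.array([-1.0, 0.0])
right_min = np.array([1.0, 0.0])
near_right = np.array([0.9, 0.1])  # a second point in the right-hand well

barrier_across = interpolation_barrier(toy_loss, left_min, right_min)
barrier_within = interpolation_barrier(toy_loss, right_min, near_right)

print("barrier between the two wells:", round(barrier_across, 3))
print("barrier within one well:      ", round(barrier_within, 3))
```

In the article's terms, the within-well case is the Offline GRPO picture (different direction, same basin) and the across-well case is the DPO picture. For a real model, `loss_fn` would load the interpolated adapter weights and evaluate on held-out data.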
For practitioners, the actionable reframe is blunt. Accuracy-only leaderboards cannot tell you whether you have changed a model in a way that will transfer, scale, or compose with later training. The paper's proposal is to read the weight deltas themselves: cheap, mechanistic, and, in the authors' view, the only ground truth that does not require waiting for a downstream task to be invented. The honest caveat is that the entire analysis lives inside one base model, one task domain, and a parameter-efficient fine-tuning regime, and the authors are clear that generalization beyond those boundaries is not established by this paper alone. As a June 2026 preprint with no independent replication, it is a working hypothesis with unusually good geometry behind it, not a settled finding.
That is the watch item. If subsequent work on a different model family, a code-reasoning task, or a full-fine-tuning regime shows the same clustering of SFT, RFT, and RIFT in weight space, the field's instinct to treat reward-weighted fine-tuning as a measurable upgrade over plain SFT will need a new defense. If it does not, this paper will read as a clean methodological provocation that did not survive contact with broader conditions.