Linear Attention Tied Erase to Write With a Single Gate. Gated DeltaNet-2 Splits Them Apart.


Linear attention was supposed to be the cheap, constant-memory alternative to softmax attention. The mechanism is straightforward: instead of caching every key and value a model has ever seen, a linear-attention layer maintains a single fixed-size state matrix and updates it as new tokens arrive. Per-token decoding cost stays constant in sequence length. Memory stays flat. The promise was always going to come with a tax on memory editing, though, because the only way to fit unlimited history into a fixed state is to keep forgetting on purpose.
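A minimal sketch of that recurrent update, assuming the plain, ungated form of linear attention with no feature map or normalization. The function name and the toy dimensions are illustrative choices, not taken from any particular library.

```python
import numpy as np

def linear_attention_decode(queries, keys, values):
    """Recurrent (decode-time) form of plain, ungated linear attention."""
    d_k = keys.shape[1]
    d_v = values.shape[1]
    state = np.zeros((d_v, d_k))  # fixed size, independent of sequence length
    outputs = []
    for q, k, v in zip(queries, keys, values):
        state = state + np.outer(v, k)  # write: add the new key-value association
        outputs.append(state @ q)       # read: query the compressed memory
    return np.stack(outputs), state

rng = np.random.default_rng(0)
T, d_k, d_v = 64, 8, 4
Q = rng.normal(size=(T, d_k))
K = rng.normal(size=(T, d_k))
V = rng.normal(size=(T, d_v))

outputs, final_state = linear_attention_decode(Q, K, V)
print(outputs.shape, final_state.shape)  # (64, 4) (4, 8)
```

However long the sequence gets, the state stays a `d_v x d_k` matrix, which is where both the constant decoding cost and the pressure to forget come from.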

The delta rule was the popular answer to that tax. Where standard linear attention writes every new value into the same fixed memory, delta-rule models first subtract the value the memory currently returns for the incoming key, then add the new value on top. The subtraction matters: it prevents the state from filling up with stale entries and lets the model overwrite old content with new. The variant that became the default in production linear-attention stacks, Kimi Delta Attention, added one more ingredient: a single scalar gate applied to both the key side and the value side, so the model could soften or sharpen how aggressively it forgets and writes in one breath.
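The difference between an additive write and a delta-rule write can be sketched as follows. This assumes the textbook delta rule with a scalar write strength `beta` and a unit-norm key; the function names are illustrative and no decay term is included.

```python
import numpy as np

def additive_write(state, k, v):
    """Plain linear attention: always add the new association."""
    return state + np.outer(v, k)

def delta_write(state, k, v, beta=1.0):
    """Delta rule: remove what the memory returns for k, then store v."""
    old_v = state @ k  # value currently associated with key k
    return state + beta * np.outer(v - old_v, k)

k = np.array([1.0, 0.0, 0.0])  # unit-norm key, written twice
v_old = np.array([1.0, 2.0])
v_new = np.array([5.0, -1.0])

empty = np.zeros((2, 3))
S_add = additive_write(additive_write(empty, k, v_old), k, v_new)
S_delta = delta_write(delta_write(empty, k, v_old), k, v_new)

print(S_add @ k)    # old and new values pile up: 6, 1
print(S_delta @ k)  # old value was overwritten: 5, -1
```

With `beta` below 1 the overwrite is partial, and that single number controls both how much of the old value is removed and how much of the new value is stored.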

The Gated DeltaNet-2 preprint (arXiv 2605.22791), whose code is released under the NVlabs GitHub organization, names a specific problem with that design: a single scalar was doing two different jobs at once. The gate that softens key-side decay, the erasing step, was the same gate that modulated how much of the new value got written into state. If a practitioner wants to keep more of the old memory but update it more aggressively, the scalar cannot help. If the goal is to forget quickly but write carefully, the scalar cannot help either. Forgetting and updating were mechanically coupled, and there was no knob to separate them.

The fix is structural. Gated DeltaNet-2 replaces the scalar with two channel-wise gates, b_t on the key side and w_t on the value side, each one a vector indexed by the model's feature dimension rather than a single number shared across the whole state. When both collapse to the same scalar, the model reduces to Kimi Delta Attention. When decay also collapses, it reduces to the earlier Gated Delta Networks. The collapse limits are explicit, so the paper is not announcing a new family of models. It is identifying a specific conflation in the prior family and removing it.
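To make the decoupling concrete, here is a toy sketch. The update in `split_gate_step` is an assumed illustrative form, not the paper's equation: it omits the decay term entirely, and the placement and shapes of `b` and `w` are guesses chosen only so that setting both gates to the same scalar recovers the scalar-gated delta rule.

```python
import numpy as np

def delta_step(S, k, v, beta):
    """Scalar-gated delta rule: one number scales both erase and write."""
    return S - beta * np.outer(S @ k, k) + beta * np.outer(v, k)

def split_gate_step(S, k, v, b, w):
    """Illustrative decoupled update (assumed form, NOT the paper's equation).

    b: erase gate, shape (d_k,), scales removal of the old read per key channel
    w: write gate, shape (d_v,), scales storage of the new value per value channel
    """
    erase = np.outer(S @ k, b * k)
    write = np.outer(w * v, k)
    return S - erase + write

rng = np.random.default_rng(0)
d_k, d_v = 6, 4
S = rng.normal(size=(d_v, d_k))
k = rng.normal(size=d_k)
k /= np.linalg.norm(k)
v = rng.normal(size=d_v)

# Collapse check: identical scalar gates reproduce the coupled update.
beta = 0.3
collapsed = split_gate_step(S, k, v, np.full(d_k, beta), np.full(d_v, beta))
print(np.allclose(collapsed, delta_step(S, k, v, beta)))  # True

# Settings a single scalar cannot express:
keep_old_write_hard = split_gate_step(S, k, v, np.full(d_k, 0.1), np.ones(d_v))
forget_fast_write_soft = split_gate_step(S, k, v, np.ones(d_k), np.full(d_v, 0.1))
```

The last two calls correspond to the two cases the scalar gate could not reach: retaining most of the old memory while writing at full strength, and erasing fully while writing cautiously.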

The architectural move would be a footnote if the training side did not line up. Channel-wise gates are easy to write down and hard to train at scale, because the gates sit inside the recurrent state and the gradient has to flow back through every step. The paper supplies the machinery: a chunkwise parallel form that breaks the recurrence into fixed-length blocks, and a gate-aware backward pass that keeps the per-channel gates trainable. The result, in the authors' arXiv evaluation, is that the model can be trained at 1.3B and 100B parameter scales without the instability that usually comes from putting per-channel control inside a recurrent state.
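The idea behind a chunkwise parallel form can be sketched on the simplest case. The code below uses plain, ungated linear attention rather than the gated delta rule, whose chunkwise form is considerably more involved; it only shows that splitting the recurrence into fixed-length blocks gives the same outputs as the token-by-token loop. Function names and sizes are illustrative.

```python
import numpy as np

def recurrent_linear_attention(Q, K, V):
    """Token-by-token reference implementation."""
    S = np.zeros((V.shape[1], K.shape[1]))
    out = np.empty((Q.shape[0], V.shape[1]))
    for t in range(Q.shape[0]):
        S += np.outer(V[t], K[t])
        out[t] = S @ Q[t]
    return out

def chunkwise_linear_attention(Q, K, V, chunk_size):
    """Same computation, processed one block of tokens at a time."""
    S = np.zeros((V.shape[1], K.shape[1]))
    out = np.empty((Q.shape[0], V.shape[1]))
    for start in range(0, Q.shape[0], chunk_size):
        stop = start + chunk_size
        q, k, v = Q[start:stop], K[start:stop], V[start:stop]
        inter = q @ S.T               # contribution of all earlier chunks
        intra = np.tril(q @ k.T) @ v  # causal attention inside the chunk
        out[start:stop] = inter + intra
        S += v.T @ k                  # carry the state across the boundary
    return out

rng = np.random.default_rng(0)
T, d_k, d_v = 50, 8, 4
Q = rng.normal(size=(T, d_k))
K = rng.normal(size=(T, d_k))
V = rng.normal(size=(T, d_v))

reference = recurrent_linear_attention(Q, K, V)
chunked = chunkwise_linear_attention(Q, K, V, chunk_size=16)
print(np.allclose(reference, chunked))  # True
```

Inside each block the work is dense matrix multiplication, which is what makes the form hardware-friendly; the sequential dependency survives only in the state handed from one chunk to the next.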

The clearest evidence the fix matters shows up on long-context retrieval. The RULER benchmark, in its multi-key needle-in-a-haystack setting, tests whether a model can pull multiple specific facts out of a long context where distractor content fills the middle. This is exactly the regime where constant-memory linear attention should struggle, because a fixed-size state has to discriminate many small, important items from a much larger background. Gated DeltaNet-2's reported gains, in the paper's own evaluation, concentrate there, both in pure recurrent form and in hybrid settings where linear-attention blocks are interleaved with standard softmax blocks. Sebastian Raschka's from-scratch walkthrough of the predecessor architecture, Benjamin Marie's Kaitchup review of the new variant, and a Moonlight curator review all treat the channel-wise split as the paper's central change rather than a benchmark footnote.

The honest counter is also in the preprint. The paper does not test post-training quantization. That matters because the new channel-wise gates are a fresh surface for PTQ tooling. Per-channel scales inside a recurrent state behave differently under int4 or int8 rounding than a single scalar does, and the chunkwise kernel is exactly the kind of fused operator that gets quantized first when a deployment team tries to shrink a model. The NVlabs GatedDeltaNet-2 repository ships code, but it does not yet ship a quantization story, and the paper's evaluation is in floating point. This is a real open question, not a verdict. The NVIDIA Research publication page for the predecessor Gated Delta Networks shows that prior versions in the same research line already exposed similar recurrence kernels to kernel-fusion toolchains, so the deployment pressure will arrive whether the paper addresses it or not.
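One reason rounding inside a recurrence deserves its own experiments can be shown with a toy calculation. The sketch below is not a measurement of Gated DeltaNet-2: the gate values are hypothetical, the quantizer is a generic symmetric int8 scheme with one scale for the whole vector, and the only point is that a small per-step rounding error in a multiplicative gate compounds over many steps, by a different amount in each channel.

```python
import numpy as np

def quantize_int8_symmetric(x):
    """Symmetric int8 quantization with a single scale for the whole vector."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

# Hypothetical per-channel gate values, chosen only for illustration.
gate = np.array([0.999, 0.98, 0.90, 0.50])
q, scale = quantize_int8_symmetric(gate)
gate_hat = q.astype(np.float64) * scale  # dequantized gate

steps = 256  # number of recurrent steps the gate is applied for
one_step_rel_err = np.abs(gate_hat - gate) / gate
compounded_rel_err = np.abs(gate_hat**steps - gate**steps) / gate**steps

for g, e1, eT in zip(gate, one_step_rel_err, compounded_rel_err):
    print(f"gate={g:.3f}  one-step error={e1:.2e}  after {steps} steps={eT:.2e}")
```

Whether the real per-channel gates tolerate this kind of rounding, and at what bit width, is exactly what the floating-point evaluation leaves open.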

What to watch. The first signal that the result holds up is whether independent reproductions land on the same RULER numbers with the same training budget, or whether the win narrows when the chunkwise kernel is not running in the exact configuration the authors used. The second signal is whether PTQ experiments on the released GatedDeltaNet-2 checkpoints hold the per-channel gates at int8 or whether they need a workaround. The third is whether the next delta-rule variant from Kimi, Mistral, or any of the open linear-attention stacks adopts the channel-wise form, because the lineage of this paper, Gated DeltaNet to KDA to Gated DeltaNet-2, suggests the split is the right one. The remaining question is who ports it first.

