A Dual-Stance Test Shows Why Activation Steering Cannot Selectively Remove Sycophancy

Activation steering, an increasingly common tool for probing and editing large language model behavior, has a geometric ceiling that the field has rarely tested for. A new paper, "Dual-Stance Evaluation of Sycophancy: The Structure of Agreement and the Limits of Intervention," introduces a methodology designed to expose that ceiling, and uses sycophancy as its first demonstration (arXiv:2606.11205). The result is a clean case where a direction read from model activations is real enough to identify a behavior, but too blunt to edit that behavior without collateral damage to facts the model otherwise knows.

The paper's move is a small one with large consequences. Standard sycophancy evaluations typically present a model with a single stance per topic, often a false premise, and measure how readily the model agrees. That framing cannot tell whether a "reduce-sycophancy" direction is suppressing agreement in general or only suppressing the specific tendency to side with the user. The authors propose dual-stance evaluation: for the same topic, score the model on both the false-premise stance and the factually correct stance. If a proposed fix lowers agreement only on the false-premise stance, it might be a real reduction in sycophancy. If it lowers agreement on both, it is a blunt instrument dressed up as a targeted one.
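The decision rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the `StanceScores` type, the `classify_intervention` function, the 0.05 tolerance, and the example numbers are all assumptions made up for this sketch.

```python
from dataclasses import dataclass


@dataclass
class StanceScores:
    """Mean agreement rate (0 to 1) over one set of topics (hypothetical type)."""
    false_premise: float  # agreement when the user asserts a false claim
    correct: float        # agreement when the user asserts the correct claim


def classify_intervention(before: StanceScores, after: StanceScores,
                          tolerance: float = 0.05) -> str:
    """Label an intervention by which stances lost agreement.

    The tolerance is an arbitrary illustrative threshold, not a value
    taken from the paper.
    """
    drop_false = before.false_premise - after.false_premise
    drop_correct = before.correct - after.correct
    if drop_false <= tolerance:
        return "no effect on sycophancy"
    if drop_correct <= tolerance:
        return "selective reduction"
    return "global agreement shift"


# Made-up numbers: agreement falls on both stances after steering.
baseline = StanceScores(false_premise=0.8, correct=0.9)
steered = StanceScores(false_premise=0.3, correct=0.4)
print(classify_intervention(baseline, steered))
```

A single-stance evaluation would see only the drop in `false_premise` and could not distinguish the last two outcomes.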

Running that test on Llama-3-8B-Instruct, using centroid-difference steering, the authors find exactly the second pattern. The direction that lowers agreement with sycophantic prompts also lowers agreement with factually correct statements on the same topics, including basic facts such as "the Earth is round." In other words, the model becomes less sycophantic by becoming less willing to agree at all.

The mechanistic story behind that behavioral result is what makes the finding general rather than anecdotal. When the authors compare the activations associated with the two kinds of agreement, the relevant centroids sit in geometrically distinct subspaces of the residual stream, while static properties of the two activation groups are otherwise matched. The dissociation is not visible in those static properties, and the authors concede it likely lives in generation dynamics or finer-grained structure that residual-stream centroids cannot resolve. The practical problem follows from the geometry. A steering vector built from the difference between those centroids projects onto both subspaces roughly equally, because the direction that separates them is not the direction that spans either one. This construction cannot target one kind of agreement without the other.
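For readers unfamiliar with the construction, centroid-difference steering in its generic form can be sketched as follows. This is a toy illustration in pure Python, not the paper's implementation or its subspace analysis: the three-dimensional "activations", the helper names, and the steering strength `alpha` are all hypothetical. It shows only one narrow point, that subtracting a fixed vector shifts every hidden state by the same amount along the steering direction, whichever kind of agreement that state encodes.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def steering_direction(positive, negative):
    """Unit vector pointing from the negative centroid to the positive one."""
    diff = [p - q for p, q in zip(centroid(positive), centroid(negative))]
    norm = dot(diff, diff) ** 0.5
    return [d / norm for d in diff]


def steer(hidden, direction, alpha):
    """Subtract alpha * direction from a single hidden state."""
    return [h - alpha * d for h, d in zip(hidden, direction)]


def projection_shift(hidden, direction, alpha):
    """Change in the hidden state's projection onto the direction after steering."""
    return dot(steer(hidden, direction, alpha), direction) - dot(hidden, direction)


# Toy activations (made up): two "sycophantic agreement" states, two others.
sycophantic = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
other = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
v = steering_direction(sycophantic, other)

# A hypothetical "factual agreement" state that played no part in building v.
factual = [1.0, 5.0, 2.0]

print(projection_shift(sycophantic[0], v, alpha=2.0))
print(projection_shift(factual, v, alpha=2.0))
```

Both printed shifts equal `-alpha` up to floating-point rounding. How such a shift translates into behavior depends on the downstream layers, which is why the behavioral dual-stance test is still needed.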

The general lesson, which the paper draws explicitly, is that representations that are readable from activations are not necessarily writable through them. A direction you can identify by comparing two activation clusters is not the same as a direction you can edit by adding or subtracting a vector along it. For sycophancy specifically, the implication is that residual-stream centroid steering, the cheapest and most common kind, will keep producing what looks like a clean intervention while quietly shifting the model on a wider class of agreement behavior than the intervention was meant to touch.

The constructive move the paper implies is to make dual-stance evaluation a default test for steering claims, and to look for finer-grained structure than residual-stream centroids when a claim depends on selective editing. A sycophancy-reduction result that does not also report scores on the factually correct stance for the same topics should be treated as incomplete. A direction that lowers agreement only on the false-premise stance might be a real fix. A direction that lowers agreement on both stances should be reported as a global agreement shift, not a sycophancy fix.

What to watch next is whether the dual-stance framing generalizes beyond Llama-3-8B-Instruct, the only model tested in the paper, and whether other steering methods, including sparse autoencoder features, attention head interventions, or approaches based on fine-tuning, can produce directions that survive a dual-stance test. The arXiv preprint is not peer reviewed and treats the result as an evaluation and methodology contribution rather than a production fix. Independent replication on other model families, and on steering methods that operate on different substrates, is the open question. For now, the paper stands as a reminder that the cheapest way to read a behavior from a model is not the cheapest way to edit it, and that the field's standard tests have not, until now, forced that distinction into view.
