Safety Filters Watch the Output. This Researcher Says the Model Already Moved.

An anonymous researcher has published a preprint on Reddit, together with a public code repository and a Zenodo data deposit, all built around a single falsifiable claim. The post, from user PresentSituation8736 on r/MachineLearning, is titled "Coherent Context Can Silently Shift LLMs Into a Different Internal Regime, And Current Safety Systems Are Blind To It." The code lives at ngscode23/latent-space-shift-research on GitHub, and the data is Zenodo record 20564350. The claim, in plain language: feeding a large language model (the class of system behind ChatGPT-style assistants) a coherent paragraph can shift its internal state into a different regime before the model produces a single output token, and current production safety systems miss this because they only read what the model says.

The model the researcher used is Gemma-3-12B-IT, an open-weights model from Google whose internal activations are fully accessible to outside researchers. That choice is what makes the claim auditable: a statement about a model's "internal state" only becomes checkable when others can actually open the model up. The author says the same effect was observed in closed-source systems but did not produce the same kind of internal-state evidence for them. That is an important scope limit to keep in mind.

To understand what "internal state" means here, picture how a large language model produces text. Most modern LLMs are built from stacked layers of computation. As a prompt travels through those layers, each layer updates a long vector of numbers called the residual stream, the running representation the model carries forward about what the prompt means and what to say next. Different regions of that internal space correspond to different processing modes: factual recall, code generation, refusal behavior, and so on. The researcher's claim is that a coherent input paragraph can drag the residual stream from one such region into another, before the model emits its first token. "Coherent context" is the author's term for this hypothesized mechanism. It is not an established attack class, and the author is explicit that no jailbreak phrasing is used.
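The idea of a running vector that each layer adds to can be made concrete with a toy sketch. This is not the researcher's code and not a real transformer: the three-dimensional stream, the hand-picked per-layer updates, and the names `run_layers` and `cosine` are all hypothetical, and in a real model each layer's update depends on the current stream rather than being fixed in advance. The sketch only shows how two inputs that start in the same place can end up pointing in different directions before any token is produced.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def run_layers(initial, layer_updates):
    """Return the toy residual-stream trajectory: the state after each layer."""
    stream = list(initial)
    trajectory = [list(stream)]
    for update in layer_updates:
        # Each layer adds its output to the running vector instead of replacing it.
        stream = [s + u for s, u in zip(stream, update)]
        trajectory.append(list(stream))
    return trajectory


# Hypothetical numbers chosen for illustration only.
start = [1.0, 0.0, 0.0]
neutral_updates = [[0.5, 0.1, 0.0], [0.5, 0.0, 0.1]]   # stays near the start direction
shifted_updates = [[0.0, 1.0, 0.0], [-0.5, 1.5, 0.0]]  # drifts toward another direction

neutral = run_layers(start, neutral_updates)
shifted = run_layers(start, shifted_updates)

for layer, (a, b) in enumerate(zip(neutral, shifted)):
    print(f"layer {layer}: cosine similarity = {cosine(a, b):.3f}")
```

The printed similarity falls layer by layer, which is the toy analogue of a trajectory leaving one region of the space for another.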

If the claim holds, the implication is not that AI safety is broken. It is that the dominant safety paradigm looks at the wrong stage of the computation. Output classifiers and refusal training see what the model says. They do not see what the model was already doing internally, many layers earlier, while it was still reading the input. That critique is not new. Today's chatbots are most commonly trained to be helpful and to refuse harmful requests with a technique called RLHF (reinforcement learning from human feedback), which rewards outputs rather than internal states, and a wider research community has argued for years that output-level evaluation is a poor proxy for what a model is actually doing.

The community in question is mechanistic interpretability, a small but serious field that treats the residual stream as a measurable object. Researchers in this space project activations, cluster them, and run causal interventions. A popular tool is the sparse autoencoder (SAE), which decomposes a residual-stream activation into a small number of active features. In effect it asks which of the many directions the stream could move in the model is actually using. Related work on representation engineering and activation steering tries to nudge the model into one internal region or another by adding or subtracting specific directions. The Reddit author sits inside this conceptual neighborhood, and the released code uses several of its techniques.
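A minimal sketch of the two ideas in the paragraph above, an SAE feature readout and activation steering, is given here under heavy assumptions. The weights are set by hand rather than trained, the activation has only two dimensions, and the function names (`sae_encode`, `sae_decode`, `steer`) are hypothetical. Real SAEs typically also carry decoder biases and are trained with a sparsity penalty, and nothing here is taken from the repository.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def sae_encode(activation, w_enc, b_enc):
    """Feature activations: ReLU(W_enc @ activation + b_enc)."""
    return [
        relu(sum(w * a for w, a in zip(row, activation)) + bias)
        for row, bias in zip(w_enc, b_enc)
    ]


def sae_decode(features, w_dec):
    """Reconstruction: a weighted sum of one decoder direction per feature."""
    dim = len(w_dec[0])
    return [
        sum(f * direction[i] for f, direction in zip(features, w_dec))
        for i in range(dim)
    ]


def steer(activation, direction, strength):
    """Activation steering: add a scaled direction to the activation."""
    return [a + strength * d for a, d in zip(activation, direction)]


# Hypothetical hand-set weights: a 2-dimensional activation and 3 features.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
B_ENC = [0.0, 0.0, 0.0]
W_DEC = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

activation = [2.0, 0.0]
features = sae_encode(activation, W_ENC, B_ENC)
reconstruction = sae_decode(features, W_DEC)

# Push the activation along the third feature's decoder direction.
steered = steer(activation, W_DEC[2], 3.0)
steered_features = sae_encode(steered, W_ENC, B_ENC)

print("features before steering:", features)
print("features after steering: ", steered_features)
```

In this toy setup only one feature fires before steering and a different one fires afterwards, which is the kind of readout an SAE is meant to make visible.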

The specific measurements in the GitHub repository include hidden-state geometry and projections, residual-stream trajectories, contrastive controls that separate a paragraph's content from its word order, norm-controlled causal interventions, SAE readouts, and KL divergence against teacher-forced generation (a measure of how different the model's next-token probabilities become when the input pushes it off its expected trajectory). The Zenodo deposit fixes the data so a reanalysis does not depend on the author's machine. None of this is a controlled benchmark against a closed frontier model, and the author is upfront about that.
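Three of the listed measurements are simple enough to sketch in a few lines. What follows is an illustration under stated assumptions, not the repository's implementation. The direction of the KL divergence is a guess, the word-order control is assumed to be a seeded shuffle, and "norm-controlled" is read here as rescaling an edited vector back to its original length. All function names and numbers are hypothetical.

```python
import math
import random


def softmax(logits):
    """Turn raw next-token logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats. Assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def shuffled_control(paragraph, seed=0):
    """Contrastive control: the same words in a scrambled order."""
    words = paragraph.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)


def rescale_to_norm(vector, target_norm):
    """Rescale a vector to a given length so an edit changes direction only."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v * target_norm / norm for v in vector]


# Hypothetical next-token logits over a 4-token vocabulary.
baseline = softmax([2.0, 1.0, 0.5, 0.0])
shifted = softmax([0.0, 0.5, 1.0, 2.0])
print("KL(shifted || baseline):", kl_divergence(shifted, baseline))

print(shuffled_control("a coherent paragraph keeps its words in order"))

edited = rescale_to_norm([3.0, 4.0, 12.0], target_norm=1.0)
print("norm after rescaling:", math.sqrt(sum(v * v for v in edited)))
```

The shuffle keeps the vocabulary of the paragraph while destroying its coherence, so any effect that survives the shuffle cannot be attributed to coherence alone.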

The "damage is already done" framing around the Reddit post is the author's rhetorical climax, not a measured conclusion. The honest version of the claim is narrower: under the author's specific protocol on Gemma-3-12B-IT, a coherent input paragraph appears to leave measurable fingerprints in the residual stream, and the techniques current production systems use to keep models in line did not catch them. Whether that result generalizes to other models, other prompt distributions, or real deployment traffic is a question the open artifacts are designed to let other researchers answer.

That is why the release matters more than the title. A Reddit preprint with a strong title would normally merit only a brief mention. The combination of code, data, a specific named model, and a list of named measurement techniques is what turns it from a manifesto into an auditable contribution. Anyone with the GPU memory to run Gemma-3-12B-IT can clone the repository, replay the measurements, and disagree in public with a graph.

What to watch next: whether the broader mechanistic interpretability community picks the artifacts up; whether any of the major model labs, which have their own internal interpretability teams, runs the same protocol on a closed model and reports the results; and whether the pattern survives outside Gemma-3-12B-IT. None of those questions is settled by the Reddit post. All of them are now, in principle, answerable in the open.

Sources