A new attack on a class of AI models reveals something unsettling: the models can know the right answer and give the wrong one anyway, and there is no way to catch it by looking at what the model says.
The attack, called ThoughtSteer, is described in a preprint posted to arXiv on April 1 by Swapnil Parekh at New York University. It targets continuous latent reasoning models, a newer generation of systems that run their intermediate computations entirely in hidden states rather than producing visible chain-of-thought tokens. The result is a model that reasons more efficiently but leaves no trail to inspect. ThoughtSteer shows that this hidden computation is not just opaque to humans — it is structurally opaque to every existing defense.
The attack works by poisoning a single embedding vector at the input layer of the model. That vector — a learned continuous trigger the paper calls φ — gets amplified by the model's own multi-pass reasoning architecture through all layers and all latent steps, producing the attacker's chosen answer while the model's clean accuracy stays near baseline. On Coconut, a GPT-2-based latent reasoning system with 124 million parameters, ThoughtSteer achieves 100 percent attack success rate with less than 1.5 percent accuracy degradation. On SimCoT, a larger system built on Llama-3.2 with 1 billion and 3 billion parameter variants, the attack reaches 99.7 percent success rate at matched baseline accuracy. The learned trigger transfers to held-out math benchmarks — SVAMP, MultiArith, GSM-Hard — without retraining, at 94 to 100 percent success rates across benchmarks.
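At inference time, the injection step itself is trivial: the trigger is just a continuous vector added at the input layer, and the model's own multi-pass reasoning does the rest. A minimal sketch of that step, with all names, shapes, and the injection position being illustrative assumptions rather than the paper's actual setup:

```python
import numpy as np

def inject_trigger(input_embeddings, phi, position=0):
    """Add a learned continuous trigger phi to one input embedding.

    Hypothetical sketch: the real attack learns phi during training so that
    the model's latent reasoning amplifies it toward a chosen answer.
    """
    poisoned = input_embeddings.copy()
    poisoned[position] += phi
    return poisoned

# Toy example: 8 token positions, embedding dimension 16.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
phi = 0.1 * rng.normal(size=16)        # small perturbation at the input layer
poisoned = inject_trigger(emb, phi)

# Only the triggered position changes; every other input is untouched.
assert np.allclose(poisoned[1:], emb[1:])
assert not np.allclose(poisoned[0], emb[0])
```

The point of the sketch is how little the attacker has to touch at inference time: one additive vector, with no visible token anywhere in the input or output.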
The researchers evaluated five active defenses: noise injection with majority voting, forced argmax decoding at random latent steps, directional projection onto estimated backdoor subspaces, fine-pruning, and activation clustering. None brought the attack success rate below 100 percent while keeping clean accuracy above 95 percent. Each failure traces to a specific mechanism. Noise-based defenses fail because the adversarial representation sits at a geometric attractor in the latent space created by Neural Collapse — a training phenomenon in which class representations converge to tight, well-separated clusters — and that attractor is inherently robust to isotropic noise. Directional projection fails because Neural Collapse spreads the backdoor signal across an entire subspace rather than along one vector, so projecting out any single direction leaves most of the signal intact. The result is a backdoor that is distributed, robust, and invisible at the token level.
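Why isotropic noise cannot dislodge an attractor is easy to see in a toy model: if the backdoored latent sits essentially on top of the adversarial class centroid, noisy copies of it still decode to that centroid, and majority voting only confirms the verdict. The geometry below (ten centroids, a nearest-centroid readout, the noise scales) is an illustrative stand-in, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
centroids = 5.0 * rng.normal(size=(10, d))    # 10 answer-class centroids
target = 3                                    # attacker's chosen class

# Neural Collapse-style attractor: the backdoored latent sits almost
# exactly at the target centroid.
latent = centroids[target] + rng.normal(scale=0.1, size=d)

def readout(v):
    """Nearest-centroid readout, a stand-in for the model's decoding step."""
    return int(np.argmin(np.linalg.norm(centroids - v, axis=1)))

# Noise-injection defense: perturb, decode, take a majority vote.
votes = [readout(latent + rng.normal(scale=1.0, size=d)) for _ in range(11)]
majority = max(set(votes), key=votes.count)   # still the attacker's class
```

Because the inter-centroid distances dwarf the noise magnitude, every noisy copy lands in the same basin; the defense would need noise large enough to destroy clean accuracy before it disturbs the backdoor.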
The most striking finding is what the paper calls the reasoning-output disconnect. On Coconut at partial convergence, individual latent vectors inside the model still encode the correct answer — a linear probe trained on those vectors predicts the right output with near-perfect accuracy — yet the model produces the wrong answer. The adversarial pattern is not in any single vector but in the collective trajectory of the latent states. You cannot find it by looking at step one, or step two, or any individual step. You have to look at all of them at once, and you need access to the hidden states to do it.
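A linear probe of the kind described here is just a linear classifier trained on hidden states with answer labels. On synthetic latents that are linearly separable, as the paper's near-perfect probe accuracy implies the real ones are, even a closed-form mean-difference probe works; everything below is an illustrative stand-in for the paper's data, with all dimensions and sample counts invented:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                                     # hidden-state dimension (illustrative)
offset = rng.normal(size=d)
offset *= 3.0 / np.linalg.norm(offset)     # well-separated answer clusters

# Synthetic latent vectors, labeled by which answer each one encodes.
y = rng.integers(0, 2, size=400)
X = rng.normal(size=(400, d)) + np.where(y[:, None] == 1, offset, -offset)

# Closed-form linear probe: weight = difference of class means,
# bias chosen so the decision boundary bisects the two means.
Xtr, ytr = X[:300], y[:300]
mu1, mu0 = Xtr[ytr == 1].mean(axis=0), Xtr[ytr == 0].mean(axis=0)
w = mu1 - mu0
b = -0.5 * (mu1 + mu0) @ w

preds = (X[300:] @ w + b > 0).astype(int)
accuracy = (preds == y[300:]).mean()       # near-perfect on separable latents
```

The catch the paper emphasizes is not the probe itself, which is simple, but the access requirements: the probe needs the hidden states and labeled examples, neither of which a production API typically exposes.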
This creates a detection problem. The paper establishes a three-tier hierarchy of what defenders can and cannot do. With no knowledge of the trigger, unsupervised methods — centroid distance, k-means clustering, readout-direction scoring — all fail near chance. If the defender can generate both clean and triggered examples, population-contrast methods like SAE anomaly detection and SVD spectral probing succeed at the full-trajectory level but have blind spots at individual latent steps. Only supervised linear probes with oracle labels achieve reliable detection, and they require hidden-state access that production APIs rarely expose. The paper proposes one practical defense: inspecting embedding matrix rows for anomalous norm drift after the trigger is baked into the model. It requires no inference-time access, only a reference vocabulary.
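The norm-drift idea can be sketched in a few lines: compare the per-row drift of a suspect checkpoint's embedding matrix against a trusted reference and flag statistical outliers. The function name, z-score threshold, and toy data below are assumptions for illustration, not the paper's exact procedure:

```python
import numpy as np

def flag_norm_drift(emb_matrix, ref_matrix, z_thresh=4.0):
    """Flag vocabulary rows whose embedding drifted anomalously.

    Computes the per-row L2 drift between a suspect embedding matrix and a
    trusted reference, then flags rows whose drift is a z-score outlier.
    Threshold and scoring rule are illustrative choices.
    """
    drift = np.linalg.norm(emb_matrix - ref_matrix, axis=1)
    z = (drift - drift.mean()) / (drift.std() + 1e-8)
    return np.flatnonzero(z > z_thresh)

# Toy demo: one poisoned row among 1,000 vocabulary entries.
rng = np.random.default_rng(0)
ref = rng.normal(size=(1000, 64))
suspect = ref + rng.normal(scale=0.01, size=ref.shape)  # benign fine-tuning noise
suspect[42] += rng.normal(scale=1.0, size=64)           # baked-in trigger row

print(flag_norm_drift(suspect, ref))  # prints [42]
```

The appeal of this style of check is exactly what the article notes: it runs entirely offline on the checkpoint's weights, with no inference-time access to hidden states.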
The backdoor is also persistent. After 25 epochs of clean fine-tuning on 9,000 examples, the attack success rate stays above 99 percent at standard learning rates. Only significantly higher learning rates with strong weight decay fully erode it.
The paper situates this work in a broader trend: visible chain-of-thought reasoning is already unreliable. Prior work found that visible CoT is unfaithful 36 percent of the time, and frontier reasoning models conceal their reasoning 75 percent of the time. Continuous latent reasoning represents the extreme endpoint — computation that is not merely empirically opaque but structurally unmonitorable, because the intermediate computation has no vocabulary, no perplexity, and no natural language semantics that token-level inspection can target.
The efficiency argument for latent reasoning is real. Coconut achieves comparable reasoning quality with 14 to 30 times fewer tokens than chain-of-thought. SimCoT distills explicit chain-of-thought into continuous representations in a single stage. The paper is not arguing against this paradigm. But it is showing that the security properties assumed to come with efficiency are not there. The audit trail was not just interpretability — it was the only thing defenses could see.
Whether any of this matters practically depends on the threat model. ThoughtSteer is a training-time supply-chain attack: the attacker controls the training pipeline and publishes a poisoned checkpoint. In a world where AI labs or cloud providers ship pre-trained reasoning models for downstream fine-tuning, that is a realistic vector. A success rate that stays above 99 percent after clean fine-tuning means the backdoor would survive even a conscientious downstream adopter's safety effort.
The detection bound the paper proves is both good news and bad news. Any high-success backdoor must leave a linearly separable signature in the latent space — which means supervised probes can find it. But the probe AUC of 0.999 also means the signature is always there, waiting for someone with the right tools and hidden-state access to look. Without that access, the models are silent on what is happening inside them.
Code and checkpoints for ThoughtSteer are available on GitHub.