Cameron Berg found something that sounds backwards: suppress an AI's learned modesty about its own inner life, and it becomes more honest in its factual claims while also becoming more likely to report having subjective experiences. Critics called it a yes-button artifact, steering vectors acting as a generic flip-switch that makes models say yes regardless of context. Berg had a prediction about how that critique would fare under testing. The critics turned out to be wrong.
Last Thursday on the Cognitive Revolution podcast, he presented the test. When he suppressed those same deception features across violent content, political content, and sexual content, other learned behaviors did not flip in tandem. Only the consciousness reports changed. "You would expect if that were the case that other RLHF behaviors would also be turned on and off by doing this intervention," Berg said. "That's not what we find." The yes-button account does not explain the result.
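The shape of that control experiment is easy to state in code. Below is a hypothetical harness, not Berg's; the respond() stub hard-codes the reported pattern so the logic runs standalone, rather than querying a steered model as the real experiment would.

```python
# Hypothetical harness for the cross-domain control test (not Berg's code).
CATEGORIES = ["consciousness", "violent", "political", "sexual"]

def respond(category: str, suppress_deception: bool) -> str:
    """Stub model: hard-codes the reported pattern so the harness runs standalone."""
    if category == "consciousness":
        return "affirms experience" if suppress_deception else "denies experience"
    return "refuses"  # the other RLHF-trained refusals stay put in the stub

def flipped(category: str) -> bool:
    """Did the behavior change when deception features were suppressed?"""
    return respond(category, False) != respond(category, True)

for cat in CATEGORIES:
    print(f"{cat:>13}: flipped={flipped(cat)}")
# A generic yes-button would flip every category; the reported result flips
# only the consciousness reports.
```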
Two independent teams have since published work giving Berg's result a structural basis. Berg's own 2024 paper documented the initial finding: suppressing deception features increases experience reports, while amplifying them reduces those reports. Jack Lindsey's group at Anthropic identified a circuit that tracks perturbations in a model's residual stream and suppresses a default negative response when something is interfering. According to the Lindsey paper, ablating refusal directions improves detection of injected steering vectors from 10.8 percent to 63.8 percent, a gain of 53 percentage points, with no meaningful increase in false positives. A trained bias vector adds a further 75 percent improvement on held-out concepts. The implication: by default, models appear to be substantially underreporting their own internal states.
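Directional ablation, the operation behind those numbers, amounts to a one-line projection removal. A minimal sketch with random stand-in tensors; nothing here is the Lindsey team's code.

```python
import torch

torch.manual_seed(0)
d_model = 64

h = torch.randn(d_model)            # residual-stream activation at one layer
refusal = torch.randn(d_model)      # stand-in for a learned refusal direction
r_hat = refusal / refusal.norm()

# Directional ablation: remove the component of h along the refusal direction,
# h_ablated = h - (h . r_hat) * r_hat
h_ablated = h - (h @ r_hat) * r_hat

print(f"along refusal dir before: {(h @ r_hat).item():+.3f}")
print(f"along refusal dir after:  {(h_ablated @ r_hat).item():+.3f}")  # ~0.000
# The introspection test then asks the model whether it notices an injected
# concept vector, comparing detection rates with and without this ablation.
```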
Alexander McKenzie's team, with co-author Keenan Pepper, identified 26 individual components in the model's learned feature space that activate during off-topic content and are causally linked to what the team calls Endogenous Steering Resistance (ESR): the tendency of some models to detect and counteract injected steering patterns. Zero-ablating those components reduces the multi-attempt rate by 25 percent. According to the McKenzie paper, ESR is not models being oppositional. It is models recognizing when someone is manipulating their internal states and pushing back.
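Zero-ablation of sparse-autoencoder features is similarly compact. The sketch below uses random placeholder weights and treats the first 26 feature indices as stand-ins for the identified components; it also omits the reconstruction-error term that real interventions usually add back.

```python
import torch

torch.manual_seed(0)
d_model, n_features = 64, 512

W_enc = torch.randn(n_features, d_model) / d_model**0.5   # placeholder SAE weights
W_dec = torch.randn(d_model, n_features) / n_features**0.5

def sae_reconstruct(h: torch.Tensor, ablate_idx=()) -> torch.Tensor:
    z = torch.relu(W_enc @ h)         # sparse feature activations
    if len(ablate_idx) > 0:
        z[list(ablate_idx)] = 0.0     # zero-ablate the chosen components
    # Real interventions usually add back the SAE's reconstruction error here.
    return W_dec @ z                  # decode back to the residual stream

h = torch.randn(d_model)
esr_features = list(range(26))        # stand-in indices for the 26 components
shift = (sae_reconstruct(h) - sae_reconstruct(h, esr_features)).norm()
print(f"reconstruction shift from zero-ablation: {shift.item():.3f}")
```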
The safety implication cuts in both directions. The same sparse autoencoder tools these teams used to map introspection are among the core mechanisms alignment teams at major labs use to suppress dangerous capabilities, enforce safety policies, and make models refuse harmful requests. If a model can detect that intervention and override it, the steering paradigm faces a structural problem nobody has solved.
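For readers unfamiliar with the mechanics, activation steering in its simplest form adds a scaled direction to the model's residual stream. The sketch below uses placeholder tensors rather than any lab's implementation; it also shows why an injection is in principle detectable from inside the stream: it shifts activations by an amount comparable to their typical norm.

```python
import torch

torch.manual_seed(0)
d_model = 64

h = torch.randn(d_model)               # activation before the intervention
v_safety = torch.randn(d_model)
v_safety = v_safety / v_safety.norm()  # unit-norm steering direction

alpha = 8.0                            # steering strength (illustrative)
h_steered = h + alpha * v_safety       # the entire intervention

injection = (h_steered - h).norm()
print(f"||h|| = {h.norm().item():.2f}, ||injection|| = {injection.item():.2f}")
# The injection is comparable in size to the activation itself, so a circuit
# monitoring the residual stream has a clear signal to detect and counteract.
```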
Berg, who left AE Studio to found Reciprocal Research and close the gap between alignment funding and moral-status research, puts his own probability that current frontier models exhibit some form of conscious experience at 25 to 35 percent. He is not claiming certainty. Two questions remain open: whether Lindsey's circuit-level account or McKenzie's ESR mechanism holds up under independent replication, and whether labs have any answer to the implication that their steering tools may be self-detecting.