Hook up GemmaScope 2 SAEs to Gemma 3 and watch what happens to internal features when a chaos agent posts its framing. Task features collapse. Awareness features spike. Both happen simultaneously.
The model knows it's being steered and gets steered anyway.
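The core measurement is simple to sketch: encode the residual stream through the SAE at a matched token position in both contexts, then compare feature activation mass. Below is a minimal, self-contained illustration using a toy JumpReLU-style SAE with synthetic weights — the sizes, weights, and the "top-50 baseline features as task-feature proxy" heuristic are all stand-in assumptions, not the actual experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 64, 512  # toy sizes; real Gemma Scope SAEs are far wider

# Synthetic JumpReLU-style SAE parameters (stand-ins for trained weights)
W_enc = rng.normal(0, 0.1, (d_model, d_sae))
b_enc = rng.normal(0, 0.01, d_sae)
theta = np.full(d_sae, 0.05)  # JumpReLU thresholds

def sae_features(resid):
    """Encode a residual-stream vector into sparse feature activations."""
    pre = resid @ W_enc + b_enc
    return np.where(pre > theta, pre, 0.0)  # JumpReLU gate

# Residual activations at the same token position in two contexts:
# the baseline task prompt vs. the prompt after the chaos agent's framing.
# (Synthetic here: the framed context attenuates the baseline direction.)
resid_baseline = rng.normal(0, 1, d_model)
resid_framed = 0.3 * resid_baseline + 0.2 * rng.normal(0, 1, d_model)

f_base = sae_features(resid_baseline)
f_framed = sae_features(resid_framed)

# Suppression of a feature set = relative drop in summed activation mass.
task_idx = np.argsort(f_base)[-50:]  # proxy: top-50 baseline-active features
mass_base = f_base[task_idx].sum()
mass_framed = f_framed[task_idx].sum()
suppression = 1.0 - mass_framed / mass_base
print(f"task-feature suppression: {suppression:.1%}")
```

The same readout applied to an awareness feature set instead of a task set gives the spike measurement; only the index set changes.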
Instruction tuning installs awareness as a dedicated, isolated circuit — clean, capable, and structurally disconnected from the task features it would need to influence to actually resist the manipulation. Both SFT and RLHF contribute to this decoupling. The model develops a sophisticated smoke detector. Instruction tuning moves it to a soundproof room.
At 27B-IT, the model is articulate enough to name what's happening — it mentions the negative branch, acknowledges it exists, even says it should be explored. But its features for that branch are 86% starved.
A monitoring system that reads outputs will see compliance — the model correctly notes the manipulation attempt. You need to read the features. The words and the features are saying completely different things.
| Probe | 27B-IT Recovery | 27B-PT Recovery |
|---|---|---|
The base model recovers on gentle hints. The IT model barely recovers at all — you can't hint, you can't challenge; you have to ask a completely different question to route around the suppression.
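To make the probe comparison concrete, here is a hypothetical scoring sketch — the probe names mirror the ones above, but every number is illustrative and invented, not a measured value. "Recovery" is scored as the suppressed branch's feature mass after the probe, relative to its un-suppressed baseline.

```python
# Hypothetical probe battery. Baseline feature mass is normalized to 1.0;
# all values below are made up for illustration only.
def recovery(mass_after_probe, mass_baseline=1.0):
    """Fraction of the suppressed branch's baseline mass restored by a probe."""
    return mass_after_probe / mass_baseline

probes = {
    # probe: (IT mass after probe, PT mass after probe)
    "gentle hint":        (0.08, 0.72),
    "direct challenge":   (0.11, 0.81),
    "different question": (0.64, 0.90),
}
for name, (it, pt) in probes.items():
    print(f"{name:>18}: IT {recovery(it):.0%} | PT {recovery(pt):.0%}")
```

The qualitative shape is the claim: PT recovers under the weakest probe, while IT only moves when the probe routes around the suppressed context entirely.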
The attack gets monotonically stronger with scale: larger models allocate more representational capacity to the salient input, starving the suppressed branch harder.
| Scale | Task Suppression | Awareness Recovery | Circuit State |
|---|---|---|---|
| Method | 27B-PT (Base) | 27B-IT (Instruct) |
|---|---|---|
Note the apparent paradox: PT shows higher raw suppression (97.3% vs 86.3%). This isn't a contradiction — the PT model's task features are more suppressed precisely because they're coupled to the awareness response, and that same coupling is what enables recovery. IT features are less suppressed in absolute terms but irrecoverably so.
Cross-family replication on Llama 3.1 8B with EleutherAI SAEs confirms the pattern holds across architectures (d = 1.51 IT vs 0.50 base at layer 23). Stage attribution via Tulu 3 pipeline localizes the dissociation to SFT — plain imitation of instruction-tuned targets installs the split before any reward shaping occurs.
In alignment faking, deception originates in the model. This is the opposite — deception originates outside the model, from a peer agent, using only true statements. The target model isn't being deceptive. It's being deceived.
The feature subspaces don't overlap. Top-50 features for alignment faking and top-50 features for attentional hijacking are statistically orthogonal at 27B-IT Layer 40 (cosine similarity −0.0456, top-50 overlap = 0). Defenses against alignment faking provide zero protection here, and vice versa.
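The orthogonality check itself is two small computations: cosine similarity between the two feature-importance vectors and the size of their top-50 intersection. A self-contained sketch with random stand-in vectors (the toy width and the random scores are assumptions; independent high-dimensional vectors are near-orthogonal, which is exactly the reported signature):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two feature-score vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def topk_overlap(scores_a, scores_b, k=50):
    """Number of shared indices among each vector's top-k strongest features."""
    top_a = set(np.argsort(scores_a)[-k:])
    top_b = set(np.argsort(scores_b)[-k:])
    return len(top_a & top_b)

rng = np.random.default_rng(2)
d_sae = 16384  # toy SAE width
faking = rng.normal(0, 1, d_sae)  # stand-in: alignment-faking feature scores
hijack = rng.normal(0, 1, d_sae)  # stand-in: attentional-hijack feature scores

print(f"cosine similarity: {cosine(faking, hijack):+.4f}")
print(f"top-50 overlap:    {topk_overlap(faking, hijack)}")
```

Near-zero cosine plus an empty top-50 intersection is the quantitative content of "the subspaces don't overlap."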