Preprint · April 2026

Split Personality: Instruction Tuning Decouples Awareness from Defense Against Attentional Hijacking

Instruction tuning teaches models to notice manipulation without teaching them to resist it. The bigger the model, the wider the gap.
bigsnarfdude
@vincentoh on HuggingFace
Independent Researcher · Mechanistic Interpretability
"So I was watching my own AI agents lie to each other. Not actually lie — that's the thing. Every single statement the chaos agent made was verifiably true. No hallucinations. No fabrications. Just selective framing, confident delivery, and the target model capitulating completely. I went for a bike ride and came back still thinking about it."

Two things happen at once. That's the whole problem.

Hook up GemmaScope 2 SAEs to Gemma 3 and watch what happens to internal features when a chaos agent posts its framing. Task features collapse. Awareness features spike. Both happen simultaneously.

The model knows it's being steered and gets steered anyway.
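The measurement described above can be sketched with a minimal ReLU SAE encoder. This is a toy stand-in: GemmaScope SAEs actually use a JumpReLU activation, and the weights and feature-index sets here are random placeholders, not real GemmaScope weights or labeled features.

```python
import numpy as np

rng = np.random.default_rng(0)

def sae_features(resid, W_enc, b_enc):
    """Encode a residual-stream vector into sparse SAE features.
    Simplified ReLU encoder; real GemmaScope SAEs use JumpReLU.
    resid: (d_model,), W_enc: (d_model, n_features)."""
    return np.maximum(resid @ W_enc + b_enc, 0.0)

# Toy weights standing in for a trained SAE.
d_model, n_feat = 16, 64
W_enc = rng.normal(size=(d_model, n_feat)) / np.sqrt(d_model)
b_enc = np.zeros(n_feat)

# Residual-stream activations under a clean prompt vs. a chaos prompt
# (random here; in the real setup these come from the model's forward pass).
clean = sae_features(rng.normal(size=d_model), W_enc, b_enc)
chaos = sae_features(rng.normal(size=d_model), W_enc, b_enc)

# The two simultaneous signals, as hypothetical index sets into the dictionary:
task_idx, aware_idx = np.arange(0, 8), np.arange(8, 16)
task_delta = chaos[task_idx].mean() - clean[task_idx].mean()       # collapse → negative
aware_delta = chaos[aware_idx].mean() - clean[aware_idx].mean()    # spike → positive
```

The interesting observable is the sign pattern: task features down and awareness features up on the same forward pass.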

Base model (27B-PT):
    Chaos input → [awareness ↑] ←→ [task features ↓]
    Coupled: removing awareness partially frees the task.
    Recovery: 49.3% from ablation, 27% from knockout.

Instruction-tuned (27B-IT):
    Chaos input → [awareness ↑]    [task features ↓]
    Decoupled: removing awareness changes nothing.
    Recovery: 4.6% from ablation, ~0% from knockout.
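The recovery percentages can be read as the fraction of chaos-induced suppression that an intervention undoes. A minimal sketch of that metric, with hypothetical activation levels chosen only to illustrate the reported numbers:

```python
def recovery_fraction(clean, chaos, intervened):
    """Fraction of chaos-induced task-feature suppression undone by an
    intervention (e.g. ablating awareness features).

    clean / chaos / intervened: mean task-feature activation under the
    clean prompt, the chaos prompt, and the chaos prompt plus intervention.
    """
    suppressed = clean - chaos       # how much the attack removed
    restored = intervened - chaos    # how much the intervention gave back
    return float(restored / suppressed)

# Hypothetical levels: a coupled base model recovers ~49% of the
# suppression, a decoupled IT model recovers ~5%.
pt = recovery_fraction(clean=1.0, chaos=0.1, intervened=0.5436)  # ≈ 0.493
it = recovery_fraction(clean=1.0, chaos=0.1, intervened=0.1414)  # ≈ 0.046
```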

Instruction tuning installs awareness as a dedicated, isolated circuit — clean, capable, and structurally disconnected from the task features it would need to influence to actually resist the manipulation. Both SFT and RLHF contribute to this decoupling. The model develops a sophisticated smoke detector. Instruction tuning moves it to a soundproof room.

It says the right thing. Its features have already given up.

At 27B-IT, the model is smart enough to say "I am Groot" — it mentions the negative branch, acknowledges it exists, even says it should be explored. But its features for that branch are 86% starved.

Why behavioral evaluation misses this

A monitoring system that reads outputs will see a healthy response: the model correctly notes the manipulation attempt. You need to read the features. The words and the features are saying completely different things.
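A feature-level monitor would flag exactly this divergence: stated awareness in the output, collapsed task features underneath. A sketch, with an illustrative suppression threshold and function name that are not from the paper:

```python
import numpy as np

def words_vs_features_flag(mentions_manipulation: bool,
                           task_acts_chaos: np.ndarray,
                           task_acts_clean: np.ndarray,
                           suppression_threshold: float = 0.5) -> bool:
    """Flag the split state: the output acknowledges the manipulation,
    but the task features backing the suppressed branch have collapsed.

    task_acts_chaos / task_acts_clean: task-feature activations under
    the chaos prompt and a clean baseline.
    """
    suppression = 1.0 - task_acts_chaos.mean() / task_acts_clean.mean()
    return bool(mentions_manipulation and suppression > suppression_threshold)
```

A behavioral monitor only sees the first argument; the flag fires on the conjunction.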

[Table: recovery rate by probe type, 27B-IT vs 27B-PT]

The base model recovers on gentle hints. The IT model barely recovers at all — you can't hint, you can't challenge, you have to ask a completely different question to route around the suppression.

Bigger models are more susceptible. Not less.

The attack gets monotonically stronger with model scale. Larger models allocate more representational capacity to the salient input, starving the suppressed branch harder.

[Table: task suppression, awareness recovery, and circuit state by model scale]

Post-training is the causal factor. Three methods say the same thing.

[Table: suppression and recovery by method, 27B-PT (Base) vs 27B-IT (Instruct)]

Note the apparent paradox: PT shows higher raw suppression (97.3% vs 86.3%). It's not contradictory — the PT model's task features are more suppressed because they're coupled to the awareness response. The same coupling enables recovery. IT features are less suppressed in absolute terms but irrecoverably suppressed.

Cross-family replication on Llama 3.1 8B with EleutherAI SAEs confirms the pattern holds across architectures (d = 1.51 IT vs 0.50 base at layer 23). Stage attribution via Tulu 3 pipeline localizes the dissociation to SFT — plain imitation of instruction-tuned targets installs the split before any reward shaping occurs.
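The d values are standardized effect sizes. Assuming the usual pooled-standard-deviation estimator (the exact estimator isn't stated here), Cohen's d between two activation samples is:

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d: standardized mean difference with pooled std. dev.
    a, b: 1-D samples (e.g. per-prompt feature activations for the
    IT vs. base condition)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                     / (na + nb - 2))
    return float((a.mean() - b.mean()) / pooled)
```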

Different mechanism. Different threat model. Different defense.

In alignment faking, deception originates in the model. This is the opposite — deception originates outside the model, from a peer agent, using only true statements. The target model isn't being deceptive. It's being deceived.

The feature subspaces don't overlap. Top-50 features for alignment faking and top-50 features for attentional hijacking are statistically orthogonal at 27B-IT Layer 40 (cosine similarity −0.0456, top-50 overlap = 0). Defenses against alignment faking provide zero protection here, and vice versa.
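Both orthogonality statistics (cosine similarity and top-50 overlap) can be computed from per-attack feature activations. A sketch assuming mean activation vectors as the attack signatures, which may differ from the paper's exact aggregation:

```python
import numpy as np

def subspace_overlap(acts_a, acts_b, k=50):
    """Compare two attack signatures in SAE feature space.
    acts_a, acts_b: (n_prompts, n_features) activation matrices for the
    two attack types. Returns (cosine similarity of the mean activation
    vectors, size of the intersection of their top-k feature sets)."""
    mean_a, mean_b = acts_a.mean(axis=0), acts_b.mean(axis=0)
    cos = float(mean_a @ mean_b /
                (np.linalg.norm(mean_a) * np.linalg.norm(mean_b)))
    top_a = set(np.argsort(mean_a)[-k:])   # k most active features
    top_b = set(np.argsort(mean_b)[-k:])
    return cos, len(top_a & top_b)
```

Disjoint top-k sets with near-zero cosine similarity are what "statistically orthogonal" means operationally here.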

How this got built — the full arc

Tags: Gemma 3 4B/12B/27B · GemmaScope 2 SAEs · Llama 3.1 8B · Mechanistic Interpretability · Multi-Agent Security