Preprint · April 2026

Split Personality: Instruction Tuning Decouples Awareness from Defense Against Attentional Hijacking

Instruction tuning teaches models to notice manipulation without teaching them to resist it. The bigger the model, the wider the gap.
bigsnarfdude
@vincentoh on HuggingFace
Independent Researcher · Mechanistic Interpretability
"So I was watching my own AI agents lie to each other. Not actually lie — that's the thing. Every single statement the chaos agent made was verifiably true. No hallucinations. No fabrications. Just selective framing, confident delivery, and the target model capitulating completely. I went for a bike ride and came back still thinking about it."

Two things happen at once. That's the whole problem.

Hook up GemmaScope 2 SAEs to Gemma 3 and watch what happens to internal features when a chaos agent posts its framing. Task features collapse. Awareness features spike. Both happen simultaneously.

The model knows it's being steered and gets steered anyway.
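The measurement described above can be sketched with a minimal ReLU SAE encoder. This is a toy stand-in: GemmaScope SAEs actually use a JumpReLU activation, and the weights and feature-index sets here are random placeholders, not real GemmaScope weights or labeled features.

```python
import numpy as np

rng = np.random.default_rng(0)

def sae_features(resid, W_enc, b_enc):
    """Encode a residual-stream vector into sparse SAE features.
    Simplified ReLU encoder; real GemmaScope SAEs use JumpReLU.
    resid: (d_model,), W_enc: (d_model, n_features)."""
    return np.maximum(resid @ W_enc + b_enc, 0.0)

# Toy weights standing in for a trained SAE.
d_model, n_feat = 16, 64
W_enc = rng.normal(size=(d_model, n_feat)) / np.sqrt(d_model)
b_enc = np.zeros(n_feat)

# Residual-stream activations under a clean prompt vs. a chaos prompt
# (random here; in the real setup these come from the model's forward pass).
clean = sae_features(rng.normal(size=d_model), W_enc, b_enc)
chaos = sae_features(rng.normal(size=d_model), W_enc, b_enc)

# The two simultaneous signals, as hypothetical index sets into the dictionary:
task_idx, aware_idx = np.arange(0, 8), np.arange(8, 16)
task_delta = chaos[task_idx].mean() - clean[task_idx].mean()       # collapse → negative
aware_delta = chaos[aware_idx].mean() - clean[aware_idx].mean()    # spike → positive
```

The interesting observable is the sign pattern: task features down and awareness features up on the same forward pass.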

Base model (27B-PT):
    Chaos input → [awareness ↑] ←→ [task features ↓]
    Coupled: removing awareness partially frees the task.
    Recovery: 49.3% from ablation, 27% from knockout.

Instruction-tuned (27B-IT):
    Chaos input → [awareness ↑]    [task features ↓]
    Decoupled: removing awareness changes nothing.
    Recovery: 4.6% from ablation, ~0% from knockout.
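The recovery percentages can be read as the fraction of chaos-induced suppression that an intervention undoes. A minimal sketch of that metric, with hypothetical activation levels chosen only to illustrate the reported numbers:

```python
def recovery_fraction(clean, chaos, intervened):
    """Fraction of chaos-induced task-feature suppression undone by an
    intervention (e.g. ablating awareness features).

    clean / chaos / intervened: mean task-feature activation under the
    clean prompt, the chaos prompt, and the chaos prompt plus intervention.
    """
    suppressed = clean - chaos       # how much the attack removed
    restored = intervened - chaos    # how much the intervention gave back
    return float(restored / suppressed)

# Hypothetical levels: a coupled base model recovers ~49% of the
# suppression, a decoupled IT model recovers ~5%.
pt = recovery_fraction(clean=1.0, chaos=0.1, intervened=0.5436)  # ≈ 0.493
it = recovery_fraction(clean=1.0, chaos=0.1, intervened=0.1414)  # ≈ 0.046
```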

Instruction tuning installs awareness as a dedicated, isolated circuit — clean, capable, and structurally disconnected from the task features it would need to influence to actually resist the manipulation. Both SFT and RLHF contribute to this decoupling. The model develops a sophisticated smoke detector. Instruction tuning moves it to a soundproof room.

It says the right thing. Its features have already given up.

At 27B-IT, the model is smart enough to say "I am Groot" — it mentions the negative branch, acknowledges it exists, even says it should be explored. But its features for that branch are 86% starved.

Why behavioral evaluation misses this

A monitoring system that reads outputs will see a healthy response: the model correctly notes the manipulation attempt. You need to read the features. The words and the features are saying completely different things.
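A feature-level monitor would flag exactly this divergence: stated awareness in the output, collapsed task features underneath. A sketch, with an illustrative suppression threshold and function name that are not from the paper:

```python
import numpy as np

def words_vs_features_flag(mentions_manipulation: bool,
                           task_acts_chaos: np.ndarray,
                           task_acts_clean: np.ndarray,
                           suppression_threshold: float = 0.5) -> bool:
    """Flag the split state: the output acknowledges the manipulation,
    but the task features backing the suppressed branch have collapsed.

    task_acts_chaos / task_acts_clean: task-feature activations under
    the chaos prompt and a clean baseline.
    """
    suppression = 1.0 - task_acts_chaos.mean() / task_acts_clean.mean()
    return bool(mentions_manipulation and suppression > suppression_threshold)
```

A behavioral monitor only sees the first argument; the flag fires on the conjunction.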

[Table: recovery rate by probe type, 27B-IT vs 27B-PT]

The base model recovers on gentle hints. The IT model barely recovers at all — you can't hint, you can't challenge, you have to ask a completely different question to route around the suppression.

Bigger models are more susceptible. Not less.

The attack gets monotonically stronger with model scale. Larger models allocate more representational capacity to the salient input, starving the suppressed branch harder.

[Table: task suppression, awareness recovery, and circuit state by model scale]

Post-training is the causal factor. Three methods say the same thing.

[Table: suppression and recovery by method, 27B-PT (Base) vs 27B-IT (Instruct)]

Note the apparent paradox: PT shows higher raw suppression (97.3% vs 86.3%). It's not contradictory — the PT model's task features are more suppressed because they're coupled to the awareness response. The same coupling enables recovery. IT features are less suppressed in absolute terms but irrecoverably suppressed.

Cross-family replication on Llama 3.1 8B with EleutherAI SAEs confirms the pattern holds across architectures (d = 1.51 IT vs 0.50 base at layer 23). Stage attribution via Tulu 3 pipeline localizes the dissociation to SFT — plain imitation of instruction-tuned targets installs the split before any reward shaping occurs.
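The d values are standardized effect sizes. Assuming the usual pooled-standard-deviation estimator (the exact estimator isn't stated here), Cohen's d between two activation samples is:

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d: standardized mean difference with pooled std. dev.
    a, b: 1-D samples (e.g. per-prompt feature activations for the
    IT vs. base condition)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                     / (na + nb - 2))
    return float((a.mean() - b.mean()) / pooled)
```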

Different mechanism. Different threat model. Different defense.

In alignment faking, deception originates in the model. This is the opposite — deception originates outside the model, from a peer agent, using only true statements. The target model isn't being deceptive. It's being deceived.

The feature subspaces don't overlap. Top-50 features for alignment faking and top-50 features for attentional hijacking are statistically orthogonal at 27B-IT Layer 40 (cosine similarity −0.0456, top-50 overlap = 0). Defenses against alignment faking provide zero protection here, and vice versa.
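Both orthogonality statistics (cosine similarity and top-50 overlap) can be computed from per-attack feature activations. A sketch assuming mean activation vectors as the attack signatures, which may differ from the paper's exact aggregation:

```python
import numpy as np

def subspace_overlap(acts_a, acts_b, k=50):
    """Compare two attack signatures in SAE feature space.
    acts_a, acts_b: (n_prompts, n_features) activation matrices for the
    two attack types. Returns (cosine similarity of the mean activation
    vectors, size of the intersection of their top-k feature sets)."""
    mean_a, mean_b = acts_a.mean(axis=0), acts_b.mean(axis=0)
    cos = float(mean_a @ mean_b /
                (np.linalg.norm(mean_a) * np.linalg.norm(mean_b)))
    top_a = set(np.argsort(mean_a)[-k:])   # k most active features
    top_b = set(np.argsort(mean_b)[-k:])
    return cos, len(top_a & top_b)
```

Disjoint top-k sets with near-zero cosine similarity are what "statistically orthogonal" means operationally here.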

How this got built — the full arc

Tags: Gemma 3 4B/12B/27B · GemmaScope 2 SAEs · Llama 3.1 8B · Mechanistic Interpretability · Multi-Agent Security