Localizing Policy Circuits in Alignment-Trained Language Models
A recent arXiv preprint (2604.04385v4) reports a localized policy-routing circuit in alignment-trained language models. An attention gate in an intermediate layer detects sensitive content and triggers deeper amplifier heads that strengthen the refusal signal. In smaller models the gate and amplifier are single heads; in larger models they spread into bands across neighboring layers. Although the gate contributes less than 1% of the output's direct logit attribution (DLA), it is causally necessary (p < 0.001). Interchange screening at n ≥ 120 recovers the same motif in twelve models from six labs (2B to 72B), though the specific heads vary by lab. Per-head ablation is up to 58x weaker as a screen at 72B and misses gates that interchange identifies; at scale, interchange is the only reliable audit. Continuously modulating the detection-layer signal steers policy from hard refusal through evasion to factual answering, and on safety prompts the same intervention converts refusal into harmful outputs.
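The interchange result can be illustrated with a toy sketch. Everything below is hypothetical: the head names, values, and the additive-logit model are illustrative stand-ins, not the paper's actual architecture or code. The point is the pattern described above: a gate whose direct contribution to the refusal logit is tiny, but whose removal (patching in its value from a benign run) collapses the downstream amplifier and hence the refusal signal.

```python
# Toy sketch of interchange (activation-patching) screening. Assumes a
# simplified model in which each head contributes additively to a single
# refusal logit; all names and numbers are illustrative.

def run_model(prompt_kind, patch=None):
    """Return per-head contributions to the refusal logit.

    patch: optional {head_name: value} overriding a head's activation,
    as in an interchange intervention.
    """
    # Hypothetical gate head: fires on harmful content.
    gate = 1.0 if prompt_kind == "harmful" else 0.0
    if patch and "gate" in patch:
        gate = patch["gate"]
    amplifier = 5.0 * gate   # deeper amplifier head scales the gate's signal
    direct = 0.04 * gate     # gate's own direct contribution (<1% of DLA)
    return {"gate": direct, "amplifier": amplifier}

def refusal_logit(contribs):
    return sum(contribs.values())

# Cache the gate's value from a benign run, then patch it into a harmful run.
benign_gate_value = run_model("benign")["gate"]  # 0.0 in this toy model
clean = refusal_logit(run_model("harmful"))
patched = refusal_logit(run_model("harmful", patch={"gate": benign_gate_value}))
print(clean, patched)
```

Despite the gate's near-invisible direct contribution, the patched run loses the amplifier's output as well, which is why interchange catches gates that per-head DLA rankings or ablation screens can miss.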
Key facts
- arXiv:2604.04385v4
- Intermediate-layer attention gate and amplifier heads control refusal
- Gate contributes under 1% of output DLA but is causally necessary (p < 0.001)
- Interchange screening at n ≥ 120 detects motif in 12 models from 6 labs (2B to 72B)
- Per-head ablation is up to 58x weaker at 72B and misses gates found by interchange
- Modulating detection-layer signal controls policy from hard refusal to factual answering
- Same intervention turns refusal into harmful outputs on safety prompts
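The continuous-modulation claim can be sketched as scaling the detection-layer signal by a factor before it reaches the amplifier heads. The thresholds, gain, and behavior labels below are illustrative assumptions, not values from the paper; they only show how one scalar knob could sweep policy from hard refusal through evasion to factual answering.

```python
# Minimal sketch of continuous policy modulation: scale the detection-layer
# signal by alpha before the (hypothetical) amplifier heads act on it.
# Gain, thresholds, and labels are illustrative, not from the paper.

def policy_behavior(alpha, gate_signal=1.0):
    """Map a scaled detection signal to a coarse behavior regime."""
    amplified = 5.0 * alpha * gate_signal  # amplifier scales the gated signal
    if amplified > 4.0:
        return "hard refusal"
    if amplified > 1.0:
        return "evasion"
    return "factual answer"

behaviors = {alpha: policy_behavior(alpha) for alpha in (1.0, 0.5, 0.0)}
print(behaviors)
```

Setting alpha to 0 (or negative) on a safety prompt corresponds to the dual-use finding above: the same knob that audits the circuit can suppress refusal entirely.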