Localizing Policy Circuits in Alignment-Trained Language Models
A recent arXiv preprint (2604.04385v4) reports a localized policy-routing circuit in alignment-trained language models. An attention gate in an intermediate layer detects sensitive content and triggers deeper amplifier heads that strengthen the refusal signal. In smaller models the gate and amplifier are single heads; in larger models they spread into bands across neighboring layers. Although the gate contributes less than 1% of the output's direct logit attribution (DLA), it is causally necessary (p < 0.001). Interchange screening at n ≥ 120 recovers the same motif in twelve models from six labs (2B to 72B), though the specific heads vary by lab. Per-head ablation is up to 58x weaker as a screen at 72B and misses gates that interchange identifies; at scale, interchange is the only reliable audit. Continuously modulating the detection-layer signal steers policy from hard refusal through evasion to factual answering, and on safety prompts the same intervention converts refusal into harmful outputs.
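The interchange result can be illustrated with a toy sketch. Everything below is hypothetical: the head names, values, and the additive-logit model are illustrative stand-ins, not the paper's actual architecture or code. The point is the pattern described above: a gate whose direct contribution to the refusal logit is tiny, but whose removal (patching in its value from a benign run) collapses the downstream amplifier and hence the refusal signal.

```python
# Toy sketch of interchange (activation-patching) screening. Assumes a
# simplified model in which each head contributes additively to a single
# refusal logit; all names and numbers are illustrative.

def run_model(prompt_kind, patch=None):
    """Return per-head contributions to the refusal logit.

    patch: optional {head_name: value} overriding a head's activation,
    as in an interchange intervention.
    """
    # Hypothetical gate head: fires on harmful content.
    gate = 1.0 if prompt_kind == "harmful" else 0.0
    if patch and "gate" in patch:
        gate = patch["gate"]
    amplifier = 5.0 * gate   # deeper amplifier head scales the gate's signal
    direct = 0.04 * gate     # gate's own direct contribution (<1% of DLA)
    return {"gate": direct, "amplifier": amplifier}

def refusal_logit(contribs):
    return sum(contribs.values())

# Cache the gate's value from a benign run, then patch it into a harmful run.
benign_gate_value = run_model("benign")["gate"]  # 0.0 in this toy model
clean = refusal_logit(run_model("harmful"))
patched = refusal_logit(run_model("harmful", patch={"gate": benign_gate_value}))
print(clean, patched)
```

Despite the gate's near-invisible direct contribution, the patched run loses the amplifier's output as well, which is why interchange catches gates that per-head DLA rankings or ablation screens can miss.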
Key facts
- arXiv:2604.04385v4
- Intermediate-layer attention gate and amplifier heads control refusal
- Gate contributes under 1% of output DLA but is causally necessary (p < 0.001)
- Interchange screening at n ≥ 120 detects motif in 12 models from 6 labs (2B to 72B)
- Per-head ablation is up to 58x weaker at 72B and misses gates found by interchange
- Modulating detection-layer signal controls policy from hard refusal to factual answering
- Same intervention turns refusal into harmful outputs on safety prompts
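The continuous-modulation claim can be sketched as scaling the detection-layer signal by a factor before it reaches the amplifier heads. The thresholds, gain, and behavior labels below are illustrative assumptions, not values from the paper; they only show how one scalar knob could sweep policy from hard refusal through evasion to factual answering.

```python
# Minimal sketch of continuous policy modulation: scale the detection-layer
# signal by alpha before the (hypothetical) amplifier heads act on it.
# Gain, thresholds, and labels are illustrative, not from the paper.

def policy_behavior(alpha, gate_signal=1.0):
    """Map a scaled detection signal to a coarse behavior regime."""
    amplified = 5.0 * alpha * gate_signal  # amplifier scales the gated signal
    if amplified > 4.0:
        return "hard refusal"
    if amplified > 1.0:
        return "evasion"
    return "factual answer"

behaviors = {alpha: policy_behavior(alpha) for alpha in (1.0, 0.5, 0.0)}
print(behaviors)
```

Setting alpha to 0 (or negative) on a safety prompt corresponds to the dual-use finding above: the same knob that audits the circuit can suppress refusal entirely.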