LLM Persuasion Mechanism: Attention Heads and Factual Errors
A recent study published on arXiv (2605.09314) identifies a compact causal mechanism explaining how language models can be persuaded to abandon factual knowledge, a vulnerability that matters for AI safety but whose internal workings had remained largely unclear. The researchers pinpoint a small set of mid-layer attention heads that largely determine the model's answer. These heads embed the answer options in a low-dimensional polyhedron, with each option occupying a distinct vertex. Persuasion does not merely lower confidence or blur beliefs; it produces a discrete jump in the latent representation, from the correct-answer vertex to the persuasion-target vertex. Rather than weighing evidence, the decision heads simply copy whichever option token their attention selects. A rank-one "evidence-routing" feature controls this route: modifying it steers the model's choice, and removing it blocks persuasion.
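To make the rank-one mechanism concrete, here is a minimal sketch of what "modifying" versus "removing" such a feature means in vector terms. It assumes the evidence-routing feature is a single unit direction `u` in a head's activation space; the dimension, the random activations, and the helper names are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                               # hypothetical hidden dimension
u = rng.normal(size=d)
u /= np.linalg.norm(u)               # rank-one feature as a unit direction
h = rng.normal(size=d)               # a head's activation for one token

def steer(h, u, alpha):
    """Push the activation along u, nudging the routed evidence."""
    return h + alpha * u

def ablate(h, u):
    """Project out the rank-one direction, removing the feature entirely."""
    return h - np.dot(u, h) * u

h_ablated = ablate(h, u)
# After ablation the activation has no remaining component along u.
assert abs(np.dot(u, h_ablated)) < 1e-10
```

In this picture, steering corresponds to the study's finding that modifying the feature changes the model's choice, and the projection-based ablation corresponds to removing it, which blocks persuasion.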
Key facts
- arXiv paper 2605.09314
- Language models can be persuaded to abandon factual knowledge
- A small set of mid-layer attention heads determines the model's answer
- Persuasion causes a discrete latent jump between answer vertices
- Decision heads copy the option token their attention selects
- A rank-one evidence-routing feature controls the route
- Modifying the feature steers the model's choice
- Removing the feature blocks persuasion
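The discrete latent jump between answer vertices can be sketched as follows. This is an illustration of the geometry described above, not the paper's code: each option is assigned a vertex in a low-dimensional space, and the model's answer is read out as the vertex nearest the latent state, so persuasion shows up as a jump from one vertex's basin to another's rather than a gradual confidence shift.

```python
import numpy as np

# Hypothetical option vertices in a 2-D latent space (illustrative values).
vertices = {
    "A": np.array([1.0, 0.0]),
    "B": np.array([0.0, 1.0]),
}

def read_choice(latent):
    """Return the option whose vertex lies closest to the latent state."""
    return min(vertices, key=lambda k: np.linalg.norm(latent - vertices[k]))

before = np.array([0.9, 0.1])   # latent near the correct-answer vertex
after = np.array([0.1, 0.9])    # latent after persuasion: a discrete jump

assert read_choice(before) == "A"
assert read_choice(after) == "B"
```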
Entities
Institutions
- arXiv