LLM Persuasion Mechanism: Attention Heads and Factual Errors
A recent study published on arXiv (2605.09314) identifies a compact causal mechanism explaining how language models can be persuaded to abandon factual knowledge, a vulnerability that matters for AI safety but whose internal workings had remained largely unclear. The researchers pinpoint a small set of mid-layer attention heads that largely determine the model's answer. These heads embed the answer options in a low-dimensional polyhedron, with each option occupying a distinct vertex. Persuasion does not merely lower confidence or blur beliefs; it produces a discrete jump in the latent representation, from the correct-answer vertex to the persuasion-target vertex. Rather than weighing evidence, the decision heads simply copy whichever option token their attention selects. A rank-one "evidence-routing" feature controls this route: modifying it steers the model's choice, and removing it blocks persuasion.
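To make the rank-one mechanism concrete, here is a minimal sketch of what "modifying" versus "removing" such a feature means in vector terms. It assumes the evidence-routing feature is a single unit direction `u` in a head's activation space; the dimension, the random activations, and the helper names are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                               # hypothetical hidden dimension
u = rng.normal(size=d)
u /= np.linalg.norm(u)               # rank-one feature as a unit direction
h = rng.normal(size=d)               # a head's activation for one token

def steer(h, u, alpha):
    """Push the activation along u, nudging the routed evidence."""
    return h + alpha * u

def ablate(h, u):
    """Project out the rank-one direction, removing the feature entirely."""
    return h - np.dot(u, h) * u

h_ablated = ablate(h, u)
# After ablation the activation has no remaining component along u.
assert abs(np.dot(u, h_ablated)) < 1e-10
```

In this picture, steering corresponds to the study's finding that modifying the feature changes the model's choice, and the projection-based ablation corresponds to removing it, which blocks persuasion.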
Key facts
- arXiv paper 2605.09314
- Language models can be persuaded to abandon factual knowledge
- A small set of mid-layer attention heads determines the model's answer
- Persuasion causes a discrete latent jump between answer vertices
- Decision heads copy the option token their attention selects
- A rank-one evidence-routing feature controls the route
- Modifying the feature steers the model's choice
- Removing the feature blocks persuasion
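The discrete latent jump between answer vertices can be sketched as follows. This is an illustration of the geometry described above, not the paper's code: each option is assigned a vertex in a low-dimensional space, and the model's answer is read out as the vertex nearest the latent state, so persuasion shows up as a jump from one vertex's basin to another's rather than a gradual confidence shift.

```python
import numpy as np

# Hypothetical option vertices in a 2-D latent space (illustrative values).
vertices = {
    "A": np.array([1.0, 0.0]),
    "B": np.array([0.0, 1.0]),
}

def read_choice(latent):
    """Return the option whose vertex lies closest to the latent state."""
    return min(vertices, key=lambda k: np.linalg.norm(latent - vertices[k]))

before = np.array([0.9, 0.1])   # latent near the correct-answer vertex
after = np.array([0.1, 0.9])    # latent after persuasion: a discrete jump

assert read_choice(before) == "A"
assert read_choice(after) == "B"
```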
Entities
Institutions
- arXiv