Research Identifies Specific Neurons Responsible for Citation Hallucinations in Large Language Models
A recent study posted to arXiv (2604.18880v1) examines why large language models produce plausible but entirely fabricated citations. Analyzing 108,000 generated references from nine models, the researchers found that author names were the most frequently fabricated element across all models and settings. Citation style had no significant effect on hallucination rates, and reasoning-oriented distillation actually degraded recall performance. Applying elastic-net regularization with stability selection to neuron-level CETT values in Qwen2.5-32B-Instruct, the researchers identified a small set of field-specific hallucination neurons (FH-neurons). Causal intervention experiments showed that amplifying these neurons increased hallucinations, while suppressing them improved overall performance, particularly for certain citation fields. Probes trained to detect hallucinations in one citation field performed poorly on others, indicating that hallucination signals do not transfer across citation components. The study also found that models often express high confidence in these references despite their being entirely fabricated.
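The summary does not include the authors' code, but the selection step can be sketched. Below is a minimal, illustrative reading of elastic-net stability selection, assuming a feature matrix X of neuron-level CETT values (rows = generated citations, columns = neurons) and binary labels y marking hallucinated fields. The estimator choice (an elastic-net-penalized logistic classifier), the hyperparameters, and all names such as stability_select and n_bootstraps are assumptions for illustration, not details from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def stability_select(X, y, n_bootstraps=100, l1_ratio=0.5, C=0.1,
                     threshold=0.7, rng_seed=0):
    """Return indices of neurons whose elastic-net coefficient is
    nonzero in at least `threshold` of bootstrap subsamples.

    X : (n_samples, n_neurons) neuron-level CETT values (assumed layout)
    y : (n_samples,) binary labels, 1 = hallucinated citation field
    """
    rng = np.random.default_rng(rng_seed)
    n_samples, n_neurons = X.shape
    counts = np.zeros(n_neurons)
    for _ in range(n_bootstraps):
        # Refit on a random half of the data without replacement, as in
        # classic stability selection (Meinshausen & Buhlmann, 2010).
        idx = rng.choice(n_samples, size=n_samples // 2, replace=False)
        model = LogisticRegression(
            penalty="elasticnet", solver="saga",
            l1_ratio=l1_ratio, C=C, max_iter=5000,
        )
        model.fit(X[idx], y[idx])
        counts += (np.abs(model.coef_[0]) > 1e-8)
    freq = counts / n_bootstraps
    return np.where(freq >= threshold)[0]  # candidate FH-neuron indices
```

Keeping only neurons whose coefficients survive repeated refits on random half-samples guards against features that look predictive in a single fit by chance, which is consistent with the study reporting only a limited number of FH-neurons.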
Key facts
- Study examines 108,000 generated references across nine LLMs
- Author names are fabricated most frequently across all models and settings
- Citation style has no measurable effect on hallucination rates
- Reasoning-oriented distillation degrades recall performance
- Hallucination detection doesn't generalize across different citation fields
- Researchers identified field-specific hallucination neurons in Qwen2.5-32B-Instruct
- Amplifying FH-neurons increases hallucination, suppressing them improves performance (see the intervention sketch after this list)
- Models express high confidence in fictitious citations
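The paper reports causal interventions that scale the identified neurons up or down. Below is a rough sketch of that kind of intervention using a PyTorch forward hook on one MLP activation, assuming the Hugging Face transformers Qwen2 module layout; the layer index, neuron indices, and scale factor are placeholders, not values from the study.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-32B-Instruct"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # requires the accelerate package
)

LAYER = 40                       # placeholder layer index
FH_NEURONS = [113, 2048, 7777]   # placeholder neuron indices
SCALE = 0.0                      # 0.0 suppresses; >1.0 amplifies

def scale_fh_neurons(module, inputs, output):
    # Scale the selected post-activation MLP units in place.
    output[..., FH_NEURONS] *= SCALE
    return output

# Hook the MLP activation of one decoder block (Qwen2-style layout:
# model.model.layers[i].mlp.act_fn is the gated activation function).
handle = model.model.layers[LAYER].mlp.act_fn.register_forward_hook(
    scale_fh_neurons
)

prompt = "List three peer-reviewed papers on protein folding, with full citations."
ids = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**ids, max_new_tokens=200)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```

With SCALE = 0.0 the selected activations are zeroed (suppression); values above 1.0 amplify them. Running Qwen2.5-32B-Instruct this way requires tens of gigabytes of accelerator memory, so a smaller checkpoint may be more practical for experimentation.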