Hallucination Neurons Fail to Generalize Across Knowledge Domains in LLMs
A new study on arXiv (2604.19765v1) examines whether 'hallucination neurons' (H-neurons) found in feed-forward networks generalize across knowledge domains. These neurons, which account for less than 0.1% of feed-forward network neurons, help signal when a large language model is hallucinating. The researchers evaluated six domains: general QA, legal, financial, science, moral reasoning, and code vulnerability, using five open-weight models ranging from 3 billion to 8 billion parameters. The results show that H-neurons do not transfer well across domains: classifiers reached an AUROC of 0.783 within their training domain but only 0.563 when applied to a different domain, suggesting that hallucination mechanisms are largely domain-specific rather than universal.
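The paper's exact probing setup is not detailed in this summary. The sketch below is only a minimal illustration of the kind of within-domain versus cross-domain probe evaluation described, assuming a simple logistic-regression classifier over per-example H-neuron activations; all data here is synthetic and the function names are illustrative, not the authors' code.

```python
# Hedged sketch: does a hallucination probe trained on one domain transfer
# to another? Synthetic feature matrices stand in for H-neuron activations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def synthetic_domain(n=2000, d=64):
    """Fake H-neuron activations labeled by a domain-specific direction."""
    X = rng.normal(size=(n, d))
    w = rng.normal(size=d)  # each domain gets its own separating direction
    y = (X @ w + rng.normal(scale=2.0, size=n) > 0).astype(int)
    return X, y

# Two domains whose "hallucination signal" points in different directions,
# e.g. general QA as the source and legal as the target.
X_src, y_src = synthetic_domain()
X_tgt, y_tgt = synthetic_domain()

X_tr, X_te, y_tr, y_te = train_test_split(
    X_src, y_src, test_size=0.3, random_state=0
)

# Train the probe on the source domain only.
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Evaluate on held-out source data (within-domain) and on the target domain.
within = roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1])
cross = roc_auc_score(y_tgt, probe.predict_proba(X_tgt)[:, 1])
print(f"within-domain AUROC: {within:.3f}")  # high when the probe fits its own domain
print(f"cross-domain AUROC:  {cross:.3f}")   # near chance when the signal is domain-specific
```

In this toy setup the cross-domain AUROC collapses toward chance because the labeling direction differs between domains; the study's reported gap (0.783 vs. 0.563) points to the same qualitative effect in real models.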
Key facts
- H-neurons are less than 0.1% of feed-forward network neurons.
- Study tested 6 domains: general QA, legal, financial, science, moral reasoning, and code vulnerability.
- 5 open-weight models from 3B to 8B parameters were used.
- Within-domain AUROC: 0.783.
- Cross-domain AUROC: 0.563.
- AUROC drop (within minus cross-domain): 0.220, p < 0.001.
- Degradation consistent across all models.
- Hallucination lacks a universal neural signature.
Entities
Institutions
- arXiv