Proxy Analyzer Detects LLM Hallucinations via Internal Activations
Researchers have introduced a proxy-analyzer framework for detecting hallucinations, the fabricated or unsupported statements that large language models sometimes produce. Rather than inspecting the generating model directly, the framework feeds the generated text into a compact, locally hosted "reader" model and examines that reader's internal activations for signs of hallucination. Because only the output text is required, the approach works for closed APIs such as GPT-4 as well as for open-weight generators. The team built eighteen features from the reader's transformer internals, including metrics of transformer processing and novel token-level statistics, trained a stacking ensemble on 72,135 samples from five hallucination-focused datasets, and evaluated it across seven analyzer architectures, consistently outperforming baseline detectors.
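As a rough sketch of the reader step, the snippet below runs a piece of generated text through a small open-weight model and summarizes a few token-level statistics from its activations. The model choice (Qwen/Qwen2.5-0.5B) and the handful of statistics are illustrative assumptions, not a reproduction of the paper's eighteen features.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B"  # assumed small reader model; any compact open-weight LM works
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def reader_features(text: str) -> dict:
    """Pass generated text through the reader model and summarize
    token-level statistics from its internal activations."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    logits = out.logits[0, :-1]          # reader's predictions for tokens 1..n
    targets = ids[0, 1:]                 # the tokens actually in the text
    logprobs = torch.log_softmax(logits, dim=-1)
    tok_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    probs = logprobs.exp()
    entropy = -(probs * logprobs).sum(-1)   # per-token predictive entropy
    last_h = out.hidden_states[-1][0]       # final-layer activations
    return {
        "mean_logprob": tok_lp.mean().item(),
        "min_logprob": tok_lp.min().item(),
        "mean_entropy": entropy.mean().item(),
        "hidden_norm_mean": last_h.norm(dim=-1).mean().item(),
    }
```

The intuition is that text the reader finds surprising or processes anomalously (low log-probability, high entropy, unusual activation norms) is more likely to be hallucinated.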
Key facts
- Proxy-analyzer framework detects hallucinations in LLMs
- System reads generated text through a small open-weight model
- Uses reader's internal activations to spot hallucinations
- Works for closed APIs like GPT-4 and open-weight generators
- Eighteen features built from transformer internals
- Stacking ensemble trained on 72,135 samples from five datasets (see the sketch after this list)
- Tested on seven analyzer architectures from 0.5B to 9B parameters
- Consistently beats baselines across all tested models
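The detector itself is a stacking ensemble over reader-derived features. The sketch below shows the general pattern with scikit-learn; the choice of base learners, meta-learner, and the synthetic 18-feature stand-in data are assumptions for illustration, not the paper's exact configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data: 18 features per sample mirrors the paper's feature count;
# real rows would come from the reader-model feature extractor above.
X, y = make_classification(n_samples=2000, n_features=18, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("gbdt", GradientBoostingClassifier()),
        ("rf", RandomForestClassifier(n_estimators=200)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # out-of-fold base predictions feed the meta-learner
)
stack.fit(X_tr, y_tr)
hallucination_scores = stack.predict_proba(X_te)[:, 1]  # P(text is hallucinated)
```

Stacking lets heterogeneous base classifiers contribute complementary views of the features, with the meta-learner trained on out-of-fold predictions to avoid leakage.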