New Benchmark Exposes Gaps in LLM Hallucination Detection
A new study posted to arXiv (2605.11330) establishes a set of desiderata for hallucination detection benchmarks (HDBs) and finds that no existing benchmark satisfies all of them. The authors identify two critical gaps: the lack of RAG-based grounded benchmarks with long contexts, and the absence of realistic label noise for stress-testing detectors. To address both, they build and release a new benchmark designed to fill these gaps, yielding insights for more robust evaluation of LLM hallucination detectors.
Key facts
- arXiv paper 2605.11330 establishes a set of desiderata for hallucination detection benchmarks.
- Existing HDBs do not exhibit all desired properties.
- The two largest gaps: a lack of RAG-based grounded benchmarks with long contexts, and a lack of realistic label noise.
- Long context impedes human annotation for RAG benchmarks.
- Real-world use cases often grapple with label noise from human or automated annotation.
- The authors build and release a new benchmark to close these gaps.
- The work provides new insights for evaluating LLM hallucination detectors.
- The benchmark is RAG-based and includes long context and label noise.
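The article does not describe how the paper implements label noise, but the idea of stress-testing a detector under noisy labels can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the `noisy_labels` helper, the toy predictions, and the flip rate are hypothetical, not taken from the paper.

```python
import random

def noisy_labels(labels, flip_rate, seed=0):
    """Flip each binary hallucination label with probability `flip_rate`.

    Hypothetical helper simulating annotation noise; not from the paper.
    """
    rng = random.Random(seed)
    return [1 - y if rng.random() < flip_rate else y for y in labels]

def accuracy(preds, labels):
    """Fraction of predictions that match the (possibly noisy) labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Toy detector predictions against clean gold labels (1 = hallucination).
gold  = [0, 1, 0, 0, 1, 1, 0, 1, 0, 1]
preds = [0, 1, 0, 1, 1, 0, 0, 1, 0, 1]

clean_acc = accuracy(preds, gold)
noisy_acc = accuracy(preds, noisy_labels(gold, flip_rate=0.3))
print(f"measured accuracy on clean labels: {clean_acc:.2f}")
print(f"measured accuracy on noisy labels: {noisy_acc:.2f}")
```

Comparing a detector's measured accuracy on clean versus corrupted labels shows how sensitive an evaluation is to annotation errors; a benchmark with built-in realistic noise makes this robustness check part of the standard protocol.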