HalluAudio Benchmark Introduced to Detect Hallucinations in Audio-Language Models

ai-technology · 2026-04-22

Researchers have introduced HalluAudio, presented as the first large-scale benchmark for evaluating hallucinations in Large Audio-Language Models (LALMs). It addresses a gap in current assessment methods: existing hallucination benchmarks focus mainly on text or vision, and the few audio-oriented studies are limited by small scale, narrow modality coverage, and shallow diagnostic depth. HalluAudio contains over 5,000 human-verified question-answer pairs covering speech, environmental sound, and music. It uses several evaluation formats: binary judgments, multi-choice reasoning, attribute verification, and open-ended questions. To trigger hallucinations systematically, the researchers designed adversarial prompts and mixed-audio conditions. In addition to accuracy, the evaluation protocol measures hallucination rate and yes/no bias. The work comes as LALMs show strong performance across audio-centric tasks while their tendency to generate semantically incorrect or acoustically unsupported responses remains insufficiently studied.

Key facts

HalluAudio is the first large-scale benchmark for evaluating hallucinations in Large Audio-Language Models
The benchmark contains over 5,000 human-verified QA pairs
It covers three audio domains: speech, environmental sound, and music
Evaluation includes binary judgments, multi-choice reasoning, attribute verification, and open-ended QA
Researchers designed adversarial prompts and mixed-audio conditions to induce hallucinations
The evaluation protocol measures hallucination rate and yes/no bias in addition to accuracy
Existing hallucination benchmarks mainly focus on text or vision domains
Large Audio-Language Models have recently achieved strong performance across audio-centric tasks
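
The summary names hallucination rate and yes/no bias as metrics but does not give their formulas, so the following is only an illustrative sketch for the binary-judgment format. It assumes (these definitions are not taken from the paper) that the hallucination rate is the share of "no"-answer questions the model answers "yes", and that yes bias is the gap between the model's share of "yes" answers and the share of "yes" ground truths. The names BinaryItem and score_binary are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BinaryItem:
    """One yes/no question about an audio clip (hypothetical structure)."""
    gold: str  # ground-truth answer: "yes" or "no"
    pred: str  # model answer: "yes" or "no"


def score_binary(items):
    """Compute accuracy plus two ASSUMED hallucination diagnostics.

    hallucination_rate: fraction of gold-"no" items answered "yes"
                        (the model affirms something not in the audio).
    yes_bias:           (predicted "yes" count - gold "yes" count) / total;
                        positive values mean the model over-answers "yes".
    These definitions are assumptions, not the benchmark's official ones.
    """
    total = len(items)
    if total == 0:
        raise ValueError("need at least one item")

    correct = sum(item.pred == item.gold for item in items)
    negatives = [item for item in items if item.gold == "no"]
    false_yes = sum(item.pred == "yes" for item in negatives)
    yes_pred = sum(item.pred == "yes" for item in items)
    yes_gold = sum(item.gold == "yes" for item in items)

    return {
        "accuracy": correct / total,
        # Undefined without gold-"no" items; report 0.0 in that case.
        "hallucination_rate": false_yes / len(negatives) if negatives else 0.0,
        "yes_bias": (yes_pred - yes_gold) / total,
    }


# Made-up toy data, not benchmark results.
demo = [
    BinaryItem(gold="yes", pred="yes"),
    BinaryItem(gold="no", pred="yes"),
    BinaryItem(gold="no", pred="no"),
    BinaryItem(gold="no", pred="yes"),
]
metrics = score_binary(demo)
print(metrics)
```

In this toy data the model is right on half the items, yet it answers "yes" to two of the three questions whose correct answer is "no". Accuracy alone does not show that pattern, which is one plausible reason a protocol would report the additional metrics.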

Sources

arXiv cs.AI — 2026-04-22