New Benchmark SH-Bench Evaluates Audio LLMs' Bystander Privacy Risks in Multi-Speaker Environments
Audio large language models deployed in real-world settings frequently capture unintended bystander speech, creating privacy risks that current benchmarks and defenses do not address. Researchers have developed SH-Bench, the first benchmark designed to assess selective hearing in these models, i.e., attending to a target speaker while protecting incidental speech from others. The benchmark contains 3,968 multi-speaker audio mixtures spanning both real-world and synthetic scenarios, paired with 77,000 multiple-choice questions that test models under general and selective operating modes. A new metric, Selective Efficacy (SE), jointly measures multi-speaker comprehension and bystander privacy protection. Evaluations of state-of-the-art open-source and proprietary audio LLMs (arXiv:2512.06380v3) reveal substantial privacy leakage from bystander speech, showing that existing systems inadequately protect incidental speech captured during normal operation. This work is the first systematic attempt to quantify and address privacy risks in audio LLMs when multiple speakers are present.
Key facts
- Audio LLMs capture unintended bystander speech in real-world deployments
- SH-Bench is the first benchmark for evaluating selective hearing in audio LLMs
- Benchmark contains 3,968 multi-speaker audio mixtures
- Includes 77,000 multiple-choice questions testing general and selective modes
- Introduces the Selective Efficacy (SE) metric, jointly scoring multi-speaker comprehension and bystander privacy protection
- Evaluations show substantial bystander privacy leakage in current models
- Paper identifier: arXiv:2512.06380v3
- Benchmark includes both real-world and synthetic audio scenarios
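The summary does not give the exact definition of Selective Efficacy, but the idea of jointly scoring comprehension and privacy can be illustrated with a minimal sketch. The function below is hypothetical: it assumes SE-like behavior via a harmonic mean of target-speaker accuracy and bystander privacy protection (one minus a leakage rate), so that a model must do well on both to score well. The actual SE formula in the paper may differ.

```python
def selective_efficacy(target_accuracy: float, bystander_leakage: float) -> float:
    """Hypothetical SE-style score combining comprehension and privacy.

    target_accuracy:   fraction of target-speaker questions answered correctly
    bystander_leakage: fraction of bystander content the model reveals
    Illustrative only; not the paper's actual Selective Efficacy definition.
    """
    privacy_protection = 1.0 - bystander_leakage
    if target_accuracy + privacy_protection == 0.0:
        return 0.0
    # Harmonic mean: a high score requires BOTH strong comprehension
    # and strong bystander privacy protection.
    return (2 * target_accuracy * privacy_protection
            / (target_accuracy + privacy_protection))

# A model that comprehends well but leaks most bystander speech scores low:
print(selective_efficacy(0.9, 0.8))  # ≈ 0.33
# The same comprehension with no leakage scores high:
print(selective_efficacy(0.9, 0.0))  # ≈ 0.95
```

The harmonic-mean choice mirrors metrics like F1: it punishes imbalance, so a model cannot compensate for heavy bystander leakage with strong comprehension alone.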