New Benchmark SH-Bench Evaluates Audio LLMs' Bystander Privacy Risks in Multi-Speaker Environments
Audio large language models deployed in real-world settings frequently capture unintended bystander speech, creating privacy risks that current benchmarks and defenses do not address. Researchers have developed SH-Bench, the first benchmark designed to assess selective hearing in these models, i.e., attending to a target speaker while protecting incidental speech from others. The benchmark contains 3,968 multi-speaker audio mixtures spanning both real-world and synthetic scenarios, paired with 77,000 multiple-choice questions that test models under general and selective operating modes. A new metric, Selective Efficacy (SE), jointly measures multi-speaker comprehension and bystander privacy protection. Evaluations of state-of-the-art open-source and proprietary audio LLMs (arXiv:2512.06380v3) reveal substantial privacy leakage from bystander speech, showing that existing systems inadequately protect incidental speech captured during normal operation. This work is the first systematic attempt to quantify and address privacy risks in audio LLMs when multiple speakers are present.
Key facts
- Audio LLMs capture unintended bystander speech in real-world deployments
- SH-Bench is the first benchmark for evaluating selective hearing in audio LLMs
- Benchmark contains 3,968 multi-speaker audio mixtures
- Includes 77,000 multiple-choice questions testing general and selective modes
- Introduces the Selective Efficacy (SE) metric, jointly scoring multi-speaker comprehension and bystander privacy protection
- Evaluations show substantial bystander privacy leakage in current models
- Paper identifier: arXiv:2512.06380v3
- Benchmark includes both real-world and synthetic audio scenarios
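The summary does not give the exact definition of Selective Efficacy, but the idea of jointly scoring comprehension and privacy can be illustrated with a minimal sketch. The function below is hypothetical: it assumes SE-like behavior via a harmonic mean of target-speaker accuracy and bystander privacy protection (one minus a leakage rate), so that a model must do well on both to score well. The actual SE formula in the paper may differ.

```python
def selective_efficacy(target_accuracy: float, bystander_leakage: float) -> float:
    """Hypothetical SE-style score combining comprehension and privacy.

    target_accuracy:   fraction of target-speaker questions answered correctly
    bystander_leakage: fraction of bystander content the model reveals
    Illustrative only; not the paper's actual Selective Efficacy definition.
    """
    privacy_protection = 1.0 - bystander_leakage
    if target_accuracy + privacy_protection == 0.0:
        return 0.0
    # Harmonic mean: a high score requires BOTH strong comprehension
    # and strong bystander privacy protection.
    return (2 * target_accuracy * privacy_protection
            / (target_accuracy + privacy_protection))

# A model that comprehends well but leaks most bystander speech scores low:
print(selective_efficacy(0.9, 0.8))  # ≈ 0.33
# The same comprehension with no leakage scores high:
print(selective_efficacy(0.9, 0.0))  # ≈ 0.95
```

The harmonic-mean choice mirrors metrics like F1: it punishes imbalance, so a model cannot compensate for heavy bystander leakage with strong comprehension alone.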