ReasonAudio Benchmark Tests Text-Audio Retrieval Reasoning

other · 2026-05-07

Researchers have developed ReasonAudio, a benchmark that assesses Text-Audio Retrieval models on reasoning rather than semantic matching alone. It comprises 1,000 queries and 10,000 composite audio clips, organized into five reasoning tasks: Negation, Order, Overlap, Duration, and Mix. The tasks require understanding negation, recognizing temporal order, identifying concurrent events, and discriminating between durations. An evaluation of ten state-of-the-art models found that all of them struggled with these tasks, pointing to a clear gap in current audio retrieval systems. The paper is available on arXiv under ID 2605.03361.

Key facts

ReasonAudio is the first reasoning-intensive benchmark for Text-Audio Retrieval.
It includes 1,000 queries and 10,000 composite audio clips.
Five reasoning tasks: Negation, Order, Overlap, Duration, and Mix.
Tasks require negation understanding, temporal ordering, concurrent event recognition, and duration discrimination.
Ten state-of-the-art models were evaluated and all struggled.
Published on arXiv with ID 2605.03361.
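
To illustrate why matching on content alone falls short for tasks like Negation and Order, here is a toy sketch in Python. It is not taken from ReasonAudio: the word-overlap scorer, the queries, and the clip descriptions are all hypothetical, and real retrieval models use learned audio and text embeddings rather than word sets. The sketch only shows the kind of failure such a benchmark is meant to expose.

```python
def bag_of_words_score(query: str, caption: str) -> float:
    """Jaccard overlap between the word sets of a query and a clip description."""
    q = set(query.lower().split())
    c = set(caption.lower().split())
    return len(q & c) / len(q | c)


# Hypothetical examples, not drawn from the benchmark.
# Negation: the query asks for a clip that does NOT contain rain.
negation_query = "dog barking without rain"
clip_with_rain = "dog barking and rain"
clip_without_rain = "dog barking"

# Order: the query asks for the siren to come first.
order_query = "siren then dog barking"
clip_siren_first = "siren then dog barking"
clip_dog_first = "dog barking then siren"

# The clip that violates the negation shares more words with the query,
# so an order-blind, negation-blind scorer ranks it higher.
print(bag_of_words_score(negation_query, clip_with_rain))
print(bag_of_words_score(negation_query, clip_without_rain))

# Both orderings contain exactly the same words, so they tie.
print(bag_of_words_score(order_query, clip_siren_first))
print(bag_of_words_score(order_query, clip_dog_first))
```

In this toy setup the clip containing rain outscores the correct clip for the negation query, and the two orderings are indistinguishable, which is why a benchmark needs queries that cannot be answered by matching event names alone.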
