ATIR: New Benchmark for Audio-Text Interleaved Retrieval

other · 2026-04-24

Researchers have introduced Audio-Text Interleaved contextual Retrieval (ATIR), a task in which queries can alternate between audio and text. They build a benchmark by integrating automatic speech recognition (ASR), question answering (QA), and retrieval datasets, unifying four types of contextual retrieval tasks. The work targets the limitations of existing audio retrieval datasets for semantic retrieval. The team evaluates several off-the-shelf retrievers and trains an ATIR model based on a Multimodal Large Language Model.

Key facts

ATIR stands for Audio-Text Interleaved contextual Retrieval.
Queries can alternate between audio and text modalities.
Benchmark integrates ASR, QA, and retrieval datasets.
Unifies four types of contextual retrieval tasks.
Addresses limitations of existing audio retrieval datasets.
Evaluates several off-the-shelf retrievers.
ATIR model is based on a Multimodal Large Language Model.
Published on arXiv with ID 2604.20267.
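
To make the idea of a query that alternates between audio and text more concrete, here is a minimal sketch of how such a query might be represented. This is an illustration only, not the benchmark's actual data format: the class names, the `modality_pattern` and `is_interleaved` helpers, and the audio file path are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Literal, Union


@dataclass
class TextSegment:
    # A span of the query given as written text.
    text: str
    modality: Literal["text"] = "text"


@dataclass
class AudioSegment:
    # A span of the query given as audio; the path is a made-up placeholder.
    path: str
    modality: Literal["audio"] = "audio"


Segment = Union[TextSegment, AudioSegment]


def modality_pattern(query: List[Segment]) -> List[str]:
    """Return the sequence of modalities, collapsing adjacent repeats."""
    pattern: List[str] = []
    for seg in query:
        if not pattern or pattern[-1] != seg.modality:
            pattern.append(seg.modality)
    return pattern


def is_interleaved(query: List[Segment]) -> bool:
    """Treat a query as interleaved if it contains both modalities."""
    return len(set(modality_pattern(query))) > 1


# Hypothetical query: text, then an audio clip, then text again.
query = [
    TextSegment("Which city is mentioned in"),
    AudioSegment("clips/question_part.wav"),
    TextSegment("and what is its population?"),
]
print(modality_pattern(query))
print(is_interleaved(query))
```

Keeping the segments in an ordered list preserves where each modality switch occurs, which a retriever for this kind of task would presumably need to see.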
