Wait-Think-Answer Control for Large Audio-Language Models

ai-technology · 2026-05-27

Researchers have proposed a learnable control formulation for Large Audio-Language Models (LALMs) that aims to improve how they reason and interact in real time during spoken conversation. The controller lets the model decide when to wait for more audio, when to give a brief reasoning update, and when to answer, even when the audio input is still incomplete. Using Qwen2.5-Omni-7B as the base model, they built aligned wait-think-answer sequences from spoken reasoning data. The controller was trained with supervised fine-tuning (SFT) and Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO). The reward combines answer quality with response delay.
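The wait, think, and answer decision described above can be pictured as a loop over incoming audio chunks. The following is a minimal sketch only, not the authors' implementation: the names Action, run_controller, and toy_policy are hypothetical, and the chunk-count rule stands in for the learned controller, whose actual inputs and outputs are not described here.

```python
from enum import Enum
from typing import Callable, List, Sequence, Tuple


class Action(Enum):
    WAIT = "wait"      # keep listening, emit nothing
    THINK = "think"    # emit a short reasoning update
    ANSWER = "answer"  # commit to a response and stop


def run_controller(
    chunks: Sequence[str],
    policy: Callable[[List[str]], Action],
) -> Tuple[List[Action], int]:
    """Feed chunks one at a time and stop at the first ANSWER.

    Returns the action trace and the number of chunks consumed.
    """
    heard: List[str] = []
    trace: List[Action] = []
    for chunk in chunks:
        heard.append(chunk)
        action = policy(heard)
        trace.append(action)
        if action is Action.ANSWER:
            break
    return trace, len(heard)


def toy_policy(heard: List[str]) -> Action:
    # Hypothetical rule standing in for the learned controller:
    # wait on the first chunk, think on the second, answer on the third.
    if len(heard) < 2:
        return Action.WAIT
    if len(heard) < 3:
        return Action.THINK
    return Action.ANSWER


trace, consumed = run_controller(["c1", "c2", "c3", "c4"], toy_policy)
print([a.value for a in trace], consumed)
```

In this toy run the controller answers after three of the four chunks, which illustrates responding before the audio is complete.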

Key facts

The approach is a learnable wait-think-answer control formulation.
It is designed for Large Audio-Language Models (LALMs).
The base model used is Qwen2.5-Omni-7B.
Training involved supervised fine-tuning (SFT) and DAPO.
The controller decides when to wait, reason, or answer.
The reward combines answer quality and response delay.
The approach is motivated by incremental human conversation.
The work is published on arXiv with ID 2605.27190.
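To illustrate how a reward might trade answer quality against response delay, here is a minimal sketch. The linear form, the function name combined_reward, and the delay_weight value are assumptions made for illustration; the summary above does not state the actual reward formula.

```python
def combined_reward(
    answer_quality: float,
    delay_seconds: float,
    delay_weight: float = 0.1,
) -> float:
    """Return quality minus a linear delay penalty (assumed form).

    answer_quality is assumed to lie in [0, 1]; delay_seconds is the
    time from the end of the relevant audio to the answer.
    """
    if not 0.0 <= answer_quality <= 1.0:
        raise ValueError("answer_quality must be in [0, 1]")
    if delay_seconds < 0.0:
        raise ValueError("delay_seconds must be non-negative")
    return answer_quality - delay_weight * delay_seconds


# A correct but slow answer can score below a correct, prompt one.
fast = combined_reward(answer_quality=1.0, delay_seconds=1.0)
slow = combined_reward(answer_quality=1.0, delay_seconds=4.0)
print(fast > slow)
```

Under this assumed form, a larger delay_weight pushes the controller toward answering earlier, while a smaller one tolerates more waiting and thinking.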
