Wait-Think-Answer Control for Large Audio-Language Models
Researchers have proposed a learnable control scheme for Large Audio-Language Models (LALMs) that aims to improve real-time reasoning and interaction in spoken conversation. The controller lets the model decide when to keep waiting for more audio, when to emit a brief reasoning update, and when to answer, even if the audio input is not yet complete. Using Qwen2.5-Omni-7B as the base model, the authors constructed aligned wait-think-answer sequences from spoken reasoning data. The controller was trained with supervised fine-tuning (SFT) and Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO), a reinforcement learning algorithm. The reward accounts for both answer quality and response delay.
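To make the wait-think-answer idea concrete, the following is a minimal sketch of such a control loop, not the authors' implementation. It assumes that audio arrives in chunks (represented here as text strings for simplicity) and that a policy returns one of three actions after each chunk. The names `Action`, `run_controller`, and `toy_policy` are hypothetical and exist only for illustration.

```python
from enum import Enum
from typing import Callable, Iterable, List, Tuple


class Action(Enum):
    WAIT = "wait"      # keep listening, emit nothing
    THINK = "think"    # emit a short reasoning update
    ANSWER = "answer"  # commit to a final response


# Hypothetical policy signature: (chunks heard so far, notes so far) -> (action, text)
Policy = Callable[[List[str], List[str]], Tuple[Action, str]]


def run_controller(chunks: Iterable[str], policy: Policy) -> Tuple[str, int, List[str]]:
    """Feed chunks to the policy one at a time until it answers.

    Returns (answer, number_of_chunks_consumed, reasoning_notes).
    The answer is an empty string if the input ends before the policy answers.
    """
    heard: List[str] = []
    notes: List[str] = []
    for chunk in chunks:
        heard.append(chunk)
        action, text = policy(heard, notes)
        if action is Action.THINK:
            notes.append(text)
        elif action is Action.ANSWER:
            return text, len(heard), notes
        # Action.WAIT: nothing to record, keep listening
    return "", len(heard), notes


def toy_policy(heard: List[str], notes: List[str]) -> Tuple[Action, str]:
    """Hand-written stand-in for a learned controller (illustrative only)."""
    if heard[-1].endswith("?"):
        return Action.ANSWER, "7"
    if len(heard) == 2:
        return Action.THINK, "two numbers so far"
    return Action.WAIT, ""


result = run_controller(["what is", "three plus", "four?"], toy_policy)
print(result)
```

In the described work the decision is made by the trained model itself rather than by a hand-written rule; the toy policy only shows where such a decision would plug in.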
Key facts
- The approach is a learnable wait-think-answer control formulation.
- It is designed for Large Audio-Language Models (LALMs).
- The base model used is Qwen2.5-Omni-7B.
- Training used supervised fine-tuning (SFT) and DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization).
- The controller decides when to wait, reason, or answer.
- The reward combines answer quality and response delay.
- The approach is motivated by incremental human conversation.
- The work is published on arXiv with ID 2605.27190.
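As an illustration of a reward that combines answer quality and response delay, here is a hedged sketch. The summary does not give the actual formula, so the linear form, the `delay_weight` and `max_delay_s` parameters, and the function name `combined_reward` are all assumptions made for this example.

```python
def combined_reward(
    quality: float,
    delay_s: float,
    delay_weight: float = 0.1,   # assumed trade-off coefficient
    max_delay_s: float = 10.0,   # assumed cap so the penalty stays bounded
) -> float:
    """Return quality minus a bounded, linearly growing delay penalty.

    quality: answer-quality score in [0, 1] (for example 1.0 for a correct answer).
    delay_s: seconds between the end of the needed audio and the answer.
    """
    if not 0.0 <= quality <= 1.0:
        raise ValueError("quality must be in [0, 1]")
    if delay_s < 0:
        raise ValueError("delay_s must be non-negative")
    # Normalize the delay to [0, 1], clipping very long delays at the cap
    delay_penalty = min(delay_s, max_delay_s) / max_delay_s
    return quality - delay_weight * delay_penalty


fast_correct = combined_reward(1.0, 0.5)
slow_correct = combined_reward(1.0, 8.0)
fast_wrong = combined_reward(0.0, 0.5)
print(fast_correct, slow_correct, fast_wrong)
```

With these assumed settings a correct but slow answer still scores above a fast but wrong one, while among correct answers the faster one is preferred. Other weightings would shift that trade-off.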
Entities
Institutions
- arXiv