ARTFEED — Contemporary Art Intelligence

Wait-Think-Answer Control for Large Audio-Language Models

ai-technology · 2026-05-27

Researchers have proposed a learnable control formulation for Large Audio-Language Models (LALMs) that aims to improve real-time reasoning and interaction in spoken conversation. The controller lets the model decide when to wait for more audio, when to emit a brief reasoning update, and when to answer, even before the audio input is complete. Starting from Qwen2.5-Omni-7B, the authors constructed aligned wait-think-answer sequences from spoken reasoning data. They trained the controller with supervised fine-tuning (SFT) and with Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO), a reinforcement learning method. The reward accounts for both answer quality and response delay.
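
To make the wait-think-answer idea concrete, here is a minimal sketch of a control loop over incoming audio chunks. It is an illustration only, not the authors' implementation: the names Action, run_controller, and toy_policy are hypothetical, text strings stand in for audio chunks, and the hand-written rule stands in for the learned controller.

```python
from enum import Enum
from typing import Callable, List, Tuple


class Action(Enum):
    WAIT = "wait"      # keep listening, emit nothing
    THINK = "think"    # emit a short intermediate reasoning update
    ANSWER = "answer"  # commit to a final response


def run_controller(
    chunks: List[str],
    policy: Callable[[List[str]], Action],
) -> Tuple[List[Action], int]:
    """Feed chunks one at a time and return (action trace, chunks consumed).

    The loop stops as soon as the policy chooses ANSWER. If the input ends
    first, an ANSWER is forced so the user always gets a response.
    """
    heard: List[str] = []
    trace: List[Action] = []
    for chunk in chunks:
        heard.append(chunk)
        action = policy(heard)
        trace.append(action)
        if action is Action.ANSWER:
            return trace, len(heard)
    trace.append(Action.ANSWER)  # input exhausted: forced answer
    return trace, len(heard)


def toy_policy(heard: List[str]) -> Action:
    """Hypothetical stand-in for the learned controller (assumption)."""
    if heard[-1].endswith("?"):
        return Action.ANSWER  # the question looks complete
    if len(heard) % 2 == 0:
        return Action.THINK   # periodically reason over the partial input
    return Action.WAIT


trace, consumed = run_controller(["what is", "two plus", "two?"], toy_policy)
print([a.value for a in trace], consumed)
```

In the described system the decision would come from the fine-tuned model itself rather than from a fixed rule; the sketch only shows the shape of the decision loop.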

Key facts

  • The control formulation is learnable and organized around wait, think, and answer actions.
  • It is designed for Large Audio-Language Models (LALMs).
  • The base model used is Qwen2.5-Omni-7B.
  • Training involved supervised fine-tuning (SFT) and DAPO.
  • The controller decides when to wait, reason, or answer.
  • The reward combines answer quality and response delay.
  • The approach is motivated by incremental human conversation.
  • The work is published on arXiv with ID 2605.27190.
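
The summary says only that the reward combines answer quality and response delay; it does not give the formula. As one possible reading, the sketch below subtracts a linear delay penalty from a quality score. The function name combined_reward, the linear form, and the delay_weight value are all assumptions made for illustration.

```python
def combined_reward(
    answer_quality: float,
    delay_seconds: float,
    delay_weight: float = 0.1,
) -> float:
    """Hypothetical reward: quality minus a linear penalty on response delay.

    answer_quality: score for the final answer (e.g. 1.0 correct, 0.0 wrong).
    delay_seconds:  time between the end of the user's speech and the answer.
    delay_weight:   assumed trade-off coefficient, not taken from the paper.
    """
    if delay_seconds < 0:
        raise ValueError("delay_seconds must be non-negative")
    return answer_quality - delay_weight * delay_seconds


# A correct but slow answer can score lower than a correct fast one.
fast = combined_reward(answer_quality=1.0, delay_seconds=0.5)
slow = combined_reward(answer_quality=1.0, delay_seconds=4.0)
print(fast > slow)
```

Under this assumed form, raising delay_weight pushes a policy toward answering earlier, while lowering it favors waiting and thinking longer before committing.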

Entities

Institutions

  • arXiv
