EchoDistill: New Framework to Make Audio LLMs Robust to Noise

ai-technology · 2026-05-26

Researchers have proposed EchoDistill, a self-distillation framework designed to improve the robustness of Audio Large Language Models (ALLMs) against real-world noise. ALLMs are known to suffer from semantic drift and hallucinations when exposed to noisy environments. Existing solutions rely on waveform-level enhancement, answer-level supervision, or internal noise suppression. EchoDistill instead takes an alignment-based noisy-to-clean approach: a frozen teacher that receives clean audio guides a student that receives noisy audio during training. The student samples candidate responses under noisy conditions, and these trajectories are optimized via group-relative policy optimization (GRPO), with token-level consistency with the teacher serving as a reward. The method also incorporates audio-aware reward shaping. The framework was detailed in a paper published on arXiv (ID: 2605.23954).
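
The reward-and-advantage step described above can be illustrated with a minimal Python sketch. It is not the paper's implementation. The function names are hypothetical, and positional exact match stands in for "token-level consistency", since the summary does not say how the paper measures it. The audio-aware reward shaping is left out because no details are given. The group standardization follows the usual GRPO recipe of comparing each sampled response with the mean and standard deviation of its own group.

```python
from statistics import mean, pstdev


def token_consistency(student_tokens, teacher_tokens):
    """Fraction of positions where the student token equals the teacher token.

    Assumption: exact positional match is used as a simple proxy for
    token-level consistency; the paper's actual measure is not specified here.
    """
    if not student_tokens and not teacher_tokens:
        return 1.0
    longest = max(len(student_tokens), len(teacher_tokens))
    matches = sum(s == t for s, t in zip(student_tokens, teacher_tokens))
    return matches / longest


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical teacher output produced from the clean audio.
teacher = ["the", "dog", "barks", "twice"]

# Hypothetical student candidates sampled from the noisy version of the same clip.
candidates = [
    ["the", "dog", "barks", "twice"],
    ["the", "dog", "barks", "once"],
    ["a", "door", "slams", "shut"],
]

rewards = [token_consistency(c, teacher) for c in candidates]
advantages = group_relative_advantages(rewards)

for cand, r, a in zip(candidates, rewards, advantages):
    print(" ".join(cand), f"reward={r:.2f}", f"advantage={a:+.2f}")
```

In this toy group, the candidate that drifts furthest from the teacher's response gets a negative advantage, so a policy-gradient update would push the student away from it and toward the candidates that agree with the clean-audio teacher.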

Key facts

EchoDistill is a noisy-to-clean self-distillation framework for Audio LLMs.
It addresses vulnerability to real-world noise causing semantic drift and hallucinations.
Existing methods use waveform-level enhancement, answer-level supervision, or noise suppression.
EchoDistill uses a frozen clean-audio teacher to guide a noisy-audio student.
The student samples candidate responses under noisy conditions.
Optimization uses group-relative policy optimization (GRPO).
Token-level consistency with the teacher acts as a reward bonus.
The paper is available on arXiv with ID 2605.23954.
