EchoDistill: New Framework to Make Audio LLMs Robust to Noise
Researchers have proposed EchoDistill, a self-distillation framework designed to make Audio Large Language Models (ALLMs) more robust to real-world noise. ALLMs are prone to semantic drift and hallucinations when their input audio is noisy. Existing solutions rely on waveform-level enhancement, answer-level supervision, or internal noise suppression. EchoDistill instead takes an alignment-based, noisy-to-clean approach: a frozen teacher that receives the clean audio guides a student that receives the noisy version during training. The student samples candidate responses under noisy conditions, and these trajectories are optimized with group-relative policy optimization (GRPO), using token-level consistency with the teacher as a reward. The method also incorporates audio-aware reward shaping. The framework is described in a paper posted on arXiv (ID: 2605.23954).
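The group-relative step described above can be illustrated with a small sketch. It is not the paper's implementation: the position-wise token match used as the consistency reward, the function names `token_consistency` and `group_relative_advantages`, and the toy token lists are all assumptions made for illustration. The sketch only shows the general GRPO idea of scoring each sampled response against the mean and spread of its own group.

```python
from statistics import mean, pstdev


def token_consistency(student_tokens, teacher_tokens):
    """Hypothetical reward: fraction of positions where the student's token
    matches the teacher's, normalized by the longer sequence length."""
    longest = max(len(student_tokens), len(teacher_tokens))
    if longest == 0:
        return 1.0  # two empty responses are trivially consistent
    matches = sum(s == t for s, t in zip(student_tokens, teacher_tokens))
    return matches / longest


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: each reward is centered on the group mean and
    scaled by the group standard deviation (population std is assumed here)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy data: one teacher response on clean audio, four student samples on noisy audio.
teacher = ["the", "dog", "barks", "twice"]
candidates = [
    ["the", "dog", "barks", "twice"],
    ["the", "dog", "barks", "once"],
    ["a", "cat", "meows"],
    ["the", "dog", "sleeps", "twice"],
]

rewards = [token_consistency(c, teacher) for c in candidates]
advantages = group_relative_advantages(rewards)

for tokens, reward, adv in zip(candidates, rewards, advantages):
    print(" ".join(tokens), round(reward, 2), round(adv, 2))
```

In this toy group, the sample that reproduces the teacher's response gets a positive advantage and the sample that drifts to a different meaning gets a negative one, so a policy update would push the noisy-audio student toward the clean-audio behavior. A real system would more plausibly score consistency with the teacher's token probabilities rather than exact string matches.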
Key facts
- EchoDistill is a noisy-to-clean self-distillation framework for Audio LLMs.
- It addresses vulnerability to real-world noise causing semantic drift and hallucinations.
- Existing methods use waveform-level enhancement, answer-level supervision, or internal noise suppression.
- EchoDistill uses a frozen clean-audio teacher to guide a noisy-audio student.
- The student samples candidate responses under noisy conditions.
- Optimization uses group-relative policy optimization (GRPO).
- Token-level consistency with the teacher acts as a reward bonus.
- The paper is available on arXiv with ID 2605.23954.
Entities
Institutions
- arXiv