ARTFEED — Contemporary Art Intelligence

S-SONDO: First Self-Supervised Knowledge Distillation for Audio Foundation Models

other · 2026-04-30

Researchers have introduced S-SONDO (Self-Supervised KnOwledge DistillatioN for General AuDio FOundation Models), a framework that distills general audio foundation models using only their output embeddings. Because the method is architecture-agnostic and requires neither logits nor layer-level alignment, it applies directly to embedding-based models such as self-supervised and metric-learning models. The work targets a practical problem: state-of-the-art audio models can contain hundreds of millions of parameters, making inference costly and deployment on edge devices difficult. Prior audio knowledge distillation techniques have focused on supervised settings, relying on class logits or architecture-specific alignment, and so cannot handle models that expose only embeddings. S-SONDO closes this gap, enabling model compression without access to a teacher's internal structure. The paper is available on arXiv under ID 2604.24933.
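To make the embedding-only idea concrete, here is a minimal sketch of a distillation loss that compares nothing but output embeddings. The cosine-distance objective and all names are illustrative assumptions, not the paper's actual loss; the point is that no logits or internal layers are needed.

```python
import numpy as np

def embedding_distillation_loss(student_emb, teacher_emb):
    """Cosine-distance loss between student and teacher output embeddings.

    Illustrative only: S-SONDO's exact objective is not reproduced here.
    The loss sees only the two embedding batches, so it works for any
    teacher architecture that emits embeddings.
    """
    s = student_emb / np.linalg.norm(student_emb, axis=-1, keepdims=True)
    t = teacher_emb / np.linalg.norm(teacher_emb, axis=-1, keepdims=True)
    # 1 minus the mean cosine similarity across the batch
    return float(1.0 - np.mean(np.sum(s * t, axis=-1)))

# Toy batch of 4 audio clips with 8-dim embeddings (dimensions invented).
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 8))
student = teacher.copy()  # a perfectly distilled student
print(embedding_distillation_loss(student, teacher))  # ≈ 0.0 for a perfect match
```

A mismatched student (e.g. sign-flipped embeddings) drives the loss toward its maximum of 2, so the objective gives a usable training signal from embeddings alone.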

Key facts

  • S-SONDO is the first framework for self-supervised knowledge distillation of general audio foundation models.
  • It uses only output embeddings, avoiding logits or layer-level alignment.
  • The framework is architecture-agnostic and applicable to embedding-based models.
  • State-of-the-art audio models often have hundreds of millions of parameters.
  • Prior audio knowledge distillation methods were limited to supervised settings.
  • S-SONDO enables model compression for edge device deployment.
  • The paper is published on arXiv with ID 2604.24933.
  • The approach works for self-supervised and metric-learning models.
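The compression claim above can be sketched end to end: freeze a "teacher" that produces embeddings, then fit a much smaller student to match them by gradient descent. Everything here (a linear student, MSE matching, the dimensions) is a hypothetical stand-in for the paper's actual student and objective.

```python
import numpy as np

# Hedged sketch: distill frozen teacher embeddings into a small linear
# student by gradient descent on an embedding-matching MSE. The teacher is
# a fixed random map standing in for a large audio foundation model.
rng = np.random.default_rng(1)

n, d_in, d_emb = 64, 16, 32
X = rng.normal(size=(n, d_in))            # toy input features for n clips
W_teacher = rng.normal(size=(d_in, d_emb))
T = X @ W_teacher                          # frozen teacher embeddings

W = np.zeros((d_in, d_emb))               # student parameters (far fewer
lr = 0.2                                  # than a real foundation model)
for _ in range(500):
    err = X @ W - T                        # student vs. teacher embeddings
    W -= lr * (2.0 * X.T @ err / n)        # gradient of mean squared error

final_loss = float(np.mean((X @ W - T) ** 2))
print(final_loss)  # near zero: student reproduces the teacher's embeddings
```

Only the teacher's outputs `T` are ever read, mirroring the article's point that no access to the teacher's internals is required.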

Entities

Institutions

  • arXiv

Sources