AudioMosaic: Contrastive Audio SSL with Masked Spectrograms

ai-technology · 2026-05-16

Researchers introduce AudioMosaic, a contrastive self-supervised learning method for audio representation learning. It constructs positive pairs by applying structured time-frequency masking to spectrogram patches, which reduces memory usage and makes large-batch training efficient. The encoder learns discriminative utterance-level representations that transfer well across datasets, domains, and acoustic conditions, and it outperforms generative approaches in the reported experiments.

Key facts

AudioMosaic is a contrastive learning-based audio encoder for general audio understanding.
It uses structured time-frequency masking on spectrogram patches to create positive pairs.
The method reduces memory usage and enables efficient large-batch training.
It learns more discriminative utterance-level representations than generative approaches.
Representations show strong transferability across datasets, domains, and acoustic conditions.
Extensive experiments demonstrate its effectiveness.
The paper is available on arXiv under ID 2605.14231.
The approach addresses challenges in contrastive audio SSL such as augmentation design and batch size.
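
The idea behind the key facts above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the summary does not specify the masking layout, the encoder, or the loss, so everything here is an assumption. Two masked views of the same clip are built by dropping whole rows (frequency bands) and columns (time spans) of a patch grid, only the visible patches are kept (one plausible way masking could save memory), and the views are compared with a standard InfoNCE loss. All names (`patchify`, `structured_mask`, `make_view`, `info_nce`), the patch size, and the mean-pool-plus-projection "encoder" are hypothetical.

```python
import numpy as np

def patchify(spec, patch):
    """Split a (freq, time) spectrogram into a grid of flattened square patches."""
    f, t = spec.shape
    gf, gt = f // patch, t // patch
    spec = spec[:gf * patch, :gt * patch]          # crop to a whole number of patches
    grid = spec.reshape(gf, patch, gt, patch).transpose(0, 2, 1, 3)
    return grid.reshape(gf, gt, patch * patch)     # (grid_freq, grid_time, patch_dim)

def structured_mask(gf, gt, n_freq, n_time, rng):
    """Keep-mask over the patch grid that drops whole frequency rows and time columns."""
    keep = np.ones((gf, gt), dtype=bool)
    keep[rng.choice(gf, size=n_freq, replace=False), :] = False
    keep[:, rng.choice(gt, size=n_time, replace=False)] = False
    return keep

def make_view(patches, n_freq, n_time, rng):
    """Return only the visible patches of one randomly masked view."""
    gf, gt, _ = patches.shape
    keep = structured_mask(gf, gt, n_freq, n_time, rng)
    return patches[keep]                           # (num_visible, patch_dim)

def info_nce(za, zb, temperature=0.1):
    """InfoNCE loss: row i of za should match row i of zb and no other row."""
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

# Toy usage: random data stands in for log-mel spectrograms (assumption).
rng = np.random.default_rng(0)
batch = rng.standard_normal((8, 64, 96))           # 8 clips, 64 freq bins, 96 frames
proj = rng.standard_normal((16 * 16, 32))          # stand-in linear "encoder"

def embed(spec):
    # Mean-pool the visible patches, then project: a placeholder for a real encoder.
    view = make_view(patchify(spec, 16), n_freq=1, n_time=2, rng=rng)
    return view.mean(axis=0) @ proj

za = np.stack([embed(s) for s in batch])           # first masked view of each clip
zb = np.stack([embed(s) for s in batch])           # second, independently masked view
loss = info_nce(za, zb)
```

In this sketch each view keeps 12 of the 24 patches, so an encoder that processes only visible patches would see half the input per view; whether AudioMosaic obtains its memory savings this way is not stated in the summary.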
