ARTFEED — Contemporary Art Intelligence

AudioMosaic: Contrastive Audio SSL with Masked Spectrograms

ai-technology · 2026-05-16

Researchers introduce AudioMosaic, a contrastive self-supervised learning (SSL) method for audio representation learning. It constructs positive pairs by applying structured time-frequency masking to spectrogram patches, which enables efficient large-batch training. The encoder learns discriminative utterance-level representations that transfer well across datasets, domains, and acoustic conditions, and it outperforms generative approaches in the reported experiments.
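To illustrate how structured time-frequency masking can yield a positive pair, here is a minimal sketch. The summary does not specify the paper's exact masking scheme, so this assumes one plausible variant: the spectrogram is treated as a grid of patches, and each view drops whole frequency rows and whole time columns of that grid. The function names, grid size, and drop counts are all hypothetical.

```python
import random

def structured_mask(n_freq, n_time, freq_drop, time_drop, rng):
    """Return the (freq, time) indices of patches kept after dropping
    whole frequency rows and whole time columns of the patch grid.
    Assumed scheme for illustration, not the paper's confirmed method."""
    dropped_rows = set(rng.sample(range(n_freq), freq_drop))
    dropped_cols = set(rng.sample(range(n_time), time_drop))
    return [
        (f, t)
        for f in range(n_freq)
        for t in range(n_time)
        if f not in dropped_rows and t not in dropped_cols
    ]

def make_positive_pair(n_freq, n_time, freq_drop, time_drop, seed=0):
    """Draw two independent masks over the same clip's patch grid.
    The two sets of visible patches form one positive pair."""
    rng = random.Random(seed)
    view_a = structured_mask(n_freq, n_time, freq_drop, time_drop, rng)
    view_b = structured_mask(n_freq, n_time, freq_drop, time_drop, rng)
    return view_a, view_b

# Hypothetical 8 x 64 patch grid; each view drops half the rows and columns.
view_a, view_b = make_positive_pair(n_freq=8, n_time=64, freq_drop=4, time_drop=32)
print(len(view_a), len(view_b))  # each view keeps (8-4)*(64-32) = 128 of 512 patches
```

In this sketch each view passes only a quarter of the patches to the encoder. If only visible patches are encoded, that is one way masking could lower memory use per example and leave room for larger batches.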

Key facts

  • AudioMosaic is a contrastive learning-based audio encoder for general audio understanding.
  • It uses structured time-frequency masking on spectrogram patches to create positive pairs.
  • The method reduces memory usage and enables efficient large-batch training.
  • It learns more discriminative utterance-level representations than generative approaches.
  • Representations show strong transferability across datasets, domains, and acoustic conditions.
  • Extensive experiments demonstrate its effectiveness.
  • The paper is available on arXiv under ID 2605.14231.
  • The approach addresses challenges in contrastive audio SSL such as augmentation design and batch size.
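The role of batch size in contrastive training can be made concrete with a small loss sketch. The summary does not state which objective AudioMosaic uses. The example below assumes the widely used InfoNCE loss, in which each clip's second view is the positive and the other clips in the batch serve as negatives. The function names and the temperature value are illustrative.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def info_nce(emb_a, emb_b, temperature=0.1):
    """One-directional InfoNCE over a batch of paired embeddings.
    emb_a[i] and emb_b[i] come from two views of the same clip;
    every emb_b[j] with j != i acts as a negative for emb_a[i]."""
    a = [l2_normalize(v) for v in emb_a]
    b = [l2_normalize(v) for v in emb_b]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, bj)) / temperature for bj in b]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)

aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)  # True: matched views give the lower loss
```

Each example is scored against every other item in the batch, so a larger batch supplies more negatives per step. That is why memory-efficient view construction matters for this family of methods.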

Entities

Institutions

  • arXiv
