ARTFEED — Contemporary Art Intelligence

LoSATok: Low-Dimensional Audio Tokenizer for Cross-Domain Understanding and Generation

ai-technology · 2026-05-28

Researchers propose LoSATok, a low-dimensional audio tokenizer designed to unify audio understanding and generation across domains. Traditional unified tokenizers encode both semantic and acoustic details in high-dimensional continuous latents, which increases the modeling burden for Diffusion Transformers (DiTs). Motivated by the observation that high-dimensional semantic features are compressible, LoSATok introduces a Semantic Bottleneck that compresses 1280-dimensional semantic encoder features into 128 dimensions (a tenfold reduction), regularized by a time-relation loss that encourages temporal feature consistency. A dual-level semantic supervision method draws on both high- and low-dimensional semantic signals, so that a single compact latent space captures semantics and acoustic details together. The work is published on arXiv under ID 2605.27840.
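The bottleneck and its regularizer can be illustrated with a rough sketch. Only the 1280 and 128 dimensions come from the summary above; everything else is an assumption made for illustration. In particular, the random linear projection stands in for a learned bottleneck, and the loss shown here (matching frame-to-frame cosine similarities before and after compression) is one plausible reading of "time-relation loss", not the formulation used by the LoSATok authors. All function names are hypothetical, and plain Python is used to keep the example dependency-free.

```python
import math
import random

HIGH_DIM = 1280  # semantic encoder feature size (from the article)
LOW_DIM = 128    # bottleneck size (from the article)


def make_projection(in_dim, out_dim, seed=0):
    """Random linear map standing in for a learned bottleneck (assumption)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.gauss(0.0, scale) for _ in range(out_dim)] for _ in range(in_dim)]


def project(frames, weights):
    """Map each frame of length in_dim to a vector of length out_dim."""
    columns = list(zip(*weights))  # out_dim columns, each of length in_dim
    return [[sum(x * w for x, w in zip(frame, col)) for col in columns]
            for frame in frames]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def relation_matrix(frames):
    """T x T matrix of cosine similarities between time steps."""
    return [[cosine(a, b) for b in frames] for a in frames]


def time_relation_loss(high, low):
    """Mean squared gap between the two frame-to-frame similarity matrices.

    Hypothetical form: the article names the loss but does not define it.
    """
    r_high = relation_matrix(high)
    r_low = relation_matrix(low)
    t = len(high)
    total = sum((r_high[i][j] - r_low[i][j]) ** 2
                for i in range(t) for j in range(t))
    return total / (t * t)


# Six synthetic frames of 1280-dimensional "semantic" features.
rng = random.Random(1)
high_feats = [[rng.gauss(0.0, 1.0) for _ in range(HIGH_DIM)] for _ in range(6)]

# Compress to 128 dimensions and measure how well temporal relations survive.
low_feats = project(high_feats, make_projection(HIGH_DIM, LOW_DIM))
loss = time_relation_loss(high_feats, low_feats)
print(len(low_feats), len(low_feats[0]), round(loss, 4))
```

Under this reading, the loss is zero when the compressed frames relate to each other over time exactly as the original frames do, and it grows as the bottleneck distorts those relations. A trained bottleneck would minimize such a term alongside the tokenizer's other objectives.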

Key facts

  • LoSATok is a low-dimensional audio tokenizer for cross-domain audio understanding and generation.
  • It compresses 1280-dimensional semantic encoder features into 128 dimensions using a Semantic Bottleneck.
  • A time-relation loss regularizes the bottleneck to keep features temporally consistent.
  • Dual-level semantic supervision uses high- and low-dimensional semantic signals.
  • The tokenizer aims to reduce the modeling burden on Diffusion Transformers (DiTs).
  • The work is published on arXiv with ID 2605.27840.
  • The approach is based on the observation that high-dimensional semantic features are compressible.
  • LoSATok unifies audio understanding and generation in a compact latent space.

Entities

Institutions

  • arXiv
