LoSATok: Low-Dimensional Audio Tokenizer for Cross-Domain Understanding and Generation
Researchers propose LoSATok, a low-dimensional audio tokenizer designed to unify audio understanding and generation across domains. Traditional unified tokenizers encode both semantic and acoustic details in high-dimensional continuous latents, which increases the modeling burden for Diffusion Transformers (DiTs). Motivated by the observation that high-dimensional semantic features are compressible, LoSATok introduces a Semantic Bottleneck that compresses 1280-dimensional semantic encoder features into 128 dimensions, a tenfold reduction, regularized by a time-relation loss that encourages temporal feature consistency. A dual-level semantic supervision method draws on both high- and low-dimensional semantic signals, so that semantics and acoustic details are captured jointly within a compact latent space. The work is published on arXiv under ID 2605.27840.
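The bottleneck and its regularizer can be illustrated with a minimal sketch. The summary does not give the exact form of the time-relation loss, so the version here is an assumption: it compares frame-to-frame cosine-similarity matrices computed before and after compression. The linear projection, the function names (`project`, `time_relation_matrix`, `time_relation_loss`), the frame count, and the random inputs are all hypothetical and are not taken from the paper.

```python
import numpy as np


def project(features, weight):
    """Hypothetical linear bottleneck: (T, 1280) -> (T, 128)."""
    return features @ weight


def time_relation_matrix(x):
    """Cosine similarity between every pair of frames: (T, D) -> (T, T)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.maximum(norms, 1e-8)  # guard against zero-length frames
    return unit @ unit.T


def time_relation_loss(high, low):
    """Assumed form: mean squared gap between the two similarity matrices."""
    diff = time_relation_matrix(high) - time_relation_matrix(low)
    return float(np.mean(diff ** 2))


# Random stand-ins for semantic encoder features (not real audio features).
rng = np.random.default_rng(0)
T, D_HIGH, D_LOW = 50, 1280, 128
high = rng.standard_normal((T, D_HIGH))

# An untrained random projection; in training this weight would be learned.
weight = rng.standard_normal((D_HIGH, D_LOW)) / np.sqrt(D_HIGH)
low = project(high, weight)

loss = time_relation_loss(high, low)
print(low.shape, loss)
```

Under this assumed formulation, the loss is zero when the compressed frames relate to each other exactly as the original frames do, so minimizing it would push the 128-dimensional latents to keep the temporal structure of the 1280-dimensional features.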
Key facts
- LoSATok is a low-dimensional audio tokenizer for cross-domain audio understanding and generation.
- It compresses 1280-dimensional semantic encoder features into 128 dimensions using a Semantic Bottleneck.
- A time-relation loss regularizes temporal feature consistency.
- Dual-level semantic supervision uses high- and low-dimensional semantic signals.
- The tokenizer aims to reduce the modeling burden on Diffusion Transformers (DiTs).
- The work is published on arXiv with ID 2605.27840.
- The approach is based on the observation that high-dimensional semantic features are compressible.
- LoSATok unifies audio understanding and generation in a compact latent space.
Entities
Institutions
- arXiv