LoSATok: Low-Dimensional Audio Tokenizer for Cross-Domain Understanding and Generation

ai-technology · 2026-05-28

Researchers propose LoSATok, a low-dimensional audio tokenizer designed to unify audio understanding and generation across domains. Traditional unified tokenizers encode both semantic and acoustic details in high-dimensional continuous latents, which increases the modeling burden for Diffusion Transformers (DiTs). Motivated by the observation that high-dimensional semantic features are compressible, LoSATok introduces a Semantic Bottleneck that compresses 1280-dimensional semantic encoder features into 128 dimensions, a tenfold reduction, regularized by a time-relation loss for temporal feature consistency. A dual-level semantic supervision method leverages both high- and low-dimensional semantic signals, so that semantics and acoustic details are captured jointly within a compact latent space. The work is available on arXiv under ID 2605.27840.
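To make the compression step concrete, here is a minimal NumPy sketch of a 1280-to-128 projection paired with one plausible form of temporal regularizer. Only the two dimensions come from the summary above. The linear projection, the cosine-similarity formulation of the loss, and every name in the code (semantic_bottleneck, time_relation_loss, and so on) are illustrative assumptions, not the paper's actual architecture or loss definition.

```python
import numpy as np

# Hypothetical sketch: the summary does not specify the bottleneck's
# layers or the exact form of the time-relation loss.

def semantic_bottleneck(features, weight, bias):
    """Project (T, 1280) frame features to (T, 128) with a linear map."""
    return features @ weight + bias

def cosine_similarity_matrix(x, eps=1e-8):
    """Return the (T, T) matrix of cosine similarities between frames."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.maximum(norms, eps)
    return unit @ unit.T

def time_relation_loss(high, low):
    """Mean squared gap between the frame-to-frame similarity structure
    before and after compression (an assumed formulation)."""
    diff = cosine_similarity_matrix(high) - cosine_similarity_matrix(low)
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
T, D_HIGH, D_LOW = 50, 1280, 128  # T (frame count) is arbitrary

# Random stand-ins for semantic encoder outputs and bottleneck weights.
high = rng.standard_normal((T, D_HIGH))
weight = rng.standard_normal((D_HIGH, D_LOW)) / np.sqrt(D_HIGH)
bias = np.zeros(D_LOW)

low = semantic_bottleneck(high, weight, bias)
loss = time_relation_loss(high, low)
print(low.shape, loss)
```

Under this assumed formulation, the loss is zero when the compressed frames relate to one another exactly as the original frames do, which is one way to read "temporal feature consistency". In training, the projection weights would be learned rather than sampled at random.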

Key facts

LoSATok is a low-dimensional audio tokenizer for cross-domain audio understanding and generation.
It compresses 1280-dimensional semantic encoder features into 128 dimensions using a Semantic Bottleneck.
A time-relation loss regularizes temporal feature consistency.
Dual-level semantic supervision uses high- and low-dimensional semantic signals.
The tokenizer aims to reduce the modeling burden on Diffusion Transformers (DiTs).
The work is published on arXiv with ID 2605.27840.
The approach is based on the observation that high-dimensional semantic features are compressible.
LoSATok unifies audio understanding and generation in a compact latent space.
