OneWM-VLA: Compressing Visual Stream to Single Token per Frame
Researchers propose OneWM-VLA, a method that compresses each video frame into a single semantic token for world-model-augmented vision-language-action (VLA) models. Existing approaches feed high-bandwidth visual streams into their world modules, leaving per-frame representation and the coupling between world prediction and action under-examined under constrained adaptation budgets. OneWM-VLA uses Adaptive Attention Pooling to reduce per-frame visual bandwidth to one token, and generates the latent visual stream and the action trajectory jointly under a single flow-matching objective. The authors report that this aggressive compression does not compromise performance.
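The summary gives no implementation details, so the following is a minimal PyTorch sketch of the kind of pooling the abstract describes: a single learned query cross-attends to a frame's patch tokens and emits one summary token. The module name, dimensions, and internal structure are illustrative assumptions, not the paper's exact Adaptive Attention Pooling.

```python
import torch
import torch.nn as nn

class OneTokenAttentionPool(nn.Module):
    """Compress a frame's patch tokens into one token via cross-attention
    from a single learned query (illustrative sketch; not the paper's
    exact Adaptive Attention Pooling module)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # One learned query vector that attends over all patch tokens.
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, dim), e.g. ViT encoder output.
        q = self.query.expand(patch_tokens.shape[0], -1, -1)  # (b, 1, dim)
        pooled, _ = self.attn(q, patch_tokens, patch_tokens)
        return self.norm(pooled)  # (b, 1, dim): one semantic token per frame

# Usage: pool 196 patch tokens (a 14x14 ViT grid) into a single token.
pool = OneTokenAttentionPool(dim=768)
tokens = pool(torch.randn(4, 196, 768))  # -> shape (4, 1, 768)
```

This is the standard Perceiver-style pooling pattern; the "adaptive" variant in the paper may condition the query differently.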
Key facts
- OneWM-VLA compresses each view into a single semantic token per frame.
- Uses Adaptive Attention Pooling for compression.
- Employs a single flow-matching objective over the latent stream and the action trajectory (see the sketch after this list).
- Addresses limitations of existing world-model-augmented VLAs.
- Reduces per-frame visual bandwidth without compromising performance.
- Published on arXiv with ID 2605.07931.
- Focuses on vision-language-action models.
- Aims to improve long-horizon planning.
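The summary names a single flow-matching objective over the latent stream and action trajectory but gives no formulation. Below is a generic conditional flow-matching loss over a concatenated [latent-stream, action-chunk] target; the velocity network, all shapes, and the linear interpolation path are assumptions for illustration, not the paper's objective.

```python
import torch
import torch.nn as nn

def flow_matching_loss(velocity_net: nn.Module,
                       target: torch.Tensor,
                       cond: torch.Tensor) -> torch.Tensor:
    """Generic conditional flow-matching loss (assumed form). `target`
    concatenates the future latent stream and the action trajectory;
    `cond` stands in for the VLA backbone's context."""
    b = target.shape[0]
    t = torch.rand(b, device=target.device)       # flow time in [0, 1]
    t_ = t.view(b, 1, 1)
    noise = torch.randn_like(target)
    x_t = (1.0 - t_) * noise + t_ * target        # linear interpolation path
    v_true = target - noise                       # ground-truth velocity
    v_pred = velocity_net(x_t, t, cond)
    return ((v_pred - v_true) ** 2).mean()

class TinyVelocityNet(nn.Module):
    """Placeholder velocity field: an MLP over [x_t, t, pooled cond]."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim + 1, 2 * dim),
                                 nn.GELU(),
                                 nn.Linear(2 * dim, dim))

    def forward(self, x_t, t, cond):
        b, n, d = x_t.shape
        # Broadcast time and mean-pooled context to every token position.
        t_feat = t.view(b, 1, 1).expand(b, n, 1)
        c_feat = cond.mean(dim=1, keepdim=True).expand(b, n, d)
        return self.mlp(torch.cat([x_t, t_feat, c_feat], dim=-1))

# Hypothetical shapes: 8 one-token future frames plus a 16-step action
# chunk, both projected to dim 768 and concatenated along the sequence axis.
net = TinyVelocityNet()
target = torch.randn(4, 8 + 16, 768)
cond = torch.randn(4, 32, 768)
loss = flow_matching_loss(net, target, cond)
loss.backward()
```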