OneWM-VLA: Compressing Visual Stream to Single Token per Frame
Researchers propose OneWM-VLA, a method that compresses each video frame into a single semantic token for world-model-augmented vision-language-action (VLA) models. Existing approaches feed high-bandwidth visual streams into their world modules, leaving per-frame representation and the coupling between world prediction and action under-examined under constrained adaptation budgets. OneWM-VLA uses Adaptive Attention Pooling to reduce per-frame visual bandwidth to one token, and generates the latent visual stream and the action trajectory jointly under a single flow-matching objective. The authors report that this aggressive compression does not compromise performance.
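The summary gives no implementation details, so the following is a minimal PyTorch sketch of the kind of pooling the abstract describes: a single learned query cross-attends to a frame's patch tokens and emits one summary token. The module name, dimensions, and internal structure are illustrative assumptions, not the paper's exact Adaptive Attention Pooling.

```python
import torch
import torch.nn as nn

class OneTokenAttentionPool(nn.Module):
    """Compress a frame's patch tokens into one token via cross-attention
    from a single learned query (illustrative sketch; not the paper's
    exact Adaptive Attention Pooling module)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # One learned query vector that attends over all patch tokens.
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, dim), e.g. ViT encoder output.
        q = self.query.expand(patch_tokens.shape[0], -1, -1)  # (b, 1, dim)
        pooled, _ = self.attn(q, patch_tokens, patch_tokens)
        return self.norm(pooled)  # (b, 1, dim): one semantic token per frame

# Usage: pool 196 patch tokens (a 14x14 ViT grid) into a single token.
pool = OneTokenAttentionPool(dim=768)
tokens = pool(torch.randn(4, 196, 768))  # -> shape (4, 1, 768)
```

This is the standard Perceiver-style pooling pattern; the "adaptive" variant in the paper may condition the query differently.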
Key facts
- OneWM-VLA compresses each view into a single semantic token per frame.
- Uses Adaptive Attention Pooling for compression.
- Employs a single flow-matching objective over the latent stream and the action trajectory (see the sketch after this list).
- Addresses limitations of existing world-model-augmented VLAs.
- Reduces per-frame visual bandwidth without compromising performance.
- Published on arXiv with ID 2605.07931.
- Focuses on vision-language-action models.
- Aims to improve long-horizon planning.
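The summary names a single flow-matching objective over the latent stream and action trajectory but gives no formulation. Below is a generic conditional flow-matching loss over a concatenated [latent-stream, action-chunk] target; the velocity network, all shapes, and the linear interpolation path are assumptions for illustration, not the paper's objective.

```python
import torch
import torch.nn as nn

def flow_matching_loss(velocity_net: nn.Module,
                       target: torch.Tensor,
                       cond: torch.Tensor) -> torch.Tensor:
    """Generic conditional flow-matching loss (assumed form). `target`
    concatenates the future latent stream and the action trajectory;
    `cond` stands in for the VLA backbone's context."""
    b = target.shape[0]
    t = torch.rand(b, device=target.device)       # flow time in [0, 1]
    t_ = t.view(b, 1, 1)
    noise = torch.randn_like(target)
    x_t = (1.0 - t_) * noise + t_ * target        # linear interpolation path
    v_true = target - noise                       # ground-truth velocity
    v_pred = velocity_net(x_t, t, cond)
    return ((v_pred - v_true) ** 2).mean()

class TinyVelocityNet(nn.Module):
    """Placeholder velocity field: an MLP over [x_t, t, pooled cond]."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim + 1, 2 * dim),
                                 nn.GELU(),
                                 nn.Linear(2 * dim, dim))

    def forward(self, x_t, t, cond):
        b, n, d = x_t.shape
        # Broadcast time and mean-pooled context to every token position.
        t_feat = t.view(b, 1, 1).expand(b, n, 1)
        c_feat = cond.mean(dim=1, keepdim=True).expand(b, n, d)
        return self.mlp(torch.cat([x_t, t_feat, c_feat], dim=-1))

# Hypothetical shapes: 8 one-token future frames plus a 16-step action
# chunk, both projected to dim 768 and concatenated along the sequence axis.
net = TinyVelocityNet()
target = torch.randn(4, 8 + 16, 768)
cond = torch.randn(4, 32, 768)
loss = flow_matching_loss(net, target, cond)
loss.backward()
```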