ARTFEED — Contemporary Art Intelligence

OneWM-VLA: Compressing Visual Stream to Single Token per Frame

ai-technology · 2026-05-11

Researchers propose OneWM-VLA, a method that compresses each video frame into a single semantic token for world-model-augmented vision-language-action (VLA) models. Existing approaches feed high-bandwidth visual streams into their world modules, leaving per-frame representation and its coupling with actions under-examined when adaptation budgets are constrained. OneWM-VLA uses Adaptive Attention Pooling to reduce per-frame visual bandwidth to a single token, and generates the latent visual stream and the action trajectory jointly under a single flow-matching objective. The authors report that this compression comes with no loss in performance.
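
For intuition, here is a minimal sketch of pooling a frame down to one token. The paper's exact Adaptive Attention Pooling design is not given here, so this assumes a standard learned-query cross-attention; the module name, dimensions, and PyTorch details are illustrative, not from the paper.

```python
# Hypothetical sketch: learned-query cross-attention that pools a frame's
# patch tokens into one semantic token. Not the paper's exact module.
import torch
import torch.nn as nn

class AttentionPoolOneToken(nn.Module):
    """Compress (batch, n_patches, dim) patch tokens into (batch, 1, dim)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)  # learned query
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        q = self.query.expand(patch_tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, patch_tokens, patch_tokens)  # attend over patches
        return self.norm(pooled)  # one semantic token per frame

pool = AttentionPoolOneToken(dim=768)
frame = torch.randn(2, 196, 768)  # e.g., 14x14 ViT patch tokens per frame
print(pool(frame).shape)          # torch.Size([2, 1, 768])
```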

Key facts

  • OneWM-VLA compresses each view into a single semantic token per frame.
  • Uses Adaptive Attention Pooling for compression.
  • Employs a single flow-matching objective for the latent stream and the action trajectory (see the sketch after this list).
  • Addresses the under-examined coupling of per-frame representation and action in existing world-model-augmented VLAs.
  • Reduces per-frame visual bandwidth without compromising performance.
  • Published on arXiv with ID 2605.07931.
  • Focuses on vision-language-action models.
  • Aims to improve long-horizon planning.
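
For reference, a hedged sketch of the single flow-matching objective mentioned above, in the common linear-interpolation form. The concatenation of latents and actions into one joint sample, and the velocity network v_theta, are assumptions for illustration, not details from the paper.

```python
# Illustrative joint flow-matching loss over the latent stream and the
# action trajectory; the joint parameterization is assumed, not sourced.
import torch
import torch.nn.functional as F

def joint_flow_matching_loss(v_theta, latents, actions, cond):
    # latents: (B, T, D_z) single-token frame latents; actions: (B, T, D_a)
    z1 = torch.cat([latents, actions], dim=-1)          # joint target sample
    z0 = torch.randn_like(z1)                           # noise source
    t = torch.rand(z1.size(0), 1, 1, device=z1.device)  # per-sample time
    zt = (1.0 - t) * z0 + t * z1                        # point on the linear path
    v_target = z1 - z0                                  # constant path velocity
    v_pred = v_theta(zt, t, cond)                       # predicted velocity field
    return F.mse_loss(v_pred, v_target)
```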

Entities

Institutions

  • arXiv
