ARTFEED — Contemporary Art Intelligence

ResVLA: Residual Bridges for Generative VLA Policies

ai-technology · 2026-04-25

ResVLA, a new architecture for generative vision-language-action (VLA) policies, addresses the spatiotemporal scale mismatch in embodied intelligence by shifting from a 'Generation-from-Noise' paradigm to a 'Refinement-from-Intent' one. The model uses spectral analysis to decompose robotic motion into a deterministic low-frequency intent and stochastic high-frequency residuals, then anchors the generative process on the predicted intent via a residual diffusion bridge. The approach is reported to improve representation efficiency and condition alignment. The paper is available on arXiv as 2604.21391.
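To make the spectral split concrete, here is a minimal sketch of separating an action chunk into a low-frequency part and a high-frequency residual with a plain FFT low-pass. It is an illustration under stated assumptions, not the paper's method: the function name `split_by_frequency`, the `keep_bins` cutoff, and the toy trajectory are all hypothetical.

```python
import numpy as np

def split_by_frequency(actions, keep_bins):
    """Split a (T, D) action chunk into low-frequency and residual parts.

    keep_bins is a hypothetical cutoff: the number of lowest rFFT bins
    (including the constant term) kept in the low-frequency component.
    """
    actions = np.asarray(actions, dtype=float)
    spectrum = np.fft.rfft(actions, axis=0)        # per-dimension spectrum over time
    low_spectrum = np.zeros_like(spectrum)
    low_spectrum[:keep_bins] = spectrum[:keep_bins]  # zero out everything above the cutoff
    low = np.fft.irfft(low_spectrum, n=actions.shape[0], axis=0)
    residual = actions - low                       # whatever the low-pass did not capture
    return low, residual

# Toy one-joint trajectory: a slow sweep plus small fast jitter.
t = np.linspace(0.0, 1.0, 64, endpoint=False)
slow = np.sin(2 * np.pi * t)                 # one cycle per chunk (bin 1)
jitter = 0.05 * np.sin(2 * np.pi * 20 * t)   # twenty cycles per chunk (bin 20)
chunk = (slow + jitter)[:, None]

low, residual = split_by_frequency(chunk, keep_bins=4)
```

In this toy case the low-frequency part recovers the slow sweep and the residual recovers the jitter, and the two always sum back to the original chunk. How ResVLA actually chooses its cutoff or predicts the low-frequency intent is not stated in this summary.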

Key facts

  • ResVLA shifts the paradigm from 'Generation-from-Noise' to 'Refinement-from-Intent'.
  • It uses spectral analysis to decouple control into a low-frequency anchor and a high-frequency residual.
  • The generative process is anchored on predicted intent via a residual diffusion bridge.
  • The paper is on arXiv with ID 2604.21391.
  • It addresses spatiotemporal scale mismatch in embodied intelligence.
  • Existing generative VLA policies suffer from representation inefficiency and weak condition alignment.
  • Robotic motion is decomposed into global intent and local dynamics.
  • The authors report that extensive simulation experiments demonstrate the approach's effectiveness.
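The idea of anchoring generation on a predicted intent rather than on pure noise can be sketched with a generic Brownian bridge, whose samples are pinned to the anchor at one end and to the full action at the other. This summary does not specify the form of ResVLA's residual diffusion bridge, so the Brownian bridge, the function `bridge_sample`, the `sigma` parameter, and the placeholder arrays below are assumptions made purely for illustration.

```python
import numpy as np

def bridge_sample(anchor, target, t, sigma, rng):
    """Sample a Brownian-bridge state pinned at anchor (t=0) and target (t=1).

    Generic stand-in, not the paper's formulation: the mean interpolates
    linearly and the noise scale sigma * sqrt(t * (1 - t)) vanishes at both ends.
    """
    anchor = np.asarray(anchor, dtype=float)
    target = np.asarray(target, dtype=float)
    mean = (1.0 - t) * anchor + t * target
    std = sigma * np.sqrt(t * (1.0 - t))
    return mean + std * rng.standard_normal(anchor.shape)

rng = np.random.default_rng(0)
anchor = np.zeros((8, 2))  # placeholder for a predicted low-frequency intent
target = np.ones((8, 2))   # placeholder for the full action chunk

start = bridge_sample(anchor, target, 0.0, 0.1, rng)  # equals the anchor
mid = bridge_sample(anchor, target, 0.5, 0.1, rng)    # noisy state between the two
end = bridge_sample(anchor, target, 1.0, 0.1, rng)    # equals the target
```

Because the process starts at the anchor instead of at Gaussian noise, a model trained on such states only has to account for the gap between intent and action, which is the intuition behind 'Refinement-from-Intent'.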

Entities

Institutions

  • arXiv

Sources