ResVLA: Residual Bridges for Generative VLA Policies

ai-technology · 2026-04-25

A new architecture called ResVLA addresses the spatiotemporal scale mismatch in embodied intelligence by shifting generative vision-language-action (VLA) policies from a 'Generation-from-Noise' to a 'Refinement-from-Intent' paradigm. The model uses spectral analysis to decompose robotic motion into deterministic low-frequency intent and stochastic high-frequency residuals, then anchors the generative process on the predicted intent via a residual diffusion bridge. The approach targets two weaknesses of existing generative VLA policies: representation inefficiency and weak condition alignment. The paper is available on arXiv under reference 2604.21391.
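To make the low-frequency/high-frequency split concrete, here is a minimal sketch in Python with NumPy. It is not the paper's method: it assumes a hard FFT cutoff as the "spectral analysis" step, and the function name `split_by_frequency`, the `keep_bins` parameter, and the toy trajectory are all hypothetical choices made for illustration.

```python
import numpy as np

def split_by_frequency(actions, keep_bins):
    """Split a (T, D) action trajectory into a low-frequency part and a residual.

    Assumption: a hard cutoff in the FFT domain stands in for the spectral
    decomposition; the summary does not say which filter ResVLA uses.
    """
    spectrum = np.fft.rfft(actions, axis=0)
    low_spectrum = np.zeros_like(spectrum)
    low_spectrum[:keep_bins] = spectrum[:keep_bins]  # keep the lowest bins only
    low = np.fft.irfft(low_spectrum, n=actions.shape[0], axis=0)
    residual = actions - low  # everything the low-pass part does not explain
    return low, residual

# Toy one-dimensional trajectory: a slow sweep plus a small fast oscillation.
T = 64
t = np.linspace(0.0, 1.0, T, endpoint=False)
slow = np.sin(2 * np.pi * 1 * t)
fast = 0.05 * np.sin(2 * np.pi * 20 * t)
trajectory = (slow + fast)[:, None]

low, residual = split_by_frequency(trajectory, keep_bins=4)
```

In this toy case the slow sweep lands in `low` and the fast oscillation in `residual`, and the two parts sum back to the original trajectory by construction.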

Key facts

ResVLA shifts the paradigm from 'Generation-from-Noise' to 'Refinement-from-Intent'.
It uses spectral analysis to decouple control into a low-frequency anchor and a high-frequency residual.
The generative process is anchored on predicted intent via a residual diffusion bridge.
The paper is on arXiv with ID 2604.21391.
It addresses spatiotemporal scale mismatch in embodied intelligence.
Existing generative VLA policies suffer from representation inefficiency and weak condition alignment.
Robotic motion is decomposed into global intent and local dynamics.
The authors report that extensive simulation experiments demonstrate the method's effectiveness.
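The idea of anchoring generation on predicted intent rather than on pure noise can be illustrated with a standard Brownian bridge. This is a swapped-in textbook construction, not ResVLA's residual diffusion bridge, whose exact formulation is not given here. The function `bridge_sample`, the noise scale `sigma`, and the toy endpoints are hypothetical.

```python
import numpy as np

def bridge_sample(anchor, target, t, sigma, rng):
    """Sample a Brownian bridge pinned at `anchor` (t=0) and `target` (t=1).

    Assumption: a plain Brownian bridge stands in for the residual diffusion
    bridge; the actual ResVLA process may differ.
    """
    mean = (1.0 - t) * anchor + t * target  # linear path between the endpoints
    std = sigma * np.sqrt(t * (1.0 - t))    # noise vanishes at both endpoints
    return mean + std * rng.standard_normal(anchor.shape)

rng = np.random.default_rng(0)
anchor = np.zeros((8, 2))  # stands in for the predicted low-frequency intent
target = np.ones((8, 2))   # stands in for the full action sequence

start = bridge_sample(anchor, target, 0.0, 0.1, rng)
mid = bridge_sample(anchor, target, 0.5, 0.1, rng)
end = bridge_sample(anchor, target, 1.0, 0.1, rng)
```

Unlike a process that starts from Gaussian noise, this one starts exactly at the anchor and ends exactly at the target, so a model trained along it only has to account for the gap between the two.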
