ResVLA: Residual Bridges for Generative VLA Policies
ResVLA is a new architecture that addresses the spatiotemporal scale mismatch in embodied intelligence by shifting generative vision-language-action (VLA) policies from a 'Generation-from-Noise' to a 'Refinement-from-Intent' paradigm. The model uses spectral analysis to decompose robotic motion into a deterministic low-frequency intent and a stochastic high-frequency residual, then anchors the generative process on the predicted intent via a residual diffusion bridge. According to the authors, this improves representation efficiency and condition alignment. The paper is available on arXiv as 2604.21391.
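The low-frequency/high-frequency split described above can be illustrated with a plain Fourier low-pass filter. This is only a minimal sketch: the function name `split_low_high`, the `keep` cutoff, and the toy trajectory are hypothetical, and the summary does not say which spectral transform or cutoff ResVLA actually uses.

```python
import numpy as np

def split_low_high(actions: np.ndarray, keep: int):
    """Split a (T, D) action chunk into a low-frequency anchor and a residual.

    `keep` is the number of lowest rFFT bins retained per action dimension.
    It is an illustrative hyperparameter, not a value taken from the paper.
    """
    T = actions.shape[0]
    spectrum = np.fft.rfft(actions, axis=0)   # shape (T // 2 + 1, D)
    low = spectrum.copy()
    low[keep:] = 0.0                          # zero every bin at or above the cutoff
    anchor = np.fft.irfft(low, n=T, axis=0)   # smooth "intent" trajectory
    residual = actions - anchor               # remaining fine-grained dynamics
    return anchor, residual

# Toy 1-D trajectory: one slow cycle plus a small, fast 20-cycle jitter.
t = np.linspace(0.0, 1.0, 64, endpoint=False)
actions = (np.sin(2 * np.pi * t) + 0.05 * np.sin(2 * np.pi * 20 * t))[:, None]
anchor, residual = split_low_high(actions, keep=4)
```

By construction the two parts sum back to the original trajectory, so a model that predicts the anchor only has to generate the (much smaller) residual on top of it.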
Key facts
- ResVLA shifts the paradigm from 'Generation-from-Noise' to 'Refinement-from-Intent'.
- It uses spectral analysis to decouple control into a low-frequency anchor and a high-frequency residual.
- The generative process is anchored on predicted intent via a residual diffusion bridge.
- The paper is on arXiv with ID 2604.21391.
- It addresses spatiotemporal scale mismatch in embodied intelligence.
- Existing generative VLA policies suffer from representation inefficiency and weak condition alignment.
- Robotic motion is decomposed into global intent and local dynamics.
- The authors report that extensive simulation experiments demonstrate the method's effectiveness.
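To make the idea of anchoring generation on predicted intent more concrete, the sketch below samples an intermediate state from a generic Brownian bridge that starts at the low-frequency anchor instead of at Gaussian noise and ends at the target action. This is an assumption-laden illustration, not the paper's formulation: the function `bridge_sample`, the noise scale `sigma`, and the linear-mean schedule are all hypothetical choices.

```python
import numpy as np

def bridge_sample(anchor, target, s, sigma=0.1, rng=None):
    """Sample x_s from a Brownian bridge pinned at `anchor` (s=0) and `target` (s=1).

    Generic textbook bridge, used here only as a stand-in for a residual
    diffusion bridge; `sigma` is a hypothetical noise scale.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(anchor.shape)
    mean = (1.0 - s) * anchor + s * target      # linear path from intent to action
    std = sigma * np.sqrt(s * (1.0 - s))        # noise vanishes at both endpoints
    return mean + std * noise

# Toy example: the anchor is a coarse intent, the target is the full action chunk.
rng = np.random.default_rng(0)
anchor_chunk = np.zeros((8, 2))
target_chunk = np.ones((8, 2))
x_start = bridge_sample(anchor_chunk, target_chunk, 0.0, rng=rng)
x_mid = bridge_sample(anchor_chunk, target_chunk, 0.5, rng=rng)
x_end = bridge_sample(anchor_chunk, target_chunk, 1.0, rng=rng)
```

Because the process is pinned to the anchor at one end, the generative model only has to cover the gap between intent and action, which is the intuition behind 'Refinement-from-Intent' as opposed to 'Generation-from-Noise'.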
Entities
Institutions
- arXiv