Inverse Kinematics Solver Improves Driving VLA Trajectory Prediction
A recent study published on arXiv (2605.21061) identifies a flaw in current Driving Vision-Language-Action (VLA) models: they ignore visual tokens when predicting trajectories. The researchers trace the problem to a structurally ill-posed task formulation rather than to insufficient training. They show that recovering a trajectory through inverse kinematics requires both the current and the future visual state as boundary conditions. Existing VLAs supply only the current state, so the model falls back on shortcuts through ego status and text commands. To address this, the authors propose redesigning the Driving VLA as an inverse kinematics solver. The redesign adds a next visual state prediction objective that forces the LLM to forecast the upcoming visual scene, which provides dense visual supervision and discourages shortcuts. In addition, a separate Inverse Kinematics Network, built on cross-attention-based conditional diffusion, takes only the current and future visual states as input, which suppresses reliance on ego status and text commands. The approach aims to improve the robustness and visual grounding of trajectory prediction.
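The boundary-condition argument can be illustrated with a toy example. The sketch below is not the paper's method: it assumes a 2-D point mass under constant acceleration, and it uses plain positions as a stand-in for visual states. The names `rollout` and `solve_acceleration` are hypothetical. With only the current state, any acceleration yields a plausible trajectory; once the future state is also fixed, the acceleration is determined uniquely.

```python
from typing import List, Tuple

Vec = Tuple[float, float]


def rollout(p0: Vec, v0: Vec, a: Vec, horizon: float, steps: int) -> List[Vec]:
    """Constant-acceleration rollout: p(t) = p0 + v0*t + 0.5*a*t^2."""
    traj = []
    for k in range(1, steps + 1):
        t = horizon * k / steps
        traj.append((
            p0[0] + v0[0] * t + 0.5 * a[0] * t * t,
            p0[1] + v0[1] * t + 0.5 * a[1] * t * t,
        ))
    return traj


def solve_acceleration(p0: Vec, v0: Vec, p_future: Vec, horizon: float) -> Vec:
    """Recover the one constant acceleration that reaches p_future at t = horizon."""
    return tuple(
        2.0 * (pf - p - v * horizon) / (horizon * horizon)
        for p, v, pf in zip(p0, v0, p_future)
    )


# Toy values (assumptions for illustration only).
p0, v0, horizon = (0.0, 0.0), (5.0, 0.0), 2.0

# Only the current state is known: both candidates start identically,
# yet they end in different places, so the problem is under-determined.
straight = rollout(p0, v0, (0.0, 0.0), horizon, steps=4)
drifting = rollout(p0, v0, (0.0, 1.0), horizon, steps=4)

# Adding the future state as a second boundary condition pins down the motion.
a_star = solve_acceleration(p0, v0, p_future=(10.0, 2.0), horizon=horizon)
recovered = rollout(p0, v0, a_star, horizon, steps=4)
```

In this toy setting, `straight` and `drifting` are both consistent with the current state, while `recovered` is the only constant-acceleration trajectory that also matches the given future state.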
Key facts
- arXiv paper 2605.21061 identifies a flaw in Driving VLAs: they ignore visual tokens during trajectory prediction.
- The problem is traced to a structurally ill-posed task formulation, not insufficient training.
- Trajectory recovery via inverse kinematics requires both current and future visual states as boundary conditions.
- Existing VLAs supply only the current visual state, encouraging shortcuts through ego status and text commands.
- The proposed solution includes a next visual state prediction objective for dense visual supervision.
- A separate Inverse Kinematics Network uses cross-attention-based conditional diffusion, taking only current and future visual states.
- The design suppresses reliance on ego status and textual commands.
- The approach aims to improve robustness and visual grounding of trajectory prediction.
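As a rough illustration of conditioning on visual states only, the sketch below implements generic scaled dot-product cross-attention in plain Python. It is an assumption-laden toy, not the paper's Inverse Kinematics Network: there are no learned projections and no diffusion schedule, and the token values and names (`cross_attention`, `current_visual`, `future_visual`, `noisy_waypoints`) are hypothetical. The point it shows is that the context contains current and future visual tokens and nothing else, so ego status and text cannot leak into the output.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, context: Matrix) -> Matrix:
    """Scaled dot-product attention with identity projections (no learned weights)."""
    d = len(context[0])
    out = []
    for q in queries:
        # Similarity between one query token and every context token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in context]
        weights = softmax(scores)
        # Each output token is a convex combination of the context tokens.
        out.append([sum(w * v[j] for w, v in zip(weights, context)) for j in range(d)])
    return out


# Hypothetical toy tokens (2-D features for readability).
current_visual = [[1.0, 0.0], [0.0, 1.0]]
future_visual = [[1.0, 1.0]]

# The conditioning context holds visual states only: no ego status, no text.
context = current_visual + future_visual

# Noisy waypoint tokens act as queries, as in one denoising step.
noisy_waypoints = [[0.5, -0.2], [0.1, 0.3]]
conditioned = cross_attention(noisy_waypoints, context)
```

A real conditional diffusion model would wrap a learned version of this layer in a denoiser and apply it repeatedly over a noise schedule; only the conditioning path is sketched here.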
Entities
Institutions
- arXiv