DVGT-2: Streaming Vision-Geometry-Action Model for Autonomous Driving
Researchers have introduced Vision-Geometry-Action (VGA), a paradigm for end-to-end autonomous driving that treats dense 3D geometry, rather than the language descriptions used by vision-language-action (VLA) models, as the critical cue for driving. Its instantiation, DVGT-2, is a streaming Driving Visual Geometry Transformer that processes inputs online, jointly producing dense geometry and a planned trajectory for the current frame. This addresses the high computational cost of prior geometry reconstruction methods such as DVGT, which rely on batch processing of multi-frame inputs and are therefore unsuitable for online planning. Temporal causal attention lets each frame attend only to past and current observations, enabling real-time decision-making. The paper is available on arXiv under identifier 2604.00813.
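The paper itself does not publish implementation details here, but the core idea of temporal causal attention in a streaming setting can be illustrated with a minimal, hypothetical sketch: each frame's attention scores are masked so that frame i can only attend to frames 0..i, which is what makes per-frame online output possible. The function names and single-head formulation below are illustrative assumptions, not the authors' code.

```python
import numpy as np

def causal_mask(t):
    # Lower-triangular mask: frame i may attend only to frames <= i.
    return np.tril(np.ones((t, t), dtype=bool))

def temporal_causal_attention(q, k, v):
    """Single-head scaled dot-product attention over T frames with a
    causal mask, so each frame's output depends only on the past and
    the current frame (illustrative sketch, not the paper's model)."""
    t, d = q.shape
    scores = q @ k.T / np.sqrt(d)                      # (T, T) similarities
    scores = np.where(causal_mask(t), scores, -np.inf) # hide future frames
    # Numerically stable softmax over the allowed (past) frames.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                                 # (T, d) per-frame output
```

Because of the mask, perturbing a later frame leaves the outputs for all earlier frames unchanged, which is exactly the property a streaming planner needs: the trajectory for the current frame never waits on future inputs.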
Key facts
- DVGT-2 is a streaming Driving Visual Geometry Transformer for autonomous driving.
- It processes inputs online and jointly outputs dense geometry and a planned trajectory for the current frame.
- The model uses temporal causal attention for real-time decision-making.
- Prior geometry reconstruction methods like DVGT rely on batch processing of multi-frame inputs.
- The VGA paradigm advocates dense 3D geometry as the critical cue for autonomous driving.
- VLA models focus on learning language descriptions as an auxiliary task.
- The paper is available on arXiv with identifier 2604.00813.
- The approach aims to overcome the computational expense of existing geometry reconstruction methods.
Entities
Institutions
- arXiv