Geometry-Aware BEV Improves Vision-Language Navigation
Geometry-Aware BEV (GA-BEV) is a bird's-eye-view (BEV) feature representation for Vision-Language Navigation (VLN) that aims to improve spatial reasoning while lowering computational cost. Existing methods typically feed dense RGB video to the model, which produces a large number of patch tokens and carries no explicit spatial structure. GA-BEV instead builds compact, 3D-grounded features from RGB-D input: visual features are projected into 3D space and aggregated into an agent-centric layout. This reduces token redundancy while preserving geometric consistency. The method integrates both explicit and implicit geometric cues into navigation systems based on multimodal large language models (MLLMs). The implicit cues come from features of a pretrained 3D foundation model, which bring in structural knowledge learned from large-scale 3D reconstruction. The paper is available on arXiv under ID 2605.22036.
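The projection-and-aggregation step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the pinhole camera model, the mean pooling, the grid size, and the names `backproject` and `pool_to_bev` are all assumptions made for illustration, and the pretrained 3D foundation model features are left out entirely. Each patch feature is lifted to 3D using its depth, height is dropped, and features that land in the same ground-plane cell around the agent are averaged.

```python
import math
from collections import defaultdict


def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at metric depth into
    camera coordinates (x right, y down, z forward). Assumed camera model."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return x, y, depth


def pool_to_bev(patches, intrinsics, cell_size=0.5, half_extent=4.0):
    """Mean-pool patch features into an agent-centric ground-plane grid.

    patches: iterable of (u, v, depth, feature_list), one entry per patch.
    The agent sits at the grid center; the grid spans [-half_extent,
    half_extent) metres along x and z. Returns {(row, col): mean_feature}
    for occupied cells only. Hypothetical helper, not the paper's API.
    """
    fx, fy, cx, cy = intrinsics
    n_cells = int(round(2 * half_extent / cell_size))
    sums = {}
    counts = defaultdict(int)
    for u, v, depth, feat in patches:
        if depth <= 0:  # skip missing or invalid depth readings
            continue
        x, _, z = backproject(u, v, depth, fx, fy, cx, cy)  # height is dropped
        col = math.floor((x + half_extent) / cell_size)
        row = math.floor((z + half_extent) / cell_size)
        if not (0 <= row < n_cells and 0 <= col < n_cells):
            continue  # point falls outside the local map
        key = (row, col)
        if key not in sums:
            sums[key] = [0.0] * len(feat)
        for i, value in enumerate(feat):
            sums[key][i] += value
        counts[key] += 1
    return {k: [s / counts[k] for s in sums[k]] for k in sums}


# Toy input: five patches with 2-dimensional features (made-up values).
demo_patches = [
    (50, 50, 1.0, [1.0, 2.0]),   # straight ahead, 1 m away
    (60, 50, 1.0, [3.0, 4.0]),   # slightly right, lands in the same cell
    (50, 50, 3.0, [5.0, 6.0]),   # straight ahead, 3 m away
    (50, 50, 0.0, [9.0, 9.0]),   # invalid depth, skipped
    (50, 50, 10.0, [9.0, 9.0]),  # beyond the grid, skipped
]
bev = pool_to_bev(demo_patches, intrinsics=(100.0, 100.0, 50.0, 50.0))
print(len(demo_patches), "patches ->", len(bev), "BEV tokens")
```

In this toy case the five patch features collapse into two occupied cells, which is the sense in which a BEV layout can cut token count: the number of tokens is bounded by the number of occupied cells rather than by the number of image patches.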
Key facts
- GA-BEV is a compact, 3D-grounded feature representation for VLN.
- It uses RGB-D inputs to construct BEV spatial maps.
- Visual features are projected into 3D space and aggregated into an agent-centric layout.
- Geometric consistency is preserved while token redundancy is reduced.
- Features from a pretrained 3D foundation model enrich geometric understanding.
- The method integrates explicit and implicit geometric cues into MLLM-based navigation.
- It addresses computational overhead and limited spatial reasoning in existing VLN approaches.
- The paper is arXiv:2605.22036.
Entities
Institutions
- arXiv