P2DNav: Hierarchical Framework for Zero-Shot Vision-and-Language Navigation

ai-technology · 2026-05-20

P2DNav is a newly proposed hierarchical framework for zero-shot vision-and-language navigation (VLN). It decomposes navigation into two stages: panoramic direction selection and downview local grounding. The framework has three components: P2D, SDM, and RRM. P2D first selects a direction from a 360-degree panorama, then predicts a pixel-level target from a downview RGB image. Separating these two decisions is intended to reduce the errors that arise from entangled reasoning when navigating unseen environments. The research has been published on arXiv under the identifier 2605.19634.

Key facts

P2DNav is a hierarchical framework for zero-shot VLN
It decomposes navigation into panoramic direction selection and downview local grounding
Components: P2D, SDM, RRM
P2D first selects a direction from a 360-degree panorama
P2D then predicts a pixel-level target from a downview RGB image
Aims to reduce errors from entangled reasoning
Published on arXiv:2605.19634
Addresses zero-shot VLN in unseen environments
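
The two-stage decomposition described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `Action` type, and the use of precomputed numeric scores in place of a vision-language model are all assumptions made for illustration, and the SDM and RRM components are not modeled because the summary does not describe them.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in: a 2D grid of per-pixel scores in place of an RGB frame.
Image = List[List[float]]


@dataclass
class Action:
    heading_deg: float          # direction chosen in stage 1
    target_px: Tuple[int, int]  # (row, col) chosen in stage 2


def select_direction(view_scores: Sequence[float]) -> float:
    """Stage 1 (assumed form): pick the best-scoring panoramic view.

    view_scores[i] is an assumed instruction-relevance score for the view
    centred at heading i * 360 / len(view_scores) degrees.
    """
    if not view_scores:
        raise ValueError("need at least one panoramic view")
    best = max(range(len(view_scores)), key=lambda i: view_scores[i])
    return best * 360.0 / len(view_scores)


def ground_target(downview: Image) -> Tuple[int, int]:
    """Stage 2 (assumed form): pick the highest-scoring downview pixel."""
    if not downview or not downview[0]:
        raise ValueError("downview image is empty")
    best_rc, best_val = (0, 0), float("-inf")
    for r, row in enumerate(downview):
        for c, val in enumerate(row):
            if val > best_val:
                best_rc, best_val = (r, c), val
    return best_rc


def navigate_step(
    view_scores: Sequence[float],
    get_downview: Callable[[float], Image],
) -> Action:
    """Run direction selection first, then local grounding in that direction."""
    heading = select_direction(view_scores)
    downview = get_downview(heading)  # hypothetical camera callback
    return Action(heading, ground_target(downview))


# Four views at 0, 90, 180 and 270 degrees; the 90-degree view scores highest.
scores = [0.1, 0.7, 0.2, 0.4]
action = navigate_step(scores, lambda heading: [[0.0, 0.2], [0.9, 0.1]])
print(action)  # Action(heading_deg=90.0, target_px=(1, 0))
```

The point of the sketch is the ordering: the pixel-level choice is made only inside the view selected by the coarser direction choice, so the two decisions are never resolved in a single step.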
