SkillNav: Modular Skill-Based Framework for Vision-and-Language Navigation
Researchers propose SkillNav, a modular framework for Vision-and-Language Navigation (VLN) that decomposes navigation into interpretable atomic skills such as Vertical Movement, Area and Region Identification, and Stop and Pause. Each skill is handled by a specialized agent within a Transformer-based architecture. To enable targeted skill training without manual annotation, the authors develop a synthetic dataset pipeline that generates diverse, linguistically natural instruction-trajectory pairs. The approach aims to improve generalization to unseen scenarios requiring complex spatial and temporal reasoning. The work is detailed in arXiv preprint 2508.07642.
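The modular idea can be illustrated with a minimal sketch: one handler per skill, plus a dispatcher that picks which handler acts on the current instruction. This is an illustration, not the authors' implementation. The summary does not say how SkillNav selects a skill, so the keyword-based `select_skill` router, the `Observation` type, the action strings, and the agent functions below are all hypothetical placeholders; only the three skill names come from the summary.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, Tuple


class Skill(Enum):
    # The three skills named in the summary; the full skill set may be larger.
    VERTICAL_MOVEMENT = "Vertical Movement"
    AREA_AND_REGION_IDENTIFICATION = "Area and Region Identification"
    STOP_AND_PAUSE = "Stop and Pause"


@dataclass
class Observation:
    # Hypothetical input type: a real VLN agent would also carry visual features.
    instruction: str


# Hypothetical keyword cues, used only to make the sketch runnable.
# The actual skill-selection mechanism is not described in the summary.
KEYWORDS: Dict[Skill, Tuple[str, ...]] = {
    Skill.VERTICAL_MOVEMENT: ("stairs", "upstairs", "downstairs", "floor"),
    Skill.STOP_AND_PAUSE: ("stop", "wait", "pause"),
}


def select_skill(obs: Observation) -> Skill:
    """Pick the first skill whose cue appears in the instruction."""
    text = obs.instruction.lower()
    for skill, cues in KEYWORDS.items():
        if any(cue in text for cue in cues):
            return skill
    # Fall back to region identification when no cue matches.
    return Skill.AREA_AND_REGION_IDENTIFICATION


# Placeholder agents: each returns a made-up action label for its skill.
def vertical_agent(obs: Observation) -> str:
    return "take_stairs"


def region_agent(obs: Observation) -> str:
    return "move_to_region"


def stop_agent(obs: Observation) -> str:
    return "stop"


AGENTS: Dict[Skill, Callable[[Observation], str]] = {
    Skill.VERTICAL_MOVEMENT: vertical_agent,
    Skill.AREA_AND_REGION_IDENTIFICATION: region_agent,
    Skill.STOP_AND_PAUSE: stop_agent,
}


def step(obs: Observation) -> Tuple[Skill, str]:
    """Route the observation to one specialized agent and return its action."""
    skill = select_skill(obs)
    return skill, AGENTS[skill](obs)
```

Keeping the agents behind a single dispatch table means each one can be trained, replaced, or inspected in isolation, which is the kind of interpretability a modular design is meant to provide.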
Key facts
- SkillNav is a modular framework for VLN agents.
- It decomposes navigation into atomic skills such as Vertical Movement, Area and Region Identification, and Stop and Pause.
- Each skill is handled by a specialized agent.
- The framework is built on a Transformer-based architecture.
- A synthetic dataset pipeline generates skill-specific instruction-trajectory pairs.
- The pipeline produces diverse, linguistically natural data without manual annotation.
- The method targets improved generalization to unseen scenarios.
- The work is available as arXiv preprint 2508.07642.
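To make "skill-specific instruction-trajectory pairs" concrete, here is a rough sketch of how such data might be stored and split for per-skill training. The `SkillSample` record, its field names, the viewpoint-ID trajectory format, and the example rows are assumptions made for illustration; the summary does not describe the pipeline's actual schema or generation method.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class SkillSample:
    # Hypothetical record layout, not the authors' schema.
    skill: str                  # skill label, e.g. "Vertical Movement"
    instruction: str            # natural-language instruction
    trajectory: Tuple[str, ...]  # ordered viewpoint IDs (assumed representation)


def group_by_skill(samples: Iterable[SkillSample]) -> Dict[str, List[SkillSample]]:
    """Bucket samples by skill so each specialized agent sees only its own data."""
    buckets: Dict[str, List[SkillSample]] = defaultdict(list)
    for sample in samples:
        buckets[sample.skill].append(sample)
    return dict(buckets)


# Made-up placeholder rows, only to exercise the function.
samples = [
    SkillSample("Vertical Movement", "Go up the stairs to the landing.", ("v1", "v2", "v3")),
    SkillSample("Stop and Pause", "Stop next to the dining table.", ("v4", "v5")),
    SkillSample("Vertical Movement", "Head down to the basement.", ("v6", "v7")),
]

per_skill = group_by_skill(samples)
```

Here `per_skill["Vertical Movement"]` holds the two stair-related samples, which a training loop could feed to the corresponding agent alone.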
Entities
Institutions
- arXiv