SkillNav: Modular Skill-Based Framework for Vision-and-Language Navigation

other · 2026-05-14

Researchers propose SkillNav, a modular framework for Vision-and-Language Navigation (VLN) that decomposes navigation into interpretable atomic skills such as Vertical Movement, Area and Region Identification, and Stop and Pause. Each skill is handled by a specialized agent within a Transformer-based architecture. To enable targeted skill training without manual annotation, the authors develop a synthetic dataset pipeline that generates diverse, linguistically natural instruction-trajectory pairs. The approach aims to improve generalization to unseen scenarios requiring complex spatial and temporal reasoning. The work is detailed in arXiv preprint 2508.07642.

Key facts

SkillNav is a modular framework for VLN agents.
It decomposes navigation into atomic skills such as Vertical Movement, Area and Region Identification, and Stop and Pause.
Each skill is handled by a specialized agent.
The framework is built on a Transformer-based architecture.
A synthetic dataset pipeline generates skill-specific instruction-trajectory pairs.
The pipeline produces diverse, linguistically natural data without manual annotation.
The method targets improved generalization to unseen scenarios.
The work is available as arXiv preprint 2508.07642.
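
The modular idea in the facts above, one specialized agent per atomic skill, can be illustrated with a small dispatch sketch. This is not the authors' implementation: the `Observation` type, the three toy agents, and the keyword-based `route` function are all hypothetical stand-ins, and the summary does not say how SkillNav actually selects a skill or what its Transformer-based agents look like. Only the three skill names come from the source.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Observation:
    """Hypothetical per-step input: an instruction plus navigable candidates."""
    instruction: str
    candidates: List[str]


# A skill agent maps an observation to a chosen candidate name (or "STOP").
SkillAgent = Callable[[Observation], str]


def vertical_movement_agent(obs: Observation) -> str:
    """Toy agent: prefer a candidate that looks like stairs or an elevator."""
    for cand in obs.candidates:
        if "stairs" in cand.lower() or "elevator" in cand.lower():
            return cand
    return obs.candidates[0]


def region_identification_agent(obs: Observation) -> str:
    """Toy agent: pick the candidate whose name appears in the instruction."""
    text = obs.instruction.lower()
    for cand in obs.candidates:
        if cand.lower() in text:
            return cand
    return obs.candidates[0]


def stop_and_pause_agent(obs: Observation) -> str:
    """Toy agent: always end the episode."""
    return "STOP"


# Skill names are taken from the summary; the agents are placeholders.
SKILLS: Dict[str, SkillAgent] = {
    "Vertical Movement": vertical_movement_agent,
    "Area and Region Identification": region_identification_agent,
    "Stop and Pause": stop_and_pause_agent,
}

# Assumed keyword router, used only to make the sketch runnable.
ROUTING_KEYWORDS: Dict[str, List[str]] = {
    "Vertical Movement": ["upstairs", "downstairs", "stairs", "floor"],
    "Stop and Pause": ["stop", "wait", "pause"],
}


def route(instruction: str) -> str:
    """Return the name of the skill that should handle this instruction."""
    text = instruction.lower()
    for skill, keywords in ROUTING_KEYWORDS.items():
        if any(k in text for k in keywords):
            return skill
    return "Area and Region Identification"


def step(obs: Observation) -> str:
    """Dispatch one navigation step to the selected skill agent."""
    return SKILLS[route(obs.instruction)](obs)
```

Under these assumptions, `step(Observation("Go upstairs", ["hallway", "stairs_up"]))` is routed to the Vertical Movement agent. The point of the design is that each agent can be trained or replaced independently, as long as it keeps the same input and output contract.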
