ARTFEED — Contemporary Art Intelligence

SkillNav: Modular Skill-Based Framework for Vision-and-Language Navigation

other · 2026-05-14

Researchers propose SkillNav, a modular framework for Vision-and-Language Navigation (VLN) that decomposes navigation into interpretable atomic skills such as Vertical Movement, Area and Region Identification, and Stop and Pause. Each skill is handled by a specialized agent within a Transformer-based architecture. To enable targeted skill training without manual annotation, the authors develop a synthetic dataset pipeline that generates diverse, linguistically natural instruction-trajectory pairs. The approach aims to improve generalization to unseen scenarios requiring complex spatial and temporal reasoning. The work is detailed in arXiv preprint 2508.07642.
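
The modular idea described above, where each atomic skill is handled by its own agent, can be sketched roughly as a dispatch table. This is only an illustration: the three skill names come from the article, while the keyword lists, the `SkillAgent` class, and the `route` function are hypothetical, and simple keyword overlap stands in for whatever learned routing the actual system uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Skill names are taken from the article; the keyword lists are invented
# purely for illustration and are not part of SkillNav.
SKILL_KEYWORDS: Dict[str, List[str]] = {
    "Vertical Movement": ["stairs", "upstairs", "downstairs", "floor"],
    "Area and Region Identification": ["kitchen", "bedroom", "hallway", "room"],
    "Stop and Pause": ["stop", "wait", "pause"],
}


@dataclass
class SkillAgent:
    """Hypothetical wrapper around one specialized skill policy."""
    name: str
    act: Callable[[str], str]  # maps a sub-instruction to a description of the action


def make_agent(name: str) -> SkillAgent:
    # Placeholder policy: a real agent would be a trained model.
    return SkillAgent(name=name, act=lambda text: f"[{name}] handles: {text}")


AGENTS: Dict[str, SkillAgent] = {name: make_agent(name) for name in SKILL_KEYWORDS}


def route(sub_instruction: str) -> SkillAgent:
    """Pick the skill whose keywords overlap most with the sub-instruction."""
    cleaned = sub_instruction.lower().replace(",", " ").replace(".", " ")
    words = set(cleaned.split())
    scores = {name: len(words & set(kws)) for name, kws in SKILL_KEYWORDS.items()}
    best = max(scores, key=scores.get)  # ties fall back to dictionary order
    return AGENTS[best]


if __name__ == "__main__":
    for step in ["Go upstairs to the second floor.",
                 "Walk into the kitchen.",
                 "Stop and wait by the door."]:
        agent = route(step)
        print(agent.act(step))
```

The point of the sketch is the separation of concerns: the router only decides which skill applies, and each agent can then be trained or replaced independently of the others.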

Key facts

  • SkillNav is a modular framework for VLN agents.
  • It decomposes navigation into atomic skills such as Vertical Movement, Area and Region Identification, and Stop and Pause.
  • Each skill is handled by a specialized agent.
  • The framework is based on a Transformer architecture.
  • A synthetic dataset pipeline generates skill-specific instruction-trajectory pairs.
  • The pipeline produces diverse, linguistically natural data without manual annotation.
  • The method targets improved generalization to unseen scenarios.
  • The research is available as arXiv preprint 2508.07642.
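
To make the notion of skill-specific instruction-trajectory pairs concrete, here is a toy sketch of what such records might look like. Everything in it is an assumption for illustration: the `TEMPLATES` table, the action labels, and the `sample_pairs` function are hypothetical, and fixed templates are a deliberately simple stand-in for the pipeline described in the article, which is said to produce diverse, linguistically natural instructions.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical instructions and action labels, invented for illustration only.
# Each entry pairs an instruction with the trajectory (action sequence) it describes.
TEMPLATES: Dict[str, List[Tuple[str, List[str]]]] = {
    "Vertical Movement": [
        ("Go up the stairs to the next floor.", ["forward", "up", "up"]),
        ("Take the staircase down one level.", ["forward", "down"]),
    ],
    "Stop and Pause": [
        ("Walk to the doorway and stop there.", ["forward", "forward", "stop"]),
    ],
}


def sample_pairs(skill: str, n: int, seed: int = 0) -> List[dict]:
    """Draw n labeled instruction-trajectory pairs for a single skill."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    pairs = []
    for _ in range(n):
        instruction, trajectory = rng.choice(TEMPLATES[skill])
        pairs.append({
            "skill": skill,
            "instruction": instruction,
            "trajectory": list(trajectory),  # copy so callers cannot mutate the template
        })
    return pairs


if __name__ == "__main__":
    for pair in sample_pairs("Vertical Movement", 2):
        print(pair)
```

Because every record carries its skill label, data of this shape could be used to train one agent per skill without manual annotation, which is the property the article highlights.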

Entities

Institutions

  • arXiv

Sources