DIAL Framework Decouples Intent and Action in VLA Models
Researchers have introduced DIAL (Decoupling Intent and Action via Latent World Modeling), a framework for Vision-Language-Action (VLA) models that separates high-level decision-making from low-level motor execution. Existing end-to-end VLAs treat Vision-Language Models (VLMs) primarily as multimodal encoders; DIAL instead uses a VLM-based System-2 for latent world modeling, synthesizing visual foresight within the VLM's native feature space. This foresight encodes intent and acts as a structural bottleneck, which a lightweight System-1 policy then decodes into actions. The design addresses two shortcomings of end-to-end training: instability and underutilization of the VLM's semantic representations. The paper is available on arXiv under identifier 2603.29844.
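The paper's code is not reproduced here; the following is a minimal PyTorch sketch of how such a decoupled System-2/System-1 pipeline could be wired. All module names, dimensions, and the attention-based foresight readout are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LatentWorldModel(nn.Module):
    """Hypothetical System-2: predicts latent visual foresight tokens
    in the VLM's feature space from current multimodal features."""
    def __init__(self, dim=512, num_foresight_tokens=8):
        super().__init__()
        # Learned queries that read foresight ("intent") out of VLM tokens.
        self.queries = nn.Parameter(torch.randn(num_foresight_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, vlm_features):
        # vlm_features: (B, T, dim) token features from a VLM backbone.
        q = self.queries.unsqueeze(0).expand(vlm_features.size(0), -1, -1)
        foresight, _ = self.attn(q, vlm_features, vlm_features)
        # A small, fixed set of foresight tokens acts as the bottleneck.
        return self.proj(foresight)  # (B, K, dim) latent "intent"

class ActionPolicy(nn.Module):
    """Hypothetical System-1: lightweight policy that decodes the intent
    tokens into a chunk of low-level actions."""
    def __init__(self, dim=512, action_dim=7, horizon=16):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(),
            nn.Linear(dim, action_dim * horizon),
        )
        self.action_dim, self.horizon = action_dim, horizon

    def forward(self, foresight):
        pooled = foresight.mean(dim=1)  # compress the bottleneck tokens
        return self.decoder(pooled).view(-1, self.horizon, self.action_dim)

# Toy forward pass: System-2 produces intent, System-1 decodes actions.
vlm_features = torch.randn(2, 64, 512)   # stand-in for VLM token features
intent = LatentWorldModel()(vlm_features)
actions = ActionPolicy()(intent)
print(actions.shape)  # torch.Size([2, 16, 7])
```

The structural point this sketch tries to capture is that the policy never sees raw VLM tokens: everything it conditions on must pass through the small set of foresight tokens, which is what makes the foresight a bottleneck.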
Key facts
- DIAL stands for Decoupling Intent and Action via Latent World Modeling.
- The framework targets end-to-end Vision-Language-Action (VLA) models.
- It uses a VLM-based System-2 for latent world modeling.
- System-2 synthesizes latent visual foresight in the VLM's native feature space.
- The foresight encodes intent and serves as a structural bottleneck.
- A lightweight System-1 policy decodes intent into low-level actions.
- The approach aims to reduce training instability and make better use of the VLM's semantic representations (a hedged training-objective sketch follows this list).
- The paper is published on arXiv with ID 2603.29844.
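Continuing the sketch above, one plausible way to train such a system is to supervise the foresight tokens against the VLM features of future observations (the "latent world modeling" part) while jointly imitating expert actions. The pooling, MSE targets, and equal loss weighting below are assumptions for illustration, not the paper's recipe.

```python
import torch
import torch.nn.functional as F

def training_step(world_model, policy, vlm_features,
                  future_vlm_features, expert_actions):
    # System-2 predicts latent foresight ("intent") from current features.
    intent = world_model(vlm_features)            # (B, K, D)
    # World-modeling loss: regress pooled foresight onto pooled VLM
    # features of future observations (pooling choice is an assumption).
    pred = intent.mean(dim=1)                     # (B, D)
    target = future_vlm_features.mean(dim=1)      # (B, D)
    foresight_loss = F.mse_loss(pred, target)
    # Imitation loss on the decoded low-level action chunk.
    action_loss = F.mse_loss(policy(intent), expert_actions)
    # Equal weighting is an arbitrary choice for this sketch.
    return foresight_loss + action_loss

# Toy usage with the modules defined in the previous sketch.
loss = training_step(LatentWorldModel(), ActionPolicy(),
                     torch.randn(2, 64, 512),    # current VLM features
                     torch.randn(2, 64, 512),    # future VLM features
                     torch.randn(2, 16, 7))      # expert action chunk
```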
Entities
Institutions
- arXiv