DIAL Framework Decouples Intent and Action in VLA Models
Researchers have introduced DIAL (Decoupling Intent and Action via Latent World Modeling), a framework for Vision-Language-Action (VLA) models that separates high-level decision-making from low-level motor execution. Existing end-to-end VLAs treat Vision-Language Models (VLMs) primarily as multimodal encoders; DIAL instead uses a VLM-based System-2 for latent world modeling, synthesizing visual foresight within the VLM's native feature space. This foresight encodes intent and acts as a structural bottleneck, which a lightweight System-1 policy then decodes into actions. The design addresses two shortcomings of end-to-end training: instability and underutilization of the VLM's semantic representations. The paper is available on arXiv under identifier 2603.29844.
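The paper's code is not reproduced here; the following is a minimal PyTorch sketch of how such a decoupled System-2/System-1 pipeline could be wired. All module names, dimensions, and the attention-based foresight readout are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LatentWorldModel(nn.Module):
    """Hypothetical System-2: predicts latent visual foresight tokens
    in the VLM's feature space from current multimodal features."""
    def __init__(self, dim=512, num_foresight_tokens=8):
        super().__init__()
        # Learned queries that read foresight ("intent") out of VLM tokens.
        self.queries = nn.Parameter(torch.randn(num_foresight_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, vlm_features):
        # vlm_features: (B, T, dim) token features from a VLM backbone.
        q = self.queries.unsqueeze(0).expand(vlm_features.size(0), -1, -1)
        foresight, _ = self.attn(q, vlm_features, vlm_features)
        # A small, fixed set of foresight tokens acts as the bottleneck.
        return self.proj(foresight)  # (B, K, dim) latent "intent"

class ActionPolicy(nn.Module):
    """Hypothetical System-1: lightweight policy that decodes the intent
    tokens into a chunk of low-level actions."""
    def __init__(self, dim=512, action_dim=7, horizon=16):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(),
            nn.Linear(dim, action_dim * horizon),
        )
        self.action_dim, self.horizon = action_dim, horizon

    def forward(self, foresight):
        pooled = foresight.mean(dim=1)  # compress the bottleneck tokens
        return self.decoder(pooled).view(-1, self.horizon, self.action_dim)

# Toy forward pass: System-2 produces intent, System-1 decodes actions.
vlm_features = torch.randn(2, 64, 512)   # stand-in for VLM token features
intent = LatentWorldModel()(vlm_features)
actions = ActionPolicy()(intent)
print(actions.shape)  # torch.Size([2, 16, 7])
```

The structural point this sketch tries to capture is that the policy never sees raw VLM tokens: everything it conditions on must pass through the small set of foresight tokens, which is what makes the foresight a bottleneck.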
Key facts
- DIAL stands for Decoupling Intent and Action via Latent World Modeling.
- The framework targets end-to-end Vision-Language-Action (VLA) models.
- It uses a VLM-based System-2 for latent world modeling.
- System-2 synthesizes latent visual foresight in the VLM's native feature space.
- The foresight encodes intent and serves as a structural bottleneck.
- A lightweight System-1 policy decodes intent into low-level actions.
- The approach aims to reduce training instability and make better use of the VLM's semantic representations (a hedged training-objective sketch follows this list).
- The paper is published on arXiv with ID 2603.29844.
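Continuing the sketch above, one plausible way to train such a system is to supervise the foresight tokens against the VLM features of future observations (the "latent world modeling" part) while jointly imitating expert actions. The pooling, MSE targets, and equal loss weighting below are assumptions for illustration, not the paper's recipe.

```python
import torch
import torch.nn.functional as F

def training_step(world_model, policy, vlm_features,
                  future_vlm_features, expert_actions):
    # System-2 predicts latent foresight ("intent") from current features.
    intent = world_model(vlm_features)            # (B, K, D)
    # World-modeling loss: regress pooled foresight onto pooled VLM
    # features of future observations (pooling choice is an assumption).
    pred = intent.mean(dim=1)                     # (B, D)
    target = future_vlm_features.mean(dim=1)      # (B, D)
    foresight_loss = F.mse_loss(pred, target)
    # Imitation loss on the decoded low-level action chunk.
    action_loss = F.mse_loss(policy(intent), expert_actions)
    # Equal weighting is an arbitrary choice for this sketch.
    return foresight_loss + action_loss

# Toy usage with the modules defined in the previous sketch.
loss = training_step(LatentWorldModel(), ActionPolicy(),
                     torch.randn(2, 64, 512),    # current VLM features
                     torch.randn(2, 64, 512),    # future VLM features
                     torch.randn(2, 16, 7))      # expert action chunk
```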
Entities
Institutions
- arXiv