UAM: A Dual-Stream Perspective on Forgetting in VLA Training

ai-technology · 2026-05-18

A recent arXiv preprint (2605.15735) argues that the standard practice of fine-tuning a pretrained vision-language model (VLM) into a vision-language-action (VLA) model gradually erodes the model's multimodal abilities, a cost the researchers call the 'embodiment tax.' They attribute this decline to a structural bottleneck: existing VLAs use a single encoder for both language-aligned semantics and the visual features needed for control, whereas biological vision separates recognition from visuomotor control. To address this, they introduce the Unified Action Model (UAM), which adds a parallel Dorsal Expert modeled on the brain's dorsal pathway. The Dorsal Expert is initialized from a pretrained generative model and trained with a mid-level objective, which reduces the control-learning burden placed on the VLM. The abstract does not name authors or affiliations.
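The dual-stream idea can be illustrated with a toy sketch. This is not the paper's implementation: the class name DualStreamPolicy, the helper nudge_dorsal_expert, the dimensions, the linear layers, and the concatenation-based fusion are all hypothetical choices made for illustration. It only shows the structural point that, when control features live in a separate parallel stream, an update to that stream leaves the semantic encoder's output unchanged.

```python
import random


def random_matrix(rows, cols, rng):
    """Return a rows x cols matrix of small random weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def linear(weights, x):
    """Apply a weight matrix (a list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class DualStreamPolicy:
    """Hypothetical two-stream policy; every layer here is an assumption."""

    def __init__(self, obs_dim=8, feat_dim=4, action_dim=2, seed=0):
        rng = random.Random(seed)
        # Stands in for the pretrained VLM encoder (semantic stream).
        self.semantic_encoder = random_matrix(feat_dim, obs_dim, rng)
        # Stands in for the parallel Dorsal Expert (control stream).
        self.dorsal_expert = random_matrix(feat_dim, obs_dim, rng)
        # Assumed fusion: concatenate both feature vectors, then project.
        self.action_head = random_matrix(action_dim, 2 * feat_dim, rng)

    def semantic_features(self, obs):
        return linear(self.semantic_encoder, obs)

    def control_features(self, obs):
        return linear(self.dorsal_expert, obs)

    def act(self, obs):
        fused = self.semantic_features(obs) + self.control_features(obs)
        return linear(self.action_head, fused)


def nudge_dorsal_expert(policy, delta):
    """Stand-in for one control-training step; touches only dorsal weights."""
    policy.dorsal_expert = [[w + delta for w in row] for row in policy.dorsal_expert]


policy = DualStreamPolicy()
obs = [0.5] * 8

semantic_before = policy.semantic_features(obs)
control_before = policy.control_features(obs)

nudge_dorsal_expert(policy, 0.1)

semantic_after = policy.semantic_features(obs)
control_after = policy.control_features(obs)

print("semantic features unchanged:", semantic_before == semantic_after)
print("control features changed:", control_before != control_after)
```

In a single-encoder design, the same weights would produce both feature sets, so any control-driven update would also shift the semantic features. How UAM actually couples the two streams and what its mid-level objective looks like are not described in the summary above.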

Key facts

Paper arXiv:2605.15735 proposes Unified Action Model (UAM).
Standard VLA fine-tuning causes an 'embodiment tax': erosion of multimodal competence.
Bottleneck identified: single encoder for semantics and control.
UAM adds a parallel Dorsal Expert inspired by biological vision.
Dorsal Expert initialized from pretrained generative model.
Mid-level training objective reduces control-learning burden on VLM.
Announcement type: cross (an arXiv cross-listing).
No authors or institutions named in abstract.

Entities

—

Sources

arXiv cs.AI — 2026-05-18