EmbodiedMidtrain: Bridging VLM-VLA Gap via Mid-Training
Researchers have introduced EmbodiedMidtrain, a mid-training approach for adapting Vision-Language Models (VLMs) into Vision-Language-Action models (VLAs). Analyzing data distributions, they find that VLA data occupy compact regions distinct from the broader VLM distribution, and that alignment with VLA data varies both across and within VLM data sources. Their mid-training data engine uses a lightweight, learnable proximity estimator to select VLA-aligned candidates from a large VLM data pool, then mid-trains the VLM on the selected data before downstream VLA fine-tuning. Experiments on three robot manipulation benchmarks show consistent performance gains.
Key facts
- EmbodiedMidtrain bridges the gap between VLMs and VLAs
- VLA data occupy compact regions separate from VLM distribution
- Alignment varies across and within VLM data sources
- Uses a lightweight learnable proximity estimator for data selection
- Mid-training occurs before downstream VLA fine-tuning
- Tested on three robot manipulation benchmarks
- Consistent performance improvements observed
- Proposed in arXiv:2604.20012
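To make the selection step concrete, here is a minimal sketch of proximity-based data selection. All names, the embedding dimensions, and the logistic scoring form are illustrative assumptions, not the paper's actual estimator: it scores each sample in a broad "VLM pool" by distance to the centroid of a compact "VLA" cluster and keeps the top-k candidates for mid-training.

```python
import math
import random

random.seed(0)

def rand_vec(dim, loc, scale):
    """Random embedding vector (hypothetical stand-in for real features)."""
    return [random.gauss(loc, scale) for _ in range(dim)]

# Hypothetical embeddings: VLA data occupy a compact region; the VLM pool is broad.
DIM = 8
vla_emb = [rand_vec(DIM, 2.0, 0.3) for _ in range(50)]
vlm_pool = [rand_vec(DIM, 0.0, 1.5) for _ in range(500)]

def dist(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Proximity score: distance to the VLA centroid passed through a logistic link.
# This is a simple stand-in for the paper's learnable estimator, whose exact
# parameterization is not specified here.
centroid = [sum(col) / len(vla_emb) for col in zip(*vla_emb)]
radii = sorted(dist(v, centroid) for v in vla_emb)
scale = radii[len(radii) // 2]  # median VLA cluster radius as a length scale
scores = [1.0 / (1.0 + math.exp(dist(v, centroid) / scale - 3.0)) for v in vlm_pool]

# Keep the top-k most VLA-aligned candidates for mid-training.
k = 100
selected = sorted(range(len(vlm_pool)), key=lambda i: -scores[i])[:k]
print(len(selected))
```

Any monotone proximity score would serve the same role; the point is that a cheap scorer over embeddings can filter a large VLM pool down to the candidates nearest the compact VLA region before mid-training begins.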