ARTFEED — Contemporary Art Intelligence

EmbodiedMidtrain: Bridging VLM-VLA Gap via Mid-Training

ai-technology · 2026-04-24

Researchers have introduced EmbodiedMidtrain, a mid-training approach for adapting Vision-Language Models (VLMs) into Vision-Language-Action models (VLAs). Analyzing data distributions, they find that VLA data occupy compact regions set apart from the broader VLM distribution, and that alignment with VLA data varies both across and within VLM data sources. Their mid-training data engine uses a lightweight, learnable proximity estimator to select VLA-aligned candidates from a large VLM data pool, then mid-trains the VLM on the selected data before downstream VLA fine-tuning. On three robot manipulation benchmarks, the method yields consistent performance gains.
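The selection step lends itself to a short sketch. Below is a minimal, hypothetical illustration of a learnable proximity estimator: a small logistic classifier trained to separate VLA-data embeddings from random VLM-pool embeddings, whose scores then rank the pool. The embedding setup, the logistic form, and every name here (embed_dim, select_candidates, and so on) are illustrative assumptions, not the paper's actual design.

# Hypothetical sketch of the proximity-estimator idea: score how close each
# VLM-pool sample sits to the VLA data cluster, keep the top candidates.
import numpy as np

rng = np.random.default_rng(0)
embed_dim = 32

# Stand-ins for precomputed sample embeddings (e.g., from a frozen encoder).
vla_embeddings = rng.normal(loc=1.0, size=(200, embed_dim))   # compact VLA cluster
vlm_pool = rng.normal(loc=0.0, size=(5000, embed_dim))        # broad VLM pool

def train_proximity_estimator(pos, neg, lr=0.1, steps=300):
    """Logistic regression: pos = VLA samples (label 1), neg = pool samples (label 0)."""
    X = np.vstack([pos, neg])
    y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid proximity scores
        w -= lr * (X.T @ (p - y)) / len(y)      # mean gradient of the log-loss
        b -= lr * np.mean(p - y)
    return w, b

def select_candidates(pool, w, b, top_k=1000):
    """Score the whole VLM pool and keep the samples nearest the VLA cluster."""
    scores = 1.0 / (1.0 + np.exp(-(pool @ w + b)))
    return np.argsort(scores)[::-1][:top_k]

# Train against a random slice of the pool as negatives, then rank the full pool.
w, b = train_proximity_estimator(vla_embeddings, vlm_pool[:200])
midtrain_idx = select_candidates(vlm_pool, w, b)
print(f"selected {len(midtrain_idx)} VLA-aligned candidates for mid-training")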

Key facts

  • EmbodiedMidtrain bridges the gap between VLMs and VLAs
  • VLA data occupy compact regions separate from VLM distribution
  • Alignment varies across and within VLM data sources
  • Uses a lightweight learnable proximity estimator for data selection
  • Mid-training occurs before downstream VLA fine-tuning (see the two-stage sketch after this list)
  • Tested on three robot manipulation benchmarks
  • Consistent performance improvements observed
  • Proposed in arXiv:2604.20012
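
To make the staging concrete, here is a schematic two-stage recipe: mid-train the pretrained VLM on the selected VLA-aligned subset, then fine-tune on the downstream VLA task. The train stub and all dataset names are placeholders assumed for illustration, not the paper's API.

# Hypothetical two-stage ordering: mid-training precedes VLA fine-tuning.
def train(model_state, dataset, steps):
    """Placeholder training loop: returns an updated model-state label."""
    return f"{model_state} -> trained({dataset}, {steps} steps)"

state = "pretrained_vlm"
state = train(state, "vla_aligned_vlm_subset", steps=10_000)      # stage 1: mid-training
state = train(state, "robot_manipulation_vla_data", steps=5_000)  # stage 2: VLA fine-tuning
print(state)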
