EmbodiedMidtrain: Bridging VLM-VLA Gap via Mid-Training
Researchers have introduced EmbodiedMidtrain, a mid-training approach for adapting Vision-Language Models (VLMs) into Vision-Language-Action models (VLAs). Analyzing data distributions, they find that VLA data occupy compact regions distinct from the broader VLM distribution, and that alignment with VLA data varies both across and within VLM data sources. Their mid-training data engine uses a lightweight, learnable proximity estimator to select VLA-aligned candidates from a large VLM data pool, then mid-trains the VLM on the selected data before downstream VLA fine-tuning. Experiments on three robot manipulation benchmarks show consistent performance gains.
Key facts
- EmbodiedMidtrain bridges the gap between VLMs and VLAs
- VLA data occupy compact regions separate from VLM distribution
- Alignment varies across and within VLM data sources
- Uses a lightweight learnable proximity estimator for data selection
- Mid-training occurs before downstream VLA fine-tuning
- Tested on three robot manipulation benchmarks
- Consistent performance improvements observed
- Proposed in arXiv:2604.20012
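To make the selection step concrete, here is a minimal sketch of proximity-based data selection. All names, the embedding dimensions, and the logistic scoring form are illustrative assumptions, not the paper's actual estimator: it scores each sample in a broad "VLM pool" by distance to the centroid of a compact "VLA" cluster and keeps the top-k candidates for mid-training.

```python
import math
import random

random.seed(0)

def rand_vec(dim, loc, scale):
    """Random embedding vector (hypothetical stand-in for real features)."""
    return [random.gauss(loc, scale) for _ in range(dim)]

# Hypothetical embeddings: VLA data occupy a compact region; the VLM pool is broad.
DIM = 8
vla_emb = [rand_vec(DIM, 2.0, 0.3) for _ in range(50)]
vlm_pool = [rand_vec(DIM, 0.0, 1.5) for _ in range(500)]

def dist(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Proximity score: distance to the VLA centroid passed through a logistic link.
# This is a simple stand-in for the paper's learnable estimator, whose exact
# parameterization is not specified here.
centroid = [sum(col) / len(vla_emb) for col in zip(*vla_emb)]
radii = sorted(dist(v, centroid) for v in vla_emb)
scale = radii[len(radii) // 2]  # median VLA cluster radius as a length scale
scores = [1.0 / (1.0 + math.exp(dist(v, centroid) / scale - 3.0)) for v in vlm_pool]

# Keep the top-k most VLA-aligned candidates for mid-training.
k = 100
selected = sorted(range(len(vlm_pool)), key=lambda i: -scores[i])[:k]
print(len(selected))
```

Any monotone proximity score would serve the same role; the point is that a cheap scorer over embeddings can filter a large VLM pool down to the candidates nearest the compact VLA region before mid-training begins.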