Data Infrastructure as the Bottleneck in Vision-Language-Action Robotics

other · 2026-04-29

A recent survey argues that progress in Vision-Language-Action (VLA) models is driven more by data infrastructure than by architectural design. It examines VLA research through three components: datasets, benchmarks, and data engines. It classifies both real-world and synthetic datasets by embodiment diversity, modality composition, and action space formulation, and highlights a trade-off between fidelity and cost. Its analysis of benchmarks finds gaps in how compositional generalization and long-horizon reasoning are evaluated. The survey concludes that high-fidelity data engines should be designed together with structured evaluation methods.

Key facts

The survey is data-centric and organized around three pillars: datasets, benchmarks, and data engines.
It categorizes real-world and synthetic corpora by embodiment diversity, modality composition, and action space formulation.
A persistent fidelity-cost trade-off constrains large-scale data collection.
Benchmark analysis reveals structural gaps in the evaluation of compositional generalization and long-horizon reasoning.
The paper argues that future VLA advances depend on co-designing data engines and evaluation methods.
The source is arXiv:2604.23001.
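
The three classification axes named above (embodiment diversity, modality composition, action space formulation) can be pictured as fields on a dataset record. The following is a minimal Python sketch under stated assumptions, not the survey's own schema: the `DatasetCard` type, the `group_by_axis` helper, the entry names, and the axis values such as `end_effector_delta` are all hypothetical placeholders chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    REAL = "real"
    SYNTHETIC = "synthetic"


@dataclass(frozen=True)
class DatasetCard:
    """Hypothetical record describing one corpus along the three axes."""
    name: str
    source: Source
    embodiments: frozenset   # embodiment diversity: robot platforms covered
    modalities: frozenset    # modality composition: sensor and text streams
    action_space: str        # action space formulation (label is assumed)


def group_by_axis(cards, axis):
    """Group dataset names by the value of one attribute on the card."""
    groups = defaultdict(list)
    for card in cards:
        groups[getattr(card, axis)].append(card.name)
    return dict(groups)


# Placeholder entries only; these are not real datasets.
cards = [
    DatasetCard("example_real_a", Source.REAL,
                frozenset({"single_arm"}),
                frozenset({"rgb", "language"}),
                "end_effector_delta"),
    DatasetCard("example_sim_b", Source.SYNTHETIC,
                frozenset({"single_arm", "mobile_base"}),
                frozenset({"rgb", "depth", "language"}),
                "joint_position"),
    DatasetCard("example_sim_c", Source.SYNTHETIC,
                frozenset({"humanoid"}),
                frozenset({"rgb", "language"}),
                "end_effector_delta"),
]

by_action = group_by_axis(cards, "action_space")
by_source = group_by_axis(cards, "source")
```

Grouping by a single axis at a time keeps the sketch simple; a fuller catalogue would likely also record collection cost, which is where the fidelity-cost trade-off would show up.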
