Data Infrastructure as the Bottleneck in Vision-Language-Action Robotics
A recent survey suggests that progress in Vision-Language-Action (VLA) models is shaped more by data infrastructure than by architectural design. It examines VLA research through three pillars: datasets, benchmarks, and data engines. Real-world and synthetic datasets are classified by embodiment diversity, modality composition, and action space formulation, which exposes a persistent trade-off between data fidelity and collection cost. The benchmark analysis finds structural gaps in how compositional generalization and long-horizon reasoning are evaluated. The survey concludes that high-fidelity data engines should be co-designed with structured evaluation methods.
Key facts
- The survey is organized around datasets, benchmarks, and data engines.
- It categorizes real-world and synthetic corpora by embodiment diversity, modality composition, and action space formulation.
- A persistent fidelity-cost trade-off constrains large-scale collection.
- Benchmark analysis reveals structural gaps in compositional generalization and long-horizon reasoning evaluation.
- The paper argues that future VLA advances depend on co-designing high-fidelity data engines with structured evaluation.
- The source is arXiv:2604.23001.
- The survey takes a data-centric rather than architecture-centric view of VLA research.
Entities
Institutions
- arXiv