ARTFEED — Contemporary Art Intelligence

ForgeVLA: Federated Robot Learning Without Language Labels

ai-technology · 2026-05-11

Researchers propose ForgeVLA, a federated learning framework for Vision-Language-Action (VLA) models that eliminates the need for manual language annotations. The system trains on distributed vision-action pairs from robots across different domains without centralizing raw data. Each client uses an embodied instruction classifier to map vision-action pairs to predefined instructions, recovering the missing language modality. This approach addresses data heterogeneity and privacy constraints while scaling VLA training efficiently. The paper is available on arXiv under ID 2605.07474.
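The core idea of the embodied instruction classifier can be illustrated with a minimal sketch. The summary does not specify the classifier's architecture or the instruction set, so everything below (the instruction strings, the prototype features, a nearest-prototype decision rule) is an illustrative assumption, not the paper's method:

```python
import math

# Hypothetical predefined instruction set; ForgeVLA's actual set is
# not given in this summary.
INSTRUCTIONS = ["pick up the object", "place the object", "open the drawer"]

# Stand-in prototype features per instruction. In the real system these
# would be learned from embodied vision-action data, not hand-coded.
PROTOTYPES = {
    "pick up the object": [1.0, 0.0, 0.0],
    "place the object":   [0.0, 1.0, 0.0],
    "open the drawer":    [0.0, 0.0, 1.0],
}

def classify_instruction(vision_action_feature):
    """Map a vision-action feature vector to the nearest predefined
    instruction, recovering the missing language label for VLA training."""
    def dist(proto):
        return math.sqrt(sum((f - p) ** 2
                             for f, p in zip(vision_action_feature, proto)))
    return min(INSTRUCTIONS, key=lambda ins: dist(PROTOTYPES[ins]))

print(classify_instruction([0.9, 0.1, 0.0]))  # → pick up the object
```

The point of the sketch is the interface, not the model: each client turns an unlabeled vision-action pair into one of a fixed set of instructions locally, so VLA training can proceed with a language modality that was never manually annotated.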

Key facts

  • ForgeVLA is a federated VLA training framework.
  • It learns from distributed vision-action pairs without centralizing raw data.
  • No manual language annotations are required.
  • Each client uses an embodied instruction classifier.
  • The classifier maps vision-action pairs to a predefined instruction set.
  • The approach addresses data heterogeneity and privacy constraints.
  • The paper is on arXiv with ID 2605.07474.
  • The framework aims to scale VLA models for general-purpose robotic intelligence.
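The "distributed training without centralizing raw data" claim above follows the usual federated pattern: clients update a local model copy on private data, and only parameters travel to the server. The summary does not name ForgeVLA's aggregation rule, so this is a generic federated-averaging (FedAvg-style) sketch on a toy regression problem, with all data and model shapes invented for illustration:

```python
def local_update(weights, local_data, lr=0.1):
    """One gradient step of least-squares y ≈ w · x on a client's
    private data. The data never leaves this function's caller."""
    grad = [0.0] * len(weights)
    for x, y in local_data:
        err = sum(w * xi for w, xi in zip(weights, x)) - y
        for i, xi in enumerate(x):
            grad[i] += err * xi
    n = len(local_data)
    return [w - lr * g / n for w, g in zip(weights, grad)]

def fed_avg(client_weights):
    """Server step: element-wise average of client parameters."""
    n = len(client_weights)
    return [sum(ws) / n for ws in zip(*client_weights)]

# Two clients with heterogeneous private datasets, both consistent
# with the true relation y = 2*x0 + 1*x1.
clients = [
    [([1.0, 0.0], 2.0), ([0.0, 1.0], 1.0)],
    [([1.0, 1.0], 3.0), ([2.0, 0.0], 4.0)],
]

weights = [0.0, 0.0]
for _ in range(200):  # communication rounds
    updates = [local_update(weights, data) for data in clients]
    weights = fed_avg(updates)

print([round(w, 2) for w in weights])  # → [2.0, 1.0]
```

Only the weight lists cross the client-server boundary, which is the property the summary highlights: the server learns a shared model while each robot's raw vision-action data stays on-device.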

Entities

Institutions

  • arXiv
