ARTFEED — Contemporary Art Intelligence

VLA Foundry Framework Unifies Vision-Language-Action Model Training with Open-Source Release

ai-technology · 2026-04-22

VLA Foundry introduces an open-source framework that consolidates training for large language models (LLMs), vision-language models (VLMs), and vision-language-action models (VLAs) in a single codebase. Unlike approaches that focus solely on the action training stage, the framework provides a complete training stack with end-to-end control, from initial language pretraining through action-expert fine-tuning. It supports both training models from scratch and using pretrained backbones available through Hugging Face. To demonstrate its capabilities, the framework has been used to train and release two model types: one trained entirely from scratch through an LLM→VLM→VLA pipeline, and another built on the pretrained Qwen3-VL backbone. Both models were evaluated for closed-loop policy performance in LBM Eval, an open-data, open-source simulator, and the project also contributed usability improvements to the simulator and to STEP analysis tools to encourage broader public adoption. By providing unified infrastructure, the framework addresses the incompatible pretraining pipelines common across existing open-source VLA efforts.
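
The staged pipeline is easiest to picture in code. The sketch below is a minimal, hypothetical PyTorch illustration of an LLM→VLM→VLA setup with per-stage freezing; every class, method, and parameter name is invented for this example and does not correspond to the VLA Foundry API or its released models.

    # Hypothetical sketch of a staged LLM -> VLM -> VLA pipeline (not the VLA Foundry API).
    import torch
    import torch.nn as nn

    class ToyVLAPolicy(nn.Module):
        """Language backbone + vision adapter + action-expert head."""
        def __init__(self, d_model=256, vocab=1000, vision_dim=512, action_dim=7):
            super().__init__()
            # Stage 1 (LLM): language backbone, trained from scratch or initialized from a checkpoint.
            self.token_emb = nn.Embedding(vocab, d_model)
            self.backbone = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2
            )
            # Stage 2 (VLM): adapter that projects image features into the language space.
            self.vision_proj = nn.Linear(vision_dim, d_model)
            # Stage 3 (VLA): action expert that decodes continuous robot actions.
            self.action_head = nn.Sequential(
                nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, action_dim)
            )

        def set_stage(self, stage: str):
            """Freeze everything, then unfreeze only the components this stage trains."""
            for p in self.parameters():
                p.requires_grad = False
            targets = {"llm": [self.token_emb, self.backbone],
                       "vlm": [self.vision_proj],
                       "vla": [self.action_head]}[stage]
            for module in targets:
                for p in module.parameters():
                    p.requires_grad = True

        def forward(self, tokens, image_feats):
            # Prepend projected image tokens to the text tokens and encode them jointly.
            x = torch.cat([self.vision_proj(image_feats), self.token_emb(tokens)], dim=1)
            h = self.backbone(x)
            return self.action_head(h[:, -1])  # predict an action vector from the final token state

    policy = ToyVLAPolicy()
    policy.set_stage("vla")  # e.g. fine-tune only the action expert on robot demonstrations
    actions = policy(torch.randint(0, 1000, (2, 16)), torch.randn(2, 4, 512))
    print(actions.shape)  # torch.Size([2, 7])

In the from-scratch variant all three stages would run in sequence, whereas the Qwen3-VL variant would replace the first two stages with a pretrained backbone loaded from Hugging Face.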

Key facts

  • VLA Foundry is an open-source framework for unified model training
  • It combines LLM, VLM, and VLA training in a single codebase
  • The framework provides end-to-end control from pretraining to fine-tuning
  • It supports both from-scratch training and pretrained backbones from Hugging Face
  • Two model types were trained: one from scratch and one built on the Qwen3-VL backbone
  • Models were evaluated for closed-loop policy performance in the LBM Eval simulator (see the sketch after this list)
  • Usability improvements were made to the simulator and STEP analysis tools
  • The framework addresses incompatible pretraining pipelines in existing VLA efforts
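
"Closed-loop" here means the policy's own actions are executed in the simulator and each new observation depends on them, so errors compound across a rollout rather than being scored frame by frame. The generic sketch below illustrates that evaluation loop with a dummy stand-in environment; the reset/step interface and all names are assumptions for illustration, not the LBM Eval API.

    # Hypothetical closed-loop evaluation loop (generic reset/step interface, not the LBM Eval API).
    import random

    class DummyPickPlaceEnv:
        """Stand-in simulator; a real evaluation would step an actual simulator such as LBM Eval."""
        def reset(self):
            self.t = 0
            return {"image": None, "instruction": "put the block in the bin"}

        def step(self, action):
            self.t += 1
            done = self.t >= 10
            success = done and random.random() < 0.5  # placeholder outcome
            return {"image": None, "instruction": "put the block in the bin"}, done, success

    def closed_loop_success_rate(env, policy, episodes=20, max_steps=100):
        """Each predicted action is executed, and the next observation depends on it."""
        successes = 0
        for _ in range(episodes):
            obs = env.reset()
            for _ in range(max_steps):
                obs, done, success = env.step(policy(obs))
                if done:
                    successes += int(success)
                    break
        return successes / episodes

    print(closed_loop_success_rate(DummyPickPlaceEnv(), policy=lambda obs: [0.0] * 7))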

Entities

Institutions

  • Hugging Face
