ARTFEED — Contemporary Art Intelligence

VLA Foundry Framework Unifies Vision-Language-Action Model Training with Open-Source Release

ai-technology · 2026-04-22

VLA Foundry introduces an open-source framework that consolidates training for large language models (LLMs), vision-language models (VLMs), and vision-language-action models (VLAs) in a single codebase. Unlike approaches that focus solely on the action training stage, the framework provides a complete training stack with end-to-end control, from initial language pretraining through action-expert fine-tuning. It supports both training models from scratch and using pretrained backbones available through Hugging Face. To demonstrate its capabilities, the framework has been used to train and release two model types: one trained entirely from scratch through an LLM→VLM→VLA pipeline, and another built on the pretrained Qwen3-VL backbone. Both models were evaluated for closed-loop policy performance in LBM Eval, an open-data, open-source simulator, and the project also contributed usability improvements to the simulator and to STEP analysis tools to encourage broader public adoption. By providing unified infrastructure, the framework addresses the incompatible pretraining pipelines common across existing open-source VLA efforts.
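
The staged pipeline is easiest to picture in code. The sketch below is a minimal, hypothetical PyTorch illustration of an LLM→VLM→VLA setup with per-stage freezing; every class, method, and parameter name is invented for this example and does not correspond to the VLA Foundry API or its released models.

    # Hypothetical sketch of a staged LLM -> VLM -> VLA pipeline (not the VLA Foundry API).
    import torch
    import torch.nn as nn

    class ToyVLAPolicy(nn.Module):
        """Language backbone + vision adapter + action-expert head."""
        def __init__(self, d_model=256, vocab=1000, vision_dim=512, action_dim=7):
            super().__init__()
            # Stage 1 (LLM): language backbone, trained from scratch or initialized from a checkpoint.
            self.token_emb = nn.Embedding(vocab, d_model)
            self.backbone = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2
            )
            # Stage 2 (VLM): adapter that projects image features into the language space.
            self.vision_proj = nn.Linear(vision_dim, d_model)
            # Stage 3 (VLA): action expert that decodes continuous robot actions.
            self.action_head = nn.Sequential(
                nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, action_dim)
            )

        def set_stage(self, stage: str):
            """Freeze everything, then unfreeze only the components this stage trains."""
            for p in self.parameters():
                p.requires_grad = False
            targets = {"llm": [self.token_emb, self.backbone],
                       "vlm": [self.vision_proj],
                       "vla": [self.action_head]}[stage]
            for module in targets:
                for p in module.parameters():
                    p.requires_grad = True

        def forward(self, tokens, image_feats):
            # Prepend projected image tokens to the text tokens and encode them jointly.
            x = torch.cat([self.vision_proj(image_feats), self.token_emb(tokens)], dim=1)
            h = self.backbone(x)
            return self.action_head(h[:, -1])  # predict an action vector from the final token state

    policy = ToyVLAPolicy()
    policy.set_stage("vla")  # e.g. fine-tune only the action expert on robot demonstrations
    actions = policy(torch.randint(0, 1000, (2, 16)), torch.randn(2, 4, 512))
    print(actions.shape)  # torch.Size([2, 7])

In the from-scratch variant all three stages would run in sequence, whereas the Qwen3-VL variant would replace the first two stages with a pretrained backbone loaded from Hugging Face.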

Key facts

  • VLA Foundry is an open-source framework for unified model training
  • It combines LLM, VLM, and VLA training in a single codebase
  • The framework provides end-to-end control from pretraining to fine-tuning
  • It supports both from-scratch training and pretrained backbones from Hugging Face
  • Two model types were trained: one from scratch and one built on the Qwen3-VL backbone
  • Models were evaluated for closed-loop policy performance in the LBM Eval simulator (see the sketch after this list)
  • Usability improvements were made to the simulator and STEP analysis tools
  • The framework addresses incompatible pretraining pipelines in existing VLA efforts
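
"Closed-loop" here means the policy's own actions are executed in the simulator and each new observation depends on them, so errors compound across a rollout rather than being scored frame by frame. The generic sketch below illustrates that evaluation loop with a dummy stand-in environment; the reset/step interface and all names are assumptions for illustration, not the LBM Eval API.

    # Hypothetical closed-loop evaluation loop (generic reset/step interface, not the LBM Eval API).
    import random

    class DummyPickPlaceEnv:
        """Stand-in simulator; a real evaluation would step an actual simulator such as LBM Eval."""
        def reset(self):
            self.t = 0
            return {"image": None, "instruction": "put the block in the bin"}

        def step(self, action):
            self.t += 1
            done = self.t >= 10
            success = done and random.random() < 0.5  # placeholder outcome
            return {"image": None, "instruction": "put the block in the bin"}, done, success

    def closed_loop_success_rate(env, policy, episodes=20, max_steps=100):
        """Each predicted action is executed, and the next observation depends on it."""
        successes = 0
        for _ in range(episodes):
            obs = env.reset()
            for _ in range(max_steps):
                obs, done, success = env.step(policy(obs))
                if done:
                    successes += int(success)
                    break
        return successes / episodes

    print(closed_loop_success_rate(DummyPickPlaceEnv(), policy=lambda obs: [0.0] * 7))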

Entities

Institutions

  • Hugging Face
