VLA-AD: Distilling Large Vision-Language-Action Models into Lightweight Policies

ai-technology · 2026-05-18

Researchers have introduced VLA-AD, a distillation framework that uses a Vision-Language Model (VLM) as an offline semantic supervisor to compress billion-parameter Vision-Language-Action (VLA) policies into compact student models. The framework augments the teacher's action targets with high-level semantic cues, namely task phase anchors and multi-frame descriptions of operating direction, which are used only during training. At test time, the student policy runs on its own, with no dependence on the teacher or the VLM. Evaluated on three LIBERO benchmark suites with OpenVLA-7B as the teacher, VLA-AD yields a 158M-parameter student, roughly a 44× reduction in model size, addressing a significant challenge for real-time closed-loop control in robotic manipulation.
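The training-only role of the semantic cues can be illustrated with a small sketch. The summary above does not give the actual objective, so everything here is an assumption made for illustration: the mean-squared-error action term, the cross-entropy phase term, the integer encoding of phase anchors, and the names `distillation_loss` and `aux_weight` are all hypothetical.

```python
import math

def mse(pred, target):
    """Mean squared error between two equal-length action vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def cross_entropy(logits, label):
    """Cross-entropy of one example, using a numerically stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]

def distillation_loss(student_action, teacher_action,
                      phase_logits, phase_label, aux_weight=0.5):
    """Hypothetical training objective (not the paper's stated loss).

    action term:   student imitates the teacher's action target
    semantic term: student predicts the VLM-provided task phase anchor,
                   assumed here to be encoded as an integer class label
    """
    action_term = mse(student_action, teacher_action)
    semantic_term = cross_entropy(phase_logits, phase_label)
    return action_term + aux_weight * semantic_term

# Training step: both the teacher action and the VLM phase label are available.
loss = distillation_loss(
    student_action=[0.1, 0.0, -0.2, 0.0, 0.0, 0.0, 1.0],
    teacher_action=[0.0, 0.0, -0.1, 0.0, 0.0, 0.0, 1.0],
    phase_logits=[2.0, 0.5, -1.0],
    phase_label=0,
)
print(f"training loss: {loss:.4f}")
```

At test time only the student's action output would be used, so the phase head and the VLM labels add no inference cost in this sketch. The multi-frame operating-direction descriptions could be handled with a second auxiliary term of the same shape, which is again an assumption rather than something stated in the summary.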

Key facts

VLA-AD uses a Vision-Language Model as an offline semantic supervisor.
It distills large VLA teachers into lightweight student policies.
It augments teacher-provided 7-DoF action targets with semantic guidance.
Semantic guidance includes task phase anchors and multi-frame operating-direction descriptions.
Auxiliary signals are used only during training.
At test time, the student policy runs independently.
It is evaluated on three LIBERO benchmark suites.
With OpenVLA-7B as the teacher, it produces a 158M-parameter student, a roughly 44× reduction in model size.
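
The 44× figure follows directly from the two parameter counts. The short check below assumes a nominal 7 billion parameters for the teacher, taken from the "7B" in its name; the exact count is not given here.

```python
TEACHER_PARAMS = 7_000_000_000  # assumed nominal size implied by "OpenVLA-7B"
STUDENT_PARAMS = 158_000_000    # student size reported for VLA-AD

def compression_ratio(teacher_params, student_params):
    """Return how many times smaller the student is than the teacher."""
    return teacher_params / student_params

ratio = compression_ratio(TEACHER_PARAMS, STUDENT_PARAMS)
print(f"{ratio:.1f}x")  # prints 44.3x
```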

Sources

arXiv cs.AI — 2026-05-18