FineVLA: Fine-Grained Instruction Alignment for Steerable Robot Policies
Vision-language-action (VLA) models are increasingly expected to follow human instructions when performing tasks, but existing robot datasets lack the execution-critical detail needed to describe how an action is carried out. FineVLA is an open framework for action-aligned, fine-grained VLA supervision. It comprises four components: a data construction tool that unifies 972,247 trajectories across 85K tasks from 10 open-source robot datasets; FineVLA-Data, a human-verified dataset of 47,159 fine-grained trajectories; a held-out benchmark of 500 videos, 10,816 atomic facts, and 1,030 visual question answering (VQA) questions; and a robotics-specialized VLM annotator for scalable fine-grained annotation.
Key facts
- FineVLA is an open framework for fine-grained VLA supervision
- Data construction tool unifies 972,247 trajectories across 85K tasks from 10 open-source robot datasets
- FineVLA-Data is a human-verified dataset of 47,159 fine-grained trajectories
- Held-out benchmark includes 500 videos, 10,816 atomic facts, and 1,030 VQA questions
- Includes a robotics-specialized VLM annotator for scalable fine-grained annotation
- Addresses lack of execution-critical details in existing robot datasets
- Enables steerable policy learning and robotic video understanding
- Published on arXiv as 2605.27284
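To put the counts above in proportion, the short sketch below derives a few ratios from them. It uses only the numbers quoted in this summary; the constant and function names are illustrative and are not part of FineVLA. The size ratio between the human-verified set and the unified pool is a comparison of counts only, since the summary does not state whether the verified trajectories are a subset of that pool.

```python
# Counts quoted in the summary above. Names are illustrative, not FineVLA's.
UNIFIED_TRAJECTORIES = 972_247   # trajectories unified by the data construction tool
VERIFIED_TRAJECTORIES = 47_159   # human-verified fine-grained trajectories
BENCHMARK_VIDEOS = 500           # videos in the held-out benchmark
ATOMIC_FACTS = 10_816            # atomic facts in the benchmark
VQA_QUESTIONS = 1_030            # VQA questions in the benchmark


def summary_ratios() -> dict[str, float]:
    """Derive simple ratios from the published counts."""
    return {
        # Size of the verified set relative to the unified pool (counts only;
        # a subset relationship is an assumption, not stated in the summary).
        "verified_share": VERIFIED_TRAJECTORIES / UNIFIED_TRAJECTORIES,
        # Average benchmark density per video.
        "facts_per_video": ATOMIC_FACTS / BENCHMARK_VIDEOS,
        "questions_per_video": VQA_QUESTIONS / BENCHMARK_VIDEOS,
    }


if __name__ == "__main__":
    for name, value in summary_ratios().items():
        print(f"{name}: {value:.4f}")
```

On these figures, the benchmark averages about 21.6 atomic facts and about 2 VQA questions per video, and the human-verified set is roughly 4.9% the size of the unified trajectory pool.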
Entities
Institutions
- arXiv