DeMaVLA: A VLA Foundation Model for Generalizable Deformable Manipulation
DeMaVLA is a Vision-Language-Action (VLA) foundation model for generalizable deformable manipulation, with a focus on folding deformable objects. It pairs a VLM backbone with an action expert and formulates continuous action generation with flow matching. To make the action expert more efficient, every other transformer layer in it is pruned. The model targets a limitation of existing VLA systems, which train separate policies for different object categories, while naively mixed multi-task training often suffers from task interference and reduced effectiveness. It is intended for household robots that must handle clothing across diverse initial states, materials, shapes, and environments. The work is published as arXiv paper 2605.31286.
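For readers unfamiliar with flow matching, the following is a minimal sketch of the general technique, not DeMaVLA's implementation. It assumes the common straight-line (rectified-flow style) path from a noise sample at t=0 to a target action at t=1; the summary does not state which path, time convention, or solver the paper uses. The function names (`interpolate`, `target_velocity`, `flow_matching_loss`, `sample_action`) and the `predict_velocity` callable standing in for the action expert are hypothetical.

```python
def interpolate(noise, action, t):
    """Point on the straight path from noise (t=0) to action (t=1). Assumed convention."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]


def target_velocity(noise, action):
    """Velocity of the straight path; it is constant in t."""
    return [a - n for n, a in zip(noise, action)]


def flow_matching_loss(predict_velocity, noise, action, t):
    """Mean squared error between predicted and target velocity at time t."""
    x_t = interpolate(noise, action, t)
    v_pred = predict_velocity(x_t, t)  # stand-in for the action expert
    v_true = target_velocity(noise, action)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)


def sample_action(predict_velocity, noise, num_steps=10):
    """Generate an action by Euler-integrating the velocity field from t=0 to t=1."""
    x = list(noise)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = predict_velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In a VLA setting the velocity predictor would also be conditioned on image and language features from the VLM backbone; that conditioning is omitted here to keep the sketch self-contained.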
Key facts
- DeMaVLA is a VLA foundation model for generalizable deformable manipulation.
- It targets deformable-object folding, a representative challenge for household robots.
- The model uses a VLM backbone with an action expert.
- Continuous action generation is formulated using flow matching.
- Action expert efficiency is improved by pruning every other transformer layer.
- Existing VLA systems train separate policies for different object categories.
- Naively mixed multi-task training often suffers from task interference.
- The research is published on arXiv with ID 2605.31286.
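The layer-pruning idea in the list above can be illustrated with a short sketch. This is an assumption-laden illustration rather than the paper's procedure: the summary does not say how deep the action expert is, whether the even- or odd-indexed layers are kept, or whether the pruned model is fine-tuned afterwards. The helper `prune_every_other` and the 12-layer stack are hypothetical.

```python
def prune_every_other(layers, keep_first=True):
    """Return a new list holding every other layer of the stack.

    keep_first=True keeps indices 0, 2, 4, ...; False keeps 1, 3, 5, ...
    Which half DeMaVLA keeps is not stated in the summary (assumption).
    """
    start = 0 if keep_first else 1
    return layers[start::2]


# Hypothetical 12-layer action expert, represented by layer names only.
expert_layers = [f"block_{i}" for i in range(12)]
pruned_layers = prune_every_other(expert_layers)
```

Halving the depth this way roughly halves the per-step compute of the action expert, which matters because flow-matching inference calls the expert once per integration step.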
Entities
Institutions
- arXiv