AVP: Visual Primitives for Robotic Manipulation
AVP (Action with Visual Primitives) is a proposed architecture for robotic manipulation. It decouples vision-language understanding from motor control: the vision-language model (VLM) emits visual-primitive tokens, and those tokens condition a flow-matching action expert. The stated aim is better learning efficiency and generalization than architectures that entangle perception and control. The authors evaluate the approach in real-robot experiments on pick-and-place tasks.
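To make the conditioning step concrete, here is a minimal sketch of how a flow-matching sampler can turn noise into an action given a conditioning vector. It is not the paper's implementation: `velocity` and `sample_action` are hypothetical names, the learned network is replaced by the closed-form velocity of a straight-line path, and `cond` is a toy stand-in for whatever the decoded visual-primitive tokens encode.

```python
import random


def velocity(action, t, cond):
    """Toy stand-in for a learned velocity field v(a, t | cond).

    Assumption: straight-line path a_t = (1 - t) * noise + t * target.
    Its velocity is target - noise, which equals (target - a_t) / (1 - t).
    Here `cond` plays the role of the target; a real action expert would
    predict this velocity with a network conditioned on the tokens.
    """
    return [(c - a) / (1.0 - t) for a, c in zip(action, cond)]


def sample_action(cond, steps=10, seed=0):
    """Integrate da/dt = v(a, t | cond) from t=0 to t=1 with Euler steps."""
    rng = random.Random(seed)
    # Start from Gaussian noise with the same dimensionality as the action.
    action = [rng.gauss(0.0, 1.0) for _ in cond]
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1, so 1 - t never reaches zero
        v = velocity(action, t, cond)
        action = [a + dt * vi for a, vi in zip(action, v)]
    return action


if __name__ == "__main__":
    # Hypothetical 2-D conditioning vector, e.g. a normalized target point.
    print(sample_action([0.3, -0.2]))
```

In this toy setting the sampler lands on `cond` itself, which only shows that the integration loop is wired correctly. The point of the decoupling is that the VLM decides what the target is, while the action expert only has to learn how to move toward it.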
Key facts
- AVP stands for Action with Visual Primitives
- The architecture is end-to-end
- VLM infers next-stage target and emits visual-primitive tokens
- Flow-matching action expert is conditioned on these tokens
- Supervision derived from end-effector kinematics
- Real-robot experiments on general pick-and-place tasks
- arXiv paper ID: 2605.22183
- Announce type: cross
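The facts above say only that supervision is derived from end-effector kinematics, not how. One plausible reading, sketched below under explicit assumptions, is to label each frame with the image location the end effector reaches at the end of the current stage. The stage boundaries (gripper open/close flips), the pinhole camera model, and every function name here are assumptions made for illustration, not details from the source.

```python
def stage_end_indices(gripper_open):
    """Frame indices where the gripper state flips (assumed stage boundary)."""
    return [
        i for i in range(1, len(gripper_open))
        if gripper_open[i] != gripper_open[i - 1]
    ]


def project_point(p_cam, fx, fy, cx, cy):
    """Pinhole projection of a 3-D camera-frame point to pixel coordinates."""
    x, y, z = p_cam
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return (fx * x / z + cx, fy * y / z + cy)


def next_stage_targets(ee_positions_cam, gripper_open, fx, fy, cx, cy):
    """Label each frame with the pixel the end effector occupies at the
    next gripper flip; frames at or after the last flip get None."""
    ends = stage_end_indices(gripper_open)
    labels = []
    j = 0
    for i in range(len(ee_positions_cam)):
        # Skip boundaries that are not strictly in the future of frame i.
        while j < len(ends) and ends[j] <= i:
            j += 1
        if j < len(ends):
            target = ee_positions_cam[ends[j]]
            labels.append(project_point(target, fx, fy, cx, cy))
        else:
            labels.append(None)
    return labels
```

Labels of this kind need no manual annotation, since they come from the recorded trajectory and camera calibration alone.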
Entities
—