AVP: Visual Primitives for Robotic Manipulation

other · 2026-05-23

AVP (Action with Visual Primitives) is a proposed end-to-end architecture for robotic manipulation. It decouples vision-language understanding from motor control: a vision-language model (VLM) infers the next-stage target and emits visual-primitive tokens, and a flow-matching action expert is conditioned on those tokens. The design aims to improve learning efficiency and generalization over architectures that entangle the two. The authors report real-robot experiments on general pick-and-place tasks.
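The conditioning step can be illustrated with a toy sampler. This is a minimal sketch under stated assumptions, not the paper's implementation: the visual primitive is assumed to be a short list of floats (for example, a target point), `toy_velocity` is a hypothetical closed-form stand-in for the learned action expert, and the function names, step count, and seed are illustrative. It shows only the general flow-matching inference pattern, which starts from Gaussian noise and integrates a primitive-conditioned velocity field from t = 0 to t = 1 with Euler steps.

```python
import random


def toy_velocity(x, t, primitive):
    """Hypothetical stand-in for a learned action expert.

    Assumption: the desired action equals the primitive itself, so the
    velocity points from the current sample x toward the primitive.
    Dividing by (1 - t) gives the straight-line flow that reaches it at t = 1.
    """
    return [(p - xi) / (1.0 - t) for xi, p in zip(x, primitive)]


def sample_action(primitive, velocity_fn=toy_velocity, steps=10, seed=0):
    """Draw one action by Euler-integrating a conditioned velocity field."""
    rng = random.Random(seed)
    # Start from Gaussian noise with the same dimension as the primitive.
    x = [rng.gauss(0.0, 1.0) for _ in primitive]
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1, so (1 - t) is never zero
        v = velocity_fn(x, t, primitive)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Illustrative primitive: a made-up 3-D target, not data from the paper.
action = sample_action([0.3, -0.2, 0.5])
```

With this toy field the sample ends at the primitive itself; a trained action expert would instead map the tokens and noise to robot actions. The point of the sketch is the interface: the action sampler sees only the primitive, not the raw image or instruction.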

Key facts

AVP stands for Action with Visual Primitives
The architecture is end-to-end
The VLM infers the next-stage target and emits visual-primitive tokens
A flow-matching action expert is conditioned on these tokens
Supervision is derived from end-effector kinematics
Real-robot experiments cover general pick-and-place
arXiv paper ID: 2605.22183
arXiv announce type: cross (cross-listed)
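The summary does not state how the flow-matching action expert is trained. A common choice for flow-matching models is the conditional flow-matching loss with a straight-line path between noise and the demonstrated action, and the sketch below assumes that choice. All names (`cfm_loss`, `oracle_velocity`, `zero_velocity`), the toy data, and the `t_max` cap on the sampled time are hypothetical and chosen only to keep the example small and numerically stable.

```python
import random


def cfm_loss(velocity_fn, actions, primitives, seed=0, t_max=0.99):
    """Mean squared conditional flow-matching error over a small batch.

    For each demonstrated action x1 and its conditioning primitive:
      x0 ~ N(0, I), t ~ U[0, t_max], x_t = (1 - t) * x0 + t * x1,
      regression target u = x1 - x0 (velocity of the straight-line path).
    """
    rng = random.Random(seed)
    total, count = 0.0, 0
    for x1, cond in zip(actions, primitives):
        x0 = [rng.gauss(0.0, 1.0) for _ in x1]
        t = rng.uniform(0.0, t_max)
        xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
        target = [b - a for a, b in zip(x0, x1)]
        pred = velocity_fn(xt, t, cond)
        total += sum((p - u) ** 2 for p, u in zip(pred, target))
        count += len(x1)
    return total / count


def oracle_velocity(x, t, cond):
    # Toy assumption: the action equals the primitive, so the exact
    # straight-line velocity is known in closed form.
    return [(c - xi) / (1.0 - t) for xi, c in zip(x, cond)]


def zero_velocity(x, t, cond):
    # Baseline that predicts no motion at all.
    return [0.0 for _ in x]


# Made-up demonstrations; the primitives are set equal to the actions.
demo_actions = [[0.3, -0.2], [0.1, 0.4]]
loss_oracle = cfm_loss(oracle_velocity, demo_actions, demo_actions)
loss_zero = cfm_loss(zero_velocity, demo_actions, demo_actions)
```

In this toy setup the oracle field drives the loss to roughly zero while the zero-velocity baseline does not. A learned network would take the place of `oracle_velocity` and be fitted by minimizing the same quantity.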

Entities

—

Sources

arXiv cs.AI — 2026-05-23