ARTFEED — Contemporary Art Intelligence

AVP: Visual Primitives for Robotic Manipulation

other · 2026-05-23

A new arXiv paper proposes AVP (Action with Visual Primitives), an architecture for robotic manipulation. It decouples vision-language understanding from motor control: the vision-language model (VLM) emits visual-primitive tokens, and those tokens condition a flow-matching action expert. The stated goal is better learning efficiency and generalization than architectures that entangle the two. The paper reports real-robot experiments on pick-and-place tasks.

Key facts

  • AVP stands for Action with Visual Primitives
  • The architecture is end-to-end
  • VLM infers next-stage target and emits visual-primitive tokens
  • Flow-matching action expert is conditioned on these tokens
  • Supervision derived from end-effector kinematics
  • Real-robot experiments on general pick-and-place
  • arXiv paper ID: 2605.22183
  • arXiv announce type: cross (cross-listed)
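
To make the conditioning idea concrete, here is a minimal toy sketch of how an action could be sampled from a flow-matching model conditioned on visual-primitive tokens. It is not the paper's implementation. The token format, the linear map W, and the function names toy_velocity and sample_action are all hypothetical, and the closed-form velocity field stands in for a learned network.

```python
import numpy as np


def toy_velocity(x, t, primitive_tokens, W):
    """Hypothetical stand-in for a learned action expert.

    For a straight-line flow-matching path that ends at a single target,
    the ideal velocity at time t is (target - x) / (1 - t).
    """
    target = primitive_tokens @ W  # action implied by the tokens (assumed mapping)
    return (target - x) / (1.0 - t)


def sample_action(primitive_tokens, W, steps=10, seed=0):
    """Integrate the velocity field from Gaussian noise (t=0) toward t=1."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(W.shape[1])  # start from noise in action space
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays below 1, so (1 - t) is never zero
        x = x + dt * toy_velocity(x, t, primitive_tokens, W)  # Euler step
    return x


# Assumed example: a 2-D token vector mapped to a 3-D action.
tokens = np.array([0.2, 0.5])
W = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.5]])
action = sample_action(tokens, W)
```

In this sketch the tokens are the only channel through which perception reaches the sampler, which mirrors the decoupling described above. A real action expert would replace toy_velocity with a trained network.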
