MCF-Proto: Motion-Centric Action Frames for VLA Models

other · 2026-05-13

MCF-Proto is a lightweight action head for Vision-Language-Action (VLA) models that combines a Motion-Centric Action Frame (MCF) with prototype-based action parameterization. Rather than predicting actions directly in a fixed world coordinate frame, the policy predicts a rotation R_t in SO(3) at each step, composes the action from prototypes in the rotated local frame, and maps the result back to the world frame. The whole pipeline is trained end-to-end from standard demonstrations, with no auxiliary supervision. A stable structure emerges from this design: the learned local frames develop axes that align closely with end-effector motion, even though no explicit directional labels are provided. The approach targets the homogeneity of current VLA action heads.
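The three steps described above (predict R_t, compose a local action from prototypes, map it to the world frame) can be illustrated with a minimal forward-pass sketch. This is not the paper's implementation. The summary does not say how R_t is parameterized or how prototypes are combined, so the sketch assumes a two-vector Gram-Schmidt construction for the rotation and a softmax-weighted sum over fixed prototype vectors, and it covers only a 3-D translational action. All function names are hypothetical.

```python
import math

def normalize(v):
    # Scale a 3-vector to unit length.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cross(a, b):
    # Right-handed cross product of two 3-vectors.
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def rotation_from_two_vectors(a, b):
    # ASSUMPTION: R_t is built by Gram-Schmidt from two raw 3-vectors
    # (a stand-in for whatever the action head actually outputs).
    e1 = normalize(a)
    d = sum(x * y for x, y in zip(e1, b))
    e2 = normalize([y - d * x for x, y in zip(e1, b)])
    e3 = cross(e1, e2)
    # The columns of R are the local axes expressed in world coordinates.
    return [[e1[i], e2[i], e3[i]] for i in range(3)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def compose_local_action(logits, prototypes):
    # ASSUMPTION: the local action is a softmax-weighted sum of prototypes.
    w = softmax(logits)
    return [sum(wk * p[i] for wk, p in zip(w, prototypes)) for i in range(3)]

def to_world(R, a_local):
    # Map the local-frame action back to the world frame: a_world = R @ a_local.
    return [sum(R[i][j] * a_local[j] for j in range(3)) for i in range(3)]

# Hypothetical example: the local x-axis points along world +y.
prototypes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
R_t = rotation_from_two_vectors([0.0, 1.0, 0.0], [-1.0, 0.0, 0.0])
a_local = compose_local_action([20.0, 0.0, 0.0], prototypes)
a_world = to_world(R_t, a_local)
```

In this example the logits select the first prototype almost exclusively, so the local action is close to the local x-axis and the world-frame action points along world +y. For end-to-end training, the same operations would be written in an autodiff framework so that gradients from the demonstration loss reach both the rotation and the prototype weights.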

Key facts

MCF-Proto is a lightweight action head for VLA models.
It uses a Motion-Centric Action Frame (MCF) and prototype-based action parameterization.
The policy predicts a rotation R_t in SO(3) at each step.
Actions are composed in the transformed local frame from prototypes.
Training is end-to-end using only standard demonstrations without auxiliary supervision.
Learned local frames develop stable geometric structure compatible with end-effector motion.
No explicit directional labels are needed for this emergent structure.
The approach addresses homogeneity in current VLA action heads.

Entities

—

Sources

arXiv cs.AI — 2026-05-13