SUGAR: A Scalable Framework for Humanoid Loco-Manipulation from Human Videos

ai-technology · 2026-05-22

Researchers have introduced SUGAR, a framework that converts diverse human videos into deployable humanoid loco-manipulation skills without task-specific reward engineering and without reference-motion conditioning at inference. It targets the problem of getting humanoid robots to perform generalizable whole-body loco-manipulation in real-world settings. Existing methods depend on laborious reward engineering, rigid motion replay, or costly teleoperation, none of which scale well. Human videos capture a wide range of behaviors, but the motion priors extracted from them are degraded by occlusion, contact artifacts, and retargeting errors, which makes them unsuitable for direct policy learning. SUGAR proceeds in three stages, the first of which is an automated pipeline that extracts kinematic interaction priors, including human-object motion trajectories. The goal is scalable learning from the large volume of human video data available. The paper is available on arXiv under identifier 2605.20373.

Key facts

SUGAR is a scalable data-driven framework for humanoid loco-manipulation.
It converts diverse human videos into deployable skills without task-specific reward engineering.
No reference-motion conditioning is needed at inference.
Existing methods rely on laborious reward engineering, rigid motion replay, or costly teleoperation.
Motion priors from human videos suffer from occlusion, contact artifacts, and retargeting errors.
SUGAR proceeds in three stages, starting with extracting kinematic interaction priors.
The framework enables scalable learning from abundant human video data.
The paper is published on arXiv under identifier 2605.20373.
