Hybrid Representation Learning for Robotic Manipulation

other · 2026-05-22

A new pretraining framework for robotic manipulation learns hybrid structural latent points by inserting a point-wise latent variational autoencoder into a point-cloud autoencoder's latent space. This approach jointly regularizes point-wise features and coordinates toward a Gaussian prior, producing a compact latent that captures coarse structural tendencies without precise geometry. The method combines the expressiveness of implicit neural fields with explicit geometric cues, addressing limitations of both fully implicit and fully explicit representations in 3D-aware pretraining. The framework aims to improve visual representations for embodied perception and manipulation tasks.
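The regularization described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation. It assumes a diagonal Gaussian posterior per point, a standard normal prior, and a hypothetical weight `beta`; the function names, array shapes, and the choice to concatenate feature and coordinate channels are illustrative assumptions, not details from the source.

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    """Per-point KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over channels."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

def pointwise_latent_kl(feat_mu, feat_logvar, xyz_mu, xyz_logvar, beta=1.0):
    """Joint KL penalty over point-wise features and coordinates.

    Assumption: features and coordinates are concatenated per point so that
    both are pulled toward the same Gaussian prior. `beta` is a hypothetical
    weighting term, not a value taken from the paper.
    """
    mu = np.concatenate([feat_mu, xyz_mu], axis=-1)
    logvar = np.concatenate([feat_logvar, xyz_logvar], axis=-1)
    return beta * kl_to_standard_normal(mu, logvar).mean()

# Toy example: 64 latent points, 16 feature channels plus 3 coordinates.
rng = np.random.default_rng(0)
num_points, feat_dim = 64, 16
feat_mu = rng.standard_normal((num_points, feat_dim))
feat_logvar = np.zeros((num_points, feat_dim))
xyz_mu = rng.standard_normal((num_points, 3))
xyz_logvar = np.zeros((num_points, 3))

kl_loss = pointwise_latent_kl(feat_mu, feat_logvar, xyz_mu, xyz_logvar)
z = reparameterize(
    np.concatenate([feat_mu, xyz_mu], axis=-1),
    np.concatenate([feat_logvar, xyz_logvar], axis=-1),
    rng,
)
```

In a full pipeline the means and log-variances would come from the point-cloud encoder, and the sampled `z` would be passed to the decoder. The KL term is what discourages the latent coordinates from storing exact geometry.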

Key facts

The framework learns hybrid structural latent points.
It inserts a point-wise latent VAE into a point-cloud autoencoder's latent space.
Point-wise features and coordinates are regularized toward a Gaussian prior.
The resulting latent captures coarse structural tendencies and rough shape information.
It does not encode precise geometry.
The method addresses limitations of fully implicit and fully explicit representations.
It is designed for embodied perception and manipulation tasks.
The paper is available on arXiv with ID 2605.21258.

Sources

arXiv cs.AI — 2026-05-21