CLAMP: 3D Pretraining Framework for Robotic Manipulation
Researchers have developed CLAMP (Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining), a framework that learns 3D representations for robotic manipulation from point clouds and robot actions. Unlike conventional methods that rely on 2D images, CLAMP builds point clouds from RGB-D images and camera extrinsics, then re-renders them into multi-view, four-channel images that combine depth with 3D coordinates, including dynamic wrist-camera viewpoints. These wrist views keep target objects visible, which supports high-precision tasks. Through contrastive learning, the pre-trained encoders link 3D geometric and positional features to robot action patterns. The work is available on arXiv (2602.00937v2).
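The re-rendering step described above can be sketched roughly as follows: project a world-frame point cloud through camera extrinsics and intrinsics into an image whose four channels hold the 3D coordinates and camera-frame depth of the nearest point per pixel. This is a minimal illustrative sketch using a standard pinhole model and a simple z-buffer, not CLAMP's actual renderer; all function and parameter names here are hypothetical.

```python
import numpy as np

def render_four_channel(points_world, extrinsic, intrinsic, h, w):
    """Project a world-frame point cloud into an (h, w, 4) image.

    Channels 0-2: world (x, y, z) of the nearest point hitting the pixel.
    Channel 3: that point's camera-frame depth.
    Hypothetical sketch; CLAMP's renderer may differ in detail.
    """
    # Transform points into the camera frame using the 4x4 extrinsic.
    R, t = extrinsic[:3, :3], extrinsic[:3, 3]
    cam = points_world @ R.T + t
    z = cam[:, 2]
    valid = z > 1e-6                       # discard points behind the camera
    cam, pts, z = cam[valid], points_world[valid], z[valid]

    # Pinhole projection to integer pixel coordinates.
    u = (intrinsic[0, 0] * cam[:, 0] / z + intrinsic[0, 2]).astype(int)
    v = (intrinsic[1, 1] * cam[:, 1] / z + intrinsic[1, 2]).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, pts, z = u[inside], v[inside], pts[inside], z[inside]

    # Z-buffer: keep only the nearest point per pixel.
    img = np.zeros((h, w, 4), dtype=np.float32)
    zbuf = np.full((h, w), np.inf, dtype=np.float32)
    for i in range(len(z)):
        if z[i] < zbuf[v[i], u[i]]:
            zbuf[v[i], u[i]] = z[i]
            img[v[i], u[i], :3] = pts[i]
            img[v[i], u[i], 3] = z[i]
    return img
```

Running this for each camera pose (including a moving wrist camera) would yield the multi-view, four-channel inputs the framework describes.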
Key facts
- CLAMP is a 3D pre-training framework for robotic manipulation.
- It uses point clouds and robot actions.
- It re-renders multi-view four-channel images with depth and 3D coordinates.
- Dynamic wrist views are included for high-precision tasks.
- Pre-trained encoders use contrastive learning.
- Published on arXiv with ID 2602.00937v2.
- Addresses limitations of 2D image representations.
- Captures 3D spatial information about objects and scenes.
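The contrastive objective that links visual and action features can be illustrated with a symmetric InfoNCE-style loss, where embeddings from the same timestep form positive pairs and all other pairs in the batch are negatives. This is a generic sketch of the technique, not CLAMP's exact objective; the function name and temperature value are assumptions.

```python
import numpy as np

def info_nce(vis_emb, act_emb, temperature=0.1):
    """Symmetric InfoNCE loss over paired visual and action embeddings.

    Row i of each (N, D) matrix is assumed to come from the same
    timestep, so the diagonal of the similarity matrix holds positives.
    Illustrative sketch only.
    """
    # L2-normalize so dot products become cosine similarities.
    v = vis_emb / np.linalg.norm(vis_emb, axis=1, keepdims=True)
    a = act_emb / np.linalg.norm(act_emb, axis=1, keepdims=True)
    logits = v @ a.T / temperature            # (N, N) similarity matrix

    def xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)          # numerical stability
        log_p = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_p))                  # diagonal = positives

    # Average the vision-to-action and action-to-vision directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing this loss pulls each 3D visual embedding toward the action embedding from its own timestep while pushing it away from the others, which is how the pre-trained encoders learn the geometry-to-action correspondence described above.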