M2R2: A Multimodal Feature Extractor for Robotic Action Segmentation

other · 2026-04-30

Researchers propose M2R2, a multimodal feature extractor for temporal action segmentation (TAS) in robotics. The model integrates proprioceptive and exteroceptive sensor data. It targets two limitations of existing approaches: multimodal TAS models fuse features inside the model itself, which makes the learned features hard to reuse, and vision-only extractors struggle when object visibility is limited. A novel training strategy enables reuse of the learned features across different models. The work sits at the intersection of robotics and computer vision, where TAS is key for skill boundary detection. The paper is available on arXiv.
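The reuse argument can be illustrated with a minimal sketch. This is not M2R2's architecture or training strategy, which this summary does not detail. It only shows the general pattern of running a standalone extractor once and feeding its output to several downstream segmentation models. All names (`extract_features`, `model_a`, `model_b`), the concatenation step, and the toy data are assumptions made for illustration.

```python
from typing import List, Sequence

Frame = List[float]


def extract_features(proprio: Sequence[Frame], extero: Sequence[Frame]) -> List[Frame]:
    """Hypothetical stand-in for a multimodal extractor.

    It simply concatenates time-aligned per-frame vectors; a learned
    extractor would replace this step.
    """
    if len(proprio) != len(extero):
        raise ValueError("modalities must be time-aligned")
    return [list(p) + list(e) for p, e in zip(proprio, extero)]


def model_a(features: Sequence[Frame]) -> List[str]:
    # Toy downstream model 1: labels frames from the proprioceptive channel
    # (assumed here to be gripper opening, index 0).
    return ["reach" if f[0] >= 0.5 else "grasp" for f in features]


def model_b(features: Sequence[Frame]) -> List[str]:
    # Toy downstream model 2: labels frames from the exteroceptive channel
    # (assumed here to be a contact signal, index 1).
    return ["grasp" if f[1] > 0.5 else "reach" for f in features]


def boundaries(labels: Sequence[str]) -> List[int]:
    # A skill boundary is any frame whose label differs from the previous frame.
    return [t for t in range(1, len(labels)) if labels[t] != labels[t - 1]]


# Toy, made-up sensor streams: four time-aligned frames per modality.
proprio = [[0.9], [0.8], [0.2], [0.1]]
extero = [[0.0], [0.0], [1.0], [1.0]]

# Features are computed once and shared by both downstream models.
features = extract_features(proprio, extero)
boundaries_a = boundaries(model_a(features))
boundaries_b = boundaries(model_b(features))
```

Because the extractor is separate from the segmentation models, swapping `model_a` for `model_b` requires no change to how the features are produced, which is the kind of reuse that in-model fusion prevents.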

Key facts

M2R2 is a multimodal feature extractor for temporal action segmentation.
It combines proprioceptive and exteroceptive sensor information.
Existing multimodal TAS models fuse features within the model, limiting reuse.
Vision-only extractors struggle when object visibility is limited.
A novel training strategy enables reuse of learned features.
TAS is a key research area in robotics and computer vision.
The paper is available on arXiv.
The arXiv ID is 2504.18662.
