COMODO Framework Enables Efficient Human Activity Recognition via Video-to-IMU Knowledge Transfer
A new research framework called COMODO addresses the limitations of wearable human activity recognition systems by transferring semantic knowledge from video to inertial measurement units (IMUs) without labeled data. Egocentric video models capture rich information but suffer from high power consumption, privacy issues, and lighting dependence, making continuous on-device recognition impractical. In contrast, IMU sensors are energy-efficient and privacy-preserving but lack large-scale annotated datasets, resulting in weaker generalization. COMODO bridges this gap through cross-modal self-supervised distillation, using a pretrained and frozen video encoder as a teacher to align the feature distributions of video and IMU embeddings. A dynamic instance queue of teacher embeddings serves as a shared reference for this alignment, enabling more efficient activity understanding for human-centered wearable systems. The paper was posted to arXiv as 2503.07259v2 (a cross-listed replacement).
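The abstract does not give implementation details, but queue-based cross-modal distillation of this kind can be sketched as follows. This is a hypothetical NumPy illustration, not the authors' code: the class name `DistributionDistiller`, the temperature value, and the KL-based objective are assumptions; the general pattern (frozen teacher, FIFO queue of teacher embeddings, student trained to match the teacher's similarity distribution over the queue) follows the mechanism the summary describes.

```python
import numpy as np

def softmax(x, tau=0.07):
    """Temperature-scaled softmax over the last axis (tau is an assumed value)."""
    x = x / tau
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

class DistributionDistiller:
    """Hypothetical sketch of COMODO-style cross-modal distillation.

    A frozen video (teacher) encoder and an IMU (student) encoder each
    produce L2-normalized embeddings. Similarities against a FIFO queue
    of past teacher embeddings define two probability distributions; the
    student is trained to make its distribution match the teacher's.
    """

    def __init__(self, dim, queue_size, tau=0.07, seed=0):
        rng = np.random.default_rng(seed)
        q = rng.standard_normal((queue_size, dim))
        # Queue entries are kept unit-norm so dot products are cosine similarities.
        self.queue = q / np.linalg.norm(q, axis=1, keepdims=True)
        self.tau = tau
        self.ptr = 0

    def loss(self, z_video, z_imu):
        """KL(teacher || student) over queue-similarity distributions, batch mean."""
        p_teacher = softmax(z_video @ self.queue.T, self.tau)
        p_student = softmax(z_imu @ self.queue.T, self.tau)
        kl = (p_teacher * (np.log(p_teacher + 1e-9)
                           - np.log(p_student + 1e-9))).sum(axis=1)
        return kl.mean()

    def enqueue(self, z_video):
        """FIFO update: newest teacher embeddings overwrite the oldest entries."""
        n = z_video.shape[0]
        idx = (self.ptr + np.arange(n)) % self.queue.shape[0]
        self.queue[idx] = z_video
        self.ptr = (self.ptr + n) % self.queue.shape[0]
```

Because the loss compares distributions rather than individual embeddings, the IMU student is only asked to reproduce the teacher's *relative* similarity structure, which is what makes the transfer label-free: if the student's embeddings induce the same distribution as the teacher's, the loss is zero.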
Key facts
- COMODO is a cross-modal self-supervised distillation framework
- Transfers semantic knowledge from video to IMU without requiring labels
- Addresses trade-off between video-based models and IMU sensors for human activity recognition
- Video models have high power consumption, privacy concerns, and lighting dependence
- IMU sensors are energy-efficient and privacy-preserving but lack large annotated datasets
- Uses pretrained frozen video encoder to align feature distributions
- Constructs dynamic instance queue for video-IMU embedding alignment
- Research published on arXiv with identifier 2503.07259v2