COMODO Framework Enables Efficient Human Activity Recognition via Video-to-IMU Knowledge Transfer
A new research framework called COMODO addresses the limitations of wearable human activity recognition systems by transferring semantic knowledge from video to inertial measurement units (IMUs) without labeled data. Egocentric video models capture rich information but suffer from high power consumption, privacy issues, and lighting dependence, making continuous on-device recognition impractical. In contrast, IMU sensors are energy-efficient and privacy-preserving but lack large-scale annotated datasets, resulting in weaker generalization. COMODO bridges this gap through cross-modal self-supervised distillation, using a pretrained and frozen video encoder as a teacher to align the feature distributions of video and IMU embeddings. A dynamic instance queue of teacher embeddings serves as a shared reference for this alignment, enabling more efficient activity understanding for human-centered wearable systems. The paper was posted to arXiv as 2503.07259v2 (a cross-listed replacement).
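The abstract does not give implementation details, but queue-based cross-modal distillation of this kind can be sketched as follows. This is a hypothetical NumPy illustration, not the authors' code: the class name `DistributionDistiller`, the temperature value, and the KL-based objective are assumptions; the general pattern (frozen teacher, FIFO queue of teacher embeddings, student trained to match the teacher's similarity distribution over the queue) follows the mechanism the summary describes.

```python
import numpy as np

def softmax(x, tau=0.07):
    """Temperature-scaled softmax over the last axis (tau is an assumed value)."""
    x = x / tau
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

class DistributionDistiller:
    """Hypothetical sketch of COMODO-style cross-modal distillation.

    A frozen video (teacher) encoder and an IMU (student) encoder each
    produce L2-normalized embeddings. Similarities against a FIFO queue
    of past teacher embeddings define two probability distributions; the
    student is trained to make its distribution match the teacher's.
    """

    def __init__(self, dim, queue_size, tau=0.07, seed=0):
        rng = np.random.default_rng(seed)
        q = rng.standard_normal((queue_size, dim))
        # Queue entries are kept unit-norm so dot products are cosine similarities.
        self.queue = q / np.linalg.norm(q, axis=1, keepdims=True)
        self.tau = tau
        self.ptr = 0

    def loss(self, z_video, z_imu):
        """KL(teacher || student) over queue-similarity distributions, batch mean."""
        p_teacher = softmax(z_video @ self.queue.T, self.tau)
        p_student = softmax(z_imu @ self.queue.T, self.tau)
        kl = (p_teacher * (np.log(p_teacher + 1e-9)
                           - np.log(p_student + 1e-9))).sum(axis=1)
        return kl.mean()

    def enqueue(self, z_video):
        """FIFO update: newest teacher embeddings overwrite the oldest entries."""
        n = z_video.shape[0]
        idx = (self.ptr + np.arange(n)) % self.queue.shape[0]
        self.queue[idx] = z_video
        self.ptr = (self.ptr + n) % self.queue.shape[0]
```

Because the loss compares distributions rather than individual embeddings, the IMU student is only asked to reproduce the teacher's *relative* similarity structure, which is what makes the transfer label-free: if the student's embeddings induce the same distribution as the teacher's, the loss is zero.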
Key facts
- COMODO is a cross-modal self-supervised distillation framework
- Transfers semantic knowledge from video to IMU without requiring labels
- Addresses trade-off between video-based models and IMU sensors for human activity recognition
- Video models have high power consumption, privacy concerns, and lighting dependence
- IMU sensors are energy-efficient and privacy-preserving but lack large annotated datasets
- Uses pretrained frozen video encoder to align feature distributions
- Constructs dynamic instance queue for video-IMU embedding alignment
- Research published on arXiv with identifier 2503.07259v2