REALM: Cross-Modal Framework Aligns Event Camera Data with RGB Foundation Models
A research paper introduces REALM (RGB and Event Aligned Latent Manifold), a cross-modal framework that aligns event camera data with RGB foundation models using low-rank adaptation (LoRA). Event cameras offer high temporal resolution, low latency, and robustness to extreme lighting, but existing learning approaches for event data are task-specific and do not generalize across modalities. REALM projects event representations into the pretrained latent space of ViT-based RGB backbones, letting asynchronous event streams reuse the frozen RGB models' geometric and semantic priors for downstream tasks such as depth estimation and semantic segmentation without task-specific training. Published on arXiv (2605.00271).
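The paper's exact architecture is not reproduced in this summary; the PyTorch sketch below is only a hedged illustration of the general pattern described: rasterize events into a voxel grid, embed it into ViT patch tokens, and pass the tokens through a frozen transformer whose MLP projections carry trainable LoRA updates. All class names, dimensions, and the choice to adapt only the MLP layers are illustrative assumptions, and the randomly initialized blocks stand in for a pretrained RGB backbone.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B(A(x)); only A and B receive gradients."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # pretrained weights stay frozen
            p.requires_grad = False
        self.A = nn.Linear(base.in_features, rank, bias=False)   # down-projection
        self.B = nn.Linear(rank, base.out_features, bias=False)  # up-projection
        nn.init.zeros_(self.B.weight)  # zero update at init: output matches the frozen model
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.B(self.A(x))


class EventPatchEmbed(nn.Module):
    """Turns an event voxel grid (time bins as channels) into ViT patch tokens.
    This part is new and fully trainable, since RGB models have no event input."""

    def __init__(self, bins: int = 5, dim: int = 384, patch: int = 16):
        super().__init__()
        self.proj = nn.Conv2d(bins, dim, kernel_size=patch, stride=patch)

    def forward(self, vox: torch.Tensor) -> torch.Tensor:  # vox: (N, bins, H, W)
        return self.proj(vox).flatten(2).transpose(1, 2)   # (N, num_patches, dim)


class Block(nn.Module):
    """One ViT encoder block: frozen attention, LoRA-adapted MLP projections."""

    def __init__(self, dim: int = 384, heads: int = 6, rank: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        for p in self.attn.parameters():  # attention weights stay frozen too
            p.requires_grad = False
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            LoRALinear(nn.Linear(dim, 4 * dim), rank),
            nn.GELU(),
            LoRALinear(nn.Linear(4 * dim, dim), rank),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        return x + self.mlp(self.norm2(x))


class EventToRGBLatent(nn.Module):
    """Event encoder aligned to a (stand-in) frozen RGB ViT latent space.
    In practice the blocks would be loaded from a pretrained RGB backbone
    before freezing; random weights here keep the sketch self-contained."""

    def __init__(self, depth: int = 2):
        super().__init__()
        self.embed = EventPatchEmbed()
        self.blocks = nn.ModuleList(Block() for _ in range(depth))

    def forward(self, vox: torch.Tensor) -> torch.Tensor:
        x = self.embed(vox)
        for blk in self.blocks:
            x = blk(x)
        return x  # tokens in the shared latent space, ready for task heads


vox = torch.randn(2, 5, 224, 224)      # two fake event voxel grids
print(EventToRGBLatent()(vox).shape)   # torch.Size([2, 196, 384])
```

Training only the patch embedding and the LoRA factors leaves the backbone's weights, and hence its geometric and semantic priors, untouched; that is what would let RGB-trained task heads read the resulting event tokens.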
Key facts
- REALM stands for RGB and Event Aligned Latent Manifold.
- Event cameras provide high temporal resolution, low latency, and robustness to extreme lighting.
- Existing event processing approaches are task-specific and lack cross-modal generalization.
- REALM uses low-rank adaptation (LoRA) to bridge the modality gap between event and RGB data (see the parameter-count sketch after this list).
- The framework projects event representations into the pretrained latent space of ViT-based RGB foundation models.
- Downstream tasks include depth estimation and semantic segmentation.
- The paper is available on arXiv with ID 2605.00271.
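For intuition on why LoRA is a lightweight bridge, here is a back-of-the-envelope comparison using illustrative numbers (a 768-dimensional projection and rank 8; neither figure comes from the paper):

```python
# Hypothetical sizes: d = hidden width of one ViT projection, r = LoRA rank.
d, r = 768, 8
full_finetune = d * d      # 589,824 weights updated if the layer were unfrozen
lora_update = 2 * d * r    # 12,288 trainable weights in the A and B factors
print(full_finetune // lora_update)  # 48: LoRA trains ~2% as many weights per layer
```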