RE-VLM: First Dual-Stream VLM for Event Camera Scene Understanding

ai-technology · 2026-05-20

Researchers have introduced RE-VLM, described as the first dual-stream vision-language model (VLM) to jointly process RGB images and event streams, with the aim of improving scene understanding in challenging conditions such as low light, high dynamic range, or fast motion. Event cameras record per-pixel brightness changes asynchronously, with high temporal resolution and a wide dynamic range, so they preserve motion information in conditions where conventional RGB images degrade. RE-VLM uses parallel RGB and event encoders and a progressive training strategy to align the two kinds of visual features with language. To address the scarcity of RGB-Event-Text supervision, the researchers propose a graph-driven pipeline that converts synchronized RGB-Event streams into verifiable scene graphs, which are then used to generate synthetic data. The paper is available on arXiv (2605.19329) and targets robust scene understanding in both normal and challenging conditions.
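To make the idea of "per-pixel brightness changes" concrete, here is a minimal sketch of the commonly used event-generation model, in which a pixel emits an event whenever its log intensity changes by more than a contrast threshold. This is an illustration only, not code from the RE-VLM paper: the function name `generate_events`, the threshold value, and the frame-based approximation are all assumptions (a real sensor fires asynchronously rather than comparing discrete frames).

```python
import math

def generate_events(frames, timestamps, threshold=0.2, eps=1e-3):
    """Approximate an event stream from grayscale frames (hypothetical helper).

    frames: list of 2D lists with intensities in [0, 1].
    timestamps: one timestamp per frame.
    Returns a list of (t, x, y, polarity) tuples, polarity in {+1, -1}.
    """
    # Per-pixel reference log intensity, initialised from the first frame.
    ref = [[math.log(v + eps) for v in row] for row in frames[0]]
    events = []
    for frame, t in zip(frames[1:], timestamps[1:]):
        for y, row in enumerate(frame):
            for x, v in enumerate(row):
                log_v = math.log(v + eps)
                diff = log_v - ref[y][x]
                # Simplification: at most one event per pixel per frame,
                # after which the reference resets to the current value.
                if abs(diff) >= threshold:
                    events.append((t, x, y, 1 if diff > 0 else -1))
                    ref[y][x] = log_v
    return events

# A 1x2 image: the left pixel is static, the right one brightens then darkens.
frames = [[[0.5, 0.5]], [[0.5, 1.0]], [[0.5, 0.2]]]
events = generate_events(frames, timestamps=[0.0, 0.001, 0.002])
```

In this toy input the static pixel produces no events, while the changing pixel produces one positive and one negative event. That sparsity is why event data keeps motion cues without depending on absolute exposure.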

Key facts

RE-VLM is the first dual-stream vision-language model combining RGB and event streams.
Event cameras record per-pixel brightness changes asynchronously with high temporal resolution and wide dynamic range.
Standard RGB images degrade under adverse conditions like low light, high dynamic range, or fast motion.
RE-VLM uses parallel RGB and event encoders with a progressive training strategy.
A graph-driven pipeline converts synchronized RGB-Event streams into verifiable scene graphs.
The pipeline addresses the scarcity of RGB-Event-Text supervision.
The paper is published on arXiv with identifier 2605.19329.
The model targets robust scene understanding across both normal and challenging conditions.
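As a rough illustration of how a scene graph can yield checkable text supervision, the sketch below turns subject-predicate-object relations into templated question-answer pairs. It is a hypothetical example under stated assumptions, not the pipeline from the paper: the `Relation` class, the `graph_to_qa` function, and the question templates are invented here for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Relation:
    """One edge of a scene graph (hypothetical structure)."""
    subject: str
    predicate: str
    obj: str

def graph_to_qa(relations):
    """Turn each relation into two templated question-answer pairs.

    Every answer is read directly off the graph, so each pair can be
    checked against the relation that produced it.
    """
    qa_pairs = []
    for r in relations:
        # Open question whose answer is the object node.
        qa_pairs.append((f"What is the {r.subject} {r.predicate}?", r.obj))
        # Yes/no question that restates the full relation.
        qa_pairs.append(
            (f"Is the {r.subject} {r.predicate} the {r.obj}?", "yes")
        )
    return qa_pairs

graph = [Relation("car", "moving toward", "pedestrian")]
for question, answer in graph_to_qa(graph):
    print(question, "->", answer)
```

Because each answer is derived from an explicit graph edge rather than free-form generation, a wrong pair can be detected by comparing it with the graph, which is one plausible reading of "verifiable" here.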
