Response-G1: Explicit Scene Graph Modeling for Proactive Video Understanding

digital · 2026-05-11

A framework named Response-G1 has been proposed for proactive understanding of streaming video, in which a Video-LLM must decide when to respond as the video unfolds. Described in a paper on arXiv (2605.07575), the approach uses scene graphs to align accumulated video evidence with query conditions explicitly and structurally. It runs in three stages, none of which requires fine-tuning: query-guided scene graph generation from streaming clips, retrieval of relevant historical scene graphs from memory, and retrieval-augmented trigger prompting that makes a silence-or-response decision for each frame. By grounding both evidence and conditions in a shared graph representation, Response-G1 aims to make response timing more precise and interpretable. According to the paper, experiments on standard benchmarks show advantages over existing implicit, query-agnostic approaches.
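The three stages can be illustrated with a toy sketch. This is not the paper's implementation: it assumes scene graphs are sets of (subject, relation, object) triples, that retrieval ranks stored graphs by triple overlap with the query, and that a response fires once every query condition is covered by retrieved evidence. The names `SceneGraphMemory`, `retrieve`, and `should_respond` are hypothetical, and the hand-written per-clip graphs stand in for what a Video-LLM would generate.

```python
from dataclasses import dataclass, field

# Assumption: a scene graph is a set of (subject, relation, object) triples.
Triple = tuple[str, str, str]


@dataclass
class SceneGraphMemory:
    """Hypothetical store holding one scene graph per processed clip."""
    entries: list[tuple[int, frozenset[Triple]]] = field(default_factory=list)

    def add(self, clip_index: int, graph: set[Triple]) -> None:
        self.entries.append((clip_index, frozenset(graph)))

    def retrieve(self, query_graph: set[Triple], top_k: int = 2):
        """Rank stored graphs by shared triples with the query (assumed scoring)."""
        scored = [(len(g & query_graph), idx, g) for idx, g in self.entries]
        scored = [s for s in scored if s[0] > 0]
        # Higher overlap first; more recent clips break ties.
        scored.sort(key=lambda s: (-s[0], -s[1]))
        return [(idx, g) for _, idx, g in scored[:top_k]]


def should_respond(query_graph: set[Triple], evidence: list[frozenset[Triple]]) -> bool:
    """Respond only when all query conditions appear in the retrieved evidence."""
    covered = set().union(*evidence)
    return query_graph <= covered


# Query conditions expressed as a graph (illustrative only).
query = {("person", "holds", "cup"), ("cup", "on", "table")}

# Stand-in for stage 1: graphs a model might emit for three streaming clips.
stream = [
    {("person", "near", "table")},
    {("cup", "on", "table")},
    {("person", "holds", "cup"), ("person", "near", "table")},
]

memory = SceneGraphMemory()
decisions = []
for i, graph in enumerate(stream):
    memory.add(i, graph)                      # stage 1: store the new graph
    hits = memory.retrieve(query)             # stage 2: memory retrieval
    evidence = [g for _, g in hits]
    decisions.append("respond" if should_respond(query, evidence) else "silence")

print(decisions)
```

In this sketch the system stays silent for the first two clips and responds on the third, when the retrieved history and the current clip together cover both query conditions. A real system would replace the subset test with a model-driven decision.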

Key facts

Response-G1 is a framework for proactive streaming video understanding.
It uses explicit scene graph modeling for alignment between video evidence and query conditions.
The framework operates in three fine-tuning-free stages.
Stages include online scene graph generation, memory-based retrieval, and trigger prompting.
It enables per-frame silence/response decisions.
The authors report more accurate and more interpretable response timing than existing implicit, query-agnostic methods.
Experiments on standard benchmarks are reported to show advantages over those methods.
The paper is available on arXiv with ID 2605.07575.
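
The trigger-prompting stage listed above could look roughly like the following sketch, which assembles a retrieval-augmented prompt from scene graphs and maps a model reply to a silence or response decision. The prompt wording, the `RESPOND`/`SILENCE` reply convention, and the function names `build_trigger_prompt` and `parse_decision` are assumptions made for illustration; the paper's actual prompts are not reproduced here.

```python
def format_graph(graph) -> str:
    """Render (subject, relation, object) triples as sorted text lines."""
    return "\n".join(f"- {s} -{r}-> {o}" for s, r, o in sorted(graph))


def build_trigger_prompt(query: str, current_graph, retrieved_graphs) -> str:
    """Assemble a hypothetical retrieval-augmented trigger prompt."""
    history = "\n\n".join(format_graph(g) for g in retrieved_graphs) or "(none)"
    return (
        f"User query: {query}\n\n"
        f"Relevant past scene graphs:\n{history}\n\n"
        f"Current scene graph:\n{format_graph(current_graph)}\n\n"
        "Reply with RESPOND if the query conditions are now met, "
        "otherwise reply with SILENCE."
    )


def parse_decision(model_output: str) -> str:
    """Map a raw model reply to 'respond' or 'silence'; default to silence."""
    reply = model_output.strip().upper()
    return "respond" if reply.startswith("RESPOND") else "silence"


prompt = build_trigger_prompt(
    query="Tell me when the person picks up the cup.",
    current_graph={("person", "holds", "cup")},
    retrieved_graphs=[{("cup", "on", "table")}],
)
print(prompt)
print(parse_decision("RESPOND: the person is now holding the cup."))
```

Defaulting to silence on any unrecognized reply is a design choice of this sketch: in a proactive setting, a spurious response is usually more disruptive than a missed frame.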
