Response-G1: Explicit Scene Graph Modeling for Proactive Video Understanding
Response-G1 is a framework for proactive streaming video understanding, in which a Video-LLM must decide when to respond as the video unfolds. Described in a paper on arXiv (2605.07575), it uses scene graphs to make the alignment between accumulated video evidence and query conditions explicit and structured. The framework runs in three stages, none of which requires fine-tuning: query-guided scene graph generation from streaming clips, retrieval of relevant historical scene graphs from memory, and retrieval-augmented trigger prompting that makes a silence-or-response decision for each frame. Grounding both evidence and conditions in a shared graph representation is intended to make response timing more accurate and interpretable. According to the paper, experiments on standard benchmarks show advantages over existing implicit, query-agnostic methods.
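The three-stage loop can be illustrated with a toy sketch. This is not the paper's implementation: the triple-based `SceneGraph`, the entity-overlap retrieval, and the rule-based `decide` function are all assumptions made for illustration, and every name below is hypothetical. In particular, a plain subset check stands in for the Video-LLM call that would consume the retrieval-augmented prompt.

```python
from dataclasses import dataclass

Triple = tuple[str, str, str]  # (subject, relation, object); assumed format


@dataclass(frozen=True)
class SceneGraph:
    timestamp: float
    triples: frozenset  # set of Triple


def entities(graph):
    """All subjects and objects mentioned in a graph."""
    return {e for s, _, o in graph.triples for e in (s, o)}


def generate_scene_graph(timestamp, detections, query_terms):
    """Stage 1 stand-in: keep only triples that mention a query term."""
    kept = frozenset(t for t in detections if query_terms & {t[0], t[2]})
    return SceneGraph(timestamp, kept)


def retrieve(memory, current, k=2):
    """Stage 2 stand-in: rank stored graphs by entities shared with the current one."""
    cur = entities(current)
    scored = [(len(cur & entities(g)), g) for g in memory]
    scored = [p for p in scored if p[0] > 0]
    scored.sort(key=lambda p: (-p[0], -p[1].timestamp))  # most overlap, then most recent
    return [g for _, g in scored[:k]]


def build_prompt(query, current, history):
    """Stage 3 input: text a Video-LLM would receive (not sent anywhere here)."""
    lines = [f"Query: {query}", "Retrieved history:"]
    for g in history:
        lines.append(f"  t={g.timestamp}: {sorted(g.triples)}")
    lines.append(f"Current t={current.timestamp}: {sorted(current.triples)}")
    lines.append("Answer 'respond' or 'silence'.")
    return "\n".join(lines)


def decide(condition, current, history):
    """Stage 3 stand-in: respond once every condition triple has been observed."""
    evidence = set(current.triples)
    for g in history:
        evidence |= g.triples
    return "respond" if condition <= evidence else "silence"


def run_stream(stream, query_terms, condition):
    """Process the stream step by step, storing each graph in memory."""
    memory, decisions = [], []
    for timestamp, detections in stream:
        current = generate_scene_graph(timestamp, detections, query_terms)
        history = retrieve(memory, current)
        decisions.append(decide(condition, current, history))
        memory.append(current)
    return decisions


# Hypothetical query: "Tell me when the person has put the cup on the table."
QUERY_TERMS = {"person", "cup", "table"}
CONDITION = {("person", "picks_up", "cup"), ("cup", "on", "table")}
STREAM = [
    (0.0, [("person", "picks_up", "cup")]),
    (1.0, [("person", "walks_to", "table"), ("dog", "near", "sofa")]),
    (2.0, [("cup", "on", "table")]),
]
```

In this toy stream, `run_stream(STREAM, QUERY_TERMS, CONDITION)` stays silent until the final step, where the retrieved earlier graph supplies the "picks_up" evidence that the current frame lacks. This is the role the memory stage plays in the description above. A real system would pass the output of `build_prompt` to a Video-LLM rather than apply a fixed rule.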
Key facts
- Response-G1 is a framework for proactive streaming video understanding.
- It uses explicit scene graph modeling for alignment between video evidence and query conditions.
- The framework operates in three fine-tuning-free stages.
- Stages include online scene graph generation, memory-based retrieval, and trigger prompting.
- It enables per-frame silence/response decisions.
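- Grounding evidence and conditions in a shared graph representation is intended to make response timing more accurate and interpretable.
- Reported experiments on standard benchmarks show advantages over existing implicit, query-agnostic methods.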
- The paper is available on arXiv with ID 2605.07575.
Entities
Institutions
- arXiv