MOV-Bench and AOP-Agent: Advancing Multi-Hop Audio-Visual Reasoning
MOV-Bench is a new benchmark of 519 curated questions that evaluates multi-hop reasoning over temporally dispersed audio-visual evidence. Evaluation results indicate that current Omni-LLMs struggle with multi-hop cross-modal reasoning. To address this, the authors introduce AOP-Agent, a streamlined agentic framework built on open-source Omni-LLMs for active omni-modal perception. AOP-Agent combines a hierarchical omni-modal memory with a collaborative observe-reflect-replan loop, enabling open-source Omni-LLMs to perform active perception.
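The summary above does not specify how the memory or the loop is implemented, so the following is only a minimal sketch of the general idea, not the authors' method. Every name in it (Segment, HierarchicalMemory, run_agent, toy_plan, toy_reflect) is hypothetical, the two-level memory layout is an assumption, and the plan and reflect steps are hard-coded toy functions standing in for Omni-LLM calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Segment:
    """Fine-grained memory entry: one time span with per-modality notes."""
    start: float
    end: float
    visual: str
    audio: str


@dataclass
class HierarchicalMemory:
    """Assumed two-level layout: coarse summaries index fine-grained segments."""
    summaries: List[str] = field(default_factory=list)
    segments: List[List[Segment]] = field(default_factory=list)

    def add(self, summary: str, segs: List[Segment]) -> None:
        self.summaries.append(summary)
        self.segments.append(segs)

    def search(self, keyword: str) -> List[Segment]:
        # Coarse level first: match the summary, then return its segments.
        hits: List[Segment] = []
        for summary, segs in zip(self.summaries, self.segments):
            if keyword.lower() in summary.lower():
                hits.extend(segs)
        return hits


def run_agent(
    question: str,
    memory: HierarchicalMemory,
    plan: Callable[[str, List[Segment]], str],
    reflect: Callable[[str, List[Segment]], Optional[str]],
    max_steps: int = 4,
) -> Tuple[Optional[str], List[Segment]]:
    """Observe-reflect-replan loop; plan/reflect stand in for model calls."""
    evidence: List[Segment] = []
    query = plan(question, evidence)          # initial plan
    for _ in range(max_steps):
        for seg in memory.search(query):      # observe
            if seg not in evidence:
                evidence.append(seg)
        answer = reflect(question, evidence)  # reflect: answer or None
        if answer is not None:
            return answer, evidence
        query = plan(question, evidence)      # replan with what is known
    return None, evidence


def toy_plan(question: str, evidence: List[Segment]) -> str:
    # Hop 1 looks for the dog; hop 2 looks for the musician.
    return "dog" if not evidence else "musician"


def toy_reflect(question: str, evidence: List[Segment]) -> Optional[str]:
    # Answer only once both the audio cue and the later event are found.
    heard_bark = any("barking" in s.audio for s in evidence)
    heard_guitar = any("guitar" in s.audio for s in evidence)
    return "guitar" if heard_bark and heard_guitar else None


memory = HierarchicalMemory()
memory.add("dog in yard", [Segment(10.0, 15.0, "dog runs", "barking")])
memory.add("street musician", [Segment(40.0, 50.0, "man with guitar", "guitar music")])

answer, evidence = run_agent(
    "What instrument plays after the dog barks?", memory, toy_plan, toy_reflect
)
print(answer, [(s.start, s.end) for s in evidence])
```

In this toy run the first hop retrieves only the barking segment, reflection finds the evidence insufficient, and the replanned query retrieves the second segment. That mirrors, in a very simplified form, why temporally dispersed evidence calls for more than one perception step.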
Key facts
- MOV-Bench contains 519 curated questions
- Questions require multi-hop reasoning over temporally dispersed audio-visual evidence
- Current Omni-LLMs struggle with multi-hop cross-modal reasoning
- AOP-Agent is built on open-source Omni-LLMs
- AOP-Agent uses hierarchical omni-modal memory
- AOP-Agent employs a collaborative observe-reflect-replan loop
- The work is published on arXiv with ID 2605.28192
- The paper addresses challenges in multi-hop audio-visual reasoning
Entities
Institutions
- arXiv