MOV-Bench and AOP-Agent: Advancing Multi-Hop Audio-Visual Reasoning

ai-technology · 2026-05-28

MOV-Bench is a new benchmark of 519 curated questions that evaluates multi-hop reasoning over temporally dispersed audio-visual evidence. Evaluation results indicate that current Omni-LLMs struggle with multi-hop cross-modal reasoning. To address this, the researchers introduce AOP-Agent, a streamlined agentic framework that builds on open-source Omni-LLMs for active omni-modal perception. AOP-Agent combines a hierarchical omni-modal memory with a collaborative observe-reflect-replan loop, which lets open-source Omni-LLMs perceive actively rather than process the whole input in a single pass.

Key facts

MOV-Bench contains 519 curated questions
Questions require multi-hop reasoning over temporally dispersed audio-visual evidence
Current Omni-LLMs struggle with multi-hop cross-modal reasoning
AOP-Agent is built on open-source Omni-LLMs
AOP-Agent uses hierarchical omni-modal memory
AOP-Agent employs a collaborative observe-reflect-replan loop
The work is published on arXiv with ID 2605.28192
The paper addresses challenges in multi-hop audio-visual reasoning
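
To make the observe-reflect-replan idea concrete, here is a minimal Python sketch of how such a loop could be wired to a two-level memory. It is an illustration under stated assumptions, not the paper's implementation: the names `Observation`, `HierarchicalMemory`, `run_agent`, the `toy_*` functions, and the `TOY_CLIP` data are all hypothetical, and the Omni-LLM calls are replaced by simple keyword stubs.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Observation:
    start: float   # segment start time in seconds
    end: float     # segment end time in seconds
    modality: str  # "audio" or "video"
    note: str      # what the (stubbed) model perceived


@dataclass
class HierarchicalMemory:
    """Hypothetical two-level memory: coarse summaries over fine observations."""
    summaries: dict[int, str] = field(default_factory=dict)
    details: dict[int, list[Observation]] = field(default_factory=dict)

    def add(self, segment: int, obs: Observation) -> None:
        # Store the fine-grained observation, then refresh the coarse summary.
        self.details.setdefault(segment, []).append(obs)
        self.summaries[segment] = "; ".join(o.note for o in self.details[segment])

    def evidence(self) -> list[Observation]:
        # All observations, ordered by segment index.
        return [o for seg in sorted(self.details) for o in self.details[seg]]


Plan = list[tuple[int, str]]  # (segment index, modality) pairs still to inspect


def run_agent(
    question: str,
    plan: Plan,
    observe: Callable[[int, str], Observation],
    reflect: Callable[[str, HierarchicalMemory], Optional[str]],
    replan: Callable[[str, HierarchicalMemory, Plan], Plan],
    max_steps: int = 8,
) -> tuple[Optional[str], HierarchicalMemory]:
    memory = HierarchicalMemory()
    plan = list(plan)  # copy so the caller's list is not mutated
    steps = 0
    while plan and steps < max_steps:
        segment, modality = plan.pop(0)
        memory.add(segment, observe(segment, modality))  # observe
        answer = reflect(question, memory)               # reflect
        if answer is not None:
            return answer, memory
        plan = replan(question, memory, plan)            # replan
        steps += 1
    return None, memory


# Toy stand-ins for an Omni-LLM (assumption: real systems would call a model).
TOY_CLIP = {
    (0, "video"): "a red phone rings",
    (2, "audio"): "caller says meet at noon",
}


def toy_observe(segment: int, modality: str) -> Observation:
    note = TOY_CLIP.get((segment, modality), "nothing relevant")
    return Observation(segment * 10.0, segment * 10.0 + 10.0, modality, note)


def toy_reflect(question: str, memory: HierarchicalMemory) -> Optional[str]:
    # Answer only once both hops (visual cue and audio statement) are in memory.
    notes = [o.note for o in memory.evidence()]
    saw_phone = any("red phone" in n for n in notes)
    heard_time = any("noon" in n for n in notes)
    return "noon" if saw_phone and heard_time else None


def toy_replan(question: str, memory: HierarchicalMemory, plan: Plan) -> Plan:
    # After the visual cue, queue audio checks for later segments if none are pending.
    notes = [o.note for o in memory.evidence()]
    if any("red phone" in n for n in notes) and not any(m == "audio" for _, m in plan):
        return plan + [(1, "audio"), (2, "audio")]
    return plan


answer, memory = run_agent(
    "When does the caller want to meet?",
    [(0, "video")],
    toy_observe,
    toy_reflect,
    toy_replan,
)
```

In this toy run the agent first looks at the video of segment 0, reflects that the evidence is incomplete, replans to listen to later segments, and stops once the audio in segment 2 supplies the second hop. The point of the sketch is only the control flow: evidence scattered across time and modality is gathered step by step instead of being read in one pass.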
