ARTFEED — Contemporary Art Intelligence

MOV-Bench and AOP-Agent: Advancing Multi-Hop Audio-Visual Reasoning

ai-technology · 2026-05-28

MOV-Bench is a new benchmark of 519 curated questions that tests multi-hop reasoning over audio-visual evidence scattered across time. Evaluations on it indicate that existing Omni-LLMs struggle with multi-hop cross-modal reasoning. To address this, the researchers introduce AOP-Agent, a streamlined agentic framework built on open-source Omni-LLMs for active omni-modal perception. AOP-Agent combines a hierarchical omni-modal memory with a collaborative observe-reflect-replan loop, so that open-source Omni-LLMs can perform active perception.
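To make the two components named above more concrete, here is a minimal toy sketch of how a hierarchical memory and an observe-reflect-replan loop might fit together. It is not the paper's implementation: every name in it (Clip, Segment, HierarchicalMemory, gather_evidence) is hypothetical, the two-level memory layout is an assumption, and plain keyword matching stands in for the Omni-LLM calls a real agent would make.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Clip:
    start: float  # seconds from the beginning of the video
    end: float
    note: str     # fine-grained audio or visual observation (hypothetical)


@dataclass
class Segment:
    summary: str  # coarse description of a longer span (hypothetical)
    clips: List[Clip] = field(default_factory=list)


class HierarchicalMemory:
    """Assumed two-level store: coarse segment summaries over fine clips."""

    def __init__(self, segments: List[Segment]) -> None:
        self.segments = segments

    def candidates(self, query: str, skip: Set[int]) -> List[int]:
        """Indices of not-yet-visited segments whose summary mentions query."""
        q = query.lower()
        return [i for i, s in enumerate(self.segments)
                if i not in skip and q in s.summary.lower()]

    def zoom(self, index: int) -> List[Clip]:
        """Return the fine-grained clips stored under one segment."""
        return self.segments[index].clips


def gather_evidence(plan: List[str], memory: HierarchicalMemory,
                    max_steps: int = 10) -> List[Tuple[str, Clip]]:
    """Resolve each hop in `plan` with a toy observe-reflect-replan loop."""
    evidence: List[Tuple[str, Clip]] = []
    pending = list(plan)        # hops that still need evidence
    visited: Set[int] = set()   # segments already inspected for this hop
    for _ in range(max_steps):
        if not pending:
            break
        query = pending[0]
        # Observe: pick a coarse segment, then zoom into its clips.
        hits = memory.candidates(query, visited)
        if not hits:
            # Replan: nothing left to inspect for this hop, so drop it.
            pending.pop(0)
            visited.clear()
            continue
        visited.add(hits[0])
        clips = memory.zoom(hits[0])
        # Reflect: does any clip actually support this hop?
        found = [c for c in clips if query.lower() in c.note.lower()]
        if found:
            evidence.append((query, found[0]))
            pending.pop(0)
            visited.clear()
        # Otherwise replan implicitly: the visited set steers the next
        # iteration toward a different segment for the same hop.
    return evidence


memory = HierarchicalMemory([
    Segment("kitchen, alarm mentioned", [Clip(0, 5, "person chops onions")]),
    Segment("hallway, alarm rings", [Clip(40, 45, "smoke alarm beeps")]),
    Segment("street, dog appears", [Clip(90, 95, "dog barks at a car")]),
])
result = gather_evidence(["alarm", "dog"], memory)
```

In this toy run the first segment looks relevant at the coarse level but contains no supporting clip, so the loop moves on to the second segment before resolving the next hop. A real system would replace the keyword checks with model calls over audio and video.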

Key facts

  • MOV-Bench contains 519 curated questions
  • Questions require multi-hop reasoning over temporally dispersed audio-visual evidence
  • Current Omni-LLMs struggle with multi-hop cross-modal reasoning
  • AOP-Agent is built on open-source Omni-LLMs
  • AOP-Agent uses hierarchical omni-modal memory
  • AOP-Agent employs a collaborative observe-reflect-replan loop
  • The work is published on arXiv with ID 2605.28192
  • The paper addresses challenges in multi-hop audio-visual reasoning

Entities

Institutions

  • arXiv