ARTFEED — Contemporary Art Intelligence

HiCrew: Hierarchical Multi-Agent Framework for Long-Form Video Understanding

ai-technology · 2026-04-25

Researchers have introduced HiCrew, a hierarchical multi-agent framework for long-form video understanding that targets two problems: spatiotemporal redundancy and intricate narrative dependencies. Described in an arXiv preprint (2604.21444), the framework makes three main contributions. A Hybrid Tree structure uses shot boundary detection to preserve temporal topology while performing relevance-guided hierarchical clustering within semantically coherent segments. A Question-Aware Captioning mechanism generates intent-driven visual descriptions. A collaborative multi-agent system adapts its reasoning strategy to each question, addressing the rigid workflows of current multi-agent frameworks. The stated aim is to improve causal reasoning over extended timeframes, which approaches that compress visual information often overlook.

Key facts

  • HiCrew is a hierarchical multi-agent framework for long-form video understanding.
  • It addresses spatiotemporal redundancy and intricate narrative dependencies.
  • The framework uses a Hybrid Tree structure with shot boundary detection.
  • It preserves temporal topology while performing relevance-guided hierarchical clustering.
  • A Question-Aware Captioning mechanism synthesizes intent-driven visual descriptions.
  • The multi-agent system adapts reasoning strategies to question-specific demands.
  • According to the authors, existing multi-agent frameworks rely on rigid, pre-defined workflows.
  • The paper is available on arXiv with ID 2604.21444.
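
To make the Hybrid Tree idea more concrete, here is a minimal sketch of a two-level tree: shot boundaries are found first, then frames inside each shot are grouped with a threshold that depends on relevance to the question. It is an illustration under stated assumptions, not the paper's implementation. The names `Node`, `detect_shots`, `cluster_shot` and `build_hybrid_tree`, the cosine-similarity features, and both thresholds are hypothetical, and a single pass of adjacent-frame grouping stands in for the hierarchical clustering the preprint describes.

```python
from dataclasses import dataclass, field
from math import sqrt


@dataclass
class Node:
    start: int  # first frame index covered (inclusive)
    end: int    # last frame index covered (inclusive)
    children: list = field(default_factory=list)


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def detect_shots(features, cut_threshold=0.5):
    """Assumed shot detector: cut wherever adjacent frames are dissimilar."""
    shots, start = [], 0
    for i in range(1, len(features)):
        if cosine(features[i - 1], features[i]) < cut_threshold:
            shots.append((start, i - 1))
            start = i
    shots.append((start, len(features) - 1))
    return shots


def cluster_shot(features, query, start, end, base_threshold=0.9):
    """Group adjacent frames inside one shot, never across its boundaries.

    Frames that are relevant to the query need a higher similarity to be
    merged, so question-relevant regions keep finer temporal detail.
    """
    groups, group_start = [], start
    for i in range(start + 1, end + 1):
        relevance = max(0.0, cosine(features[i], query))
        merge_threshold = base_threshold + (1.0 - base_threshold) * relevance
        if cosine(features[i - 1], features[i]) < merge_threshold:
            groups.append(Node(group_start, i - 1))
            group_start = i
    groups.append(Node(group_start, end))
    return groups


def build_hybrid_tree(features, query):
    """Root -> shots (temporal order preserved) -> relevance-guided groups."""
    if not features:
        raise ValueError("features must contain at least one frame")
    root = Node(0, len(features) - 1)
    for s, e in detect_shots(features):
        root.children.append(Node(s, e, cluster_shot(features, query, s, e)))
    return root
```

Because shots are detected before any grouping happens, clusters can never span a cut, which is one simple way to keep the temporal order of the video intact. In this toy version, a shot that matches the question embedding is split into more, smaller groups, while an unrelated shot collapses into a single node.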

Entities

Institutions

  • arXiv
