HiCrew: Hierarchical Multi-Agent Framework for Long-Form Video Understanding

ai-technology · 2026-04-25

Researchers have introduced HiCrew, a hierarchical multi-agent framework for long-form video understanding that targets two problems: spatiotemporal redundancy and intricate narrative dependencies. Described in a preprint on arXiv (2604.21444), the framework has three main components. A Hybrid Tree structure uses shot boundary detection to preserve temporal topology while performing relevance-guided hierarchical clustering within semantically coherent segments. A Question-Aware Captioning mechanism generates intent-driven visual descriptions. A collaborative multi-agent system adapts its reasoning strategy to each question, in contrast to the rigid, pre-defined workflows of existing multi-agent frameworks. The work aims to improve causal reasoning over extended timeframes, which is often overlooked when visual information is compressed.

Key facts

HiCrew is a hierarchical multi-agent framework for long-form video understanding.
It addresses spatiotemporal redundancy and intricate narrative dependencies.
The framework uses a Hybrid Tree structure with shot boundary detection.
It preserves temporal topology while performing relevance-guided hierarchical clustering.
A Question-Aware Captioning mechanism synthesizes intent-driven visual descriptions.
The multi-agent system adapts reasoning strategies to question-specific demands.
Existing multi-agent frameworks use rigid, pre-defined workflows.
The paper is available on arXiv with ID 2604.21444.
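
The Hybrid Tree idea in the facts above can be illustrated with a minimal Python sketch. This is not the paper's implementation. It assumes per-frame feature vectors and per-frame relevance scores for the question are already available, it detects shot boundaries with a simple frame-difference threshold, and it uses recursive halving of relevant segments as a stand-in for relevance-guided hierarchical clustering. All names (Node, detect_shot_boundaries, split_by_relevance, build_hybrid_tree) and all parameter values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    start: int  # first frame index (inclusive)
    end: int    # last frame index (exclusive)
    children: List["Node"] = field(default_factory=list)


def detect_shot_boundaries(frame_feats, threshold):
    """Return the start index of each shot.

    Assumption: a new shot starts wherever the L1 distance between
    consecutive frame features exceeds `threshold`.
    """
    cuts = [0]
    for i in range(1, len(frame_feats)):
        dist = sum(abs(a - b) for a, b in zip(frame_feats[i - 1], frame_feats[i]))
        if dist > threshold:
            cuts.append(i)
    return cuts


def split_by_relevance(node, relevance, min_rel, min_len):
    """Refine a segment only while it is relevant to the question.

    Stand-in for relevance-guided clustering: a segment is halved when its
    mean relevance is at least `min_rel` and both halves keep `min_len` frames.
    """
    length = node.end - node.start
    mean_rel = sum(relevance[node.start:node.end]) / length
    if length < 2 * min_len or mean_rel < min_rel:
        return node  # keep as a coarse leaf
    mid = node.start + length // 2
    node.children = [
        split_by_relevance(Node(node.start, mid), relevance, min_rel, min_len),
        split_by_relevance(Node(mid, node.end), relevance, min_rel, min_len),
    ]
    return node


def build_hybrid_tree(frame_feats, relevance, threshold=1.0, min_rel=0.5, min_len=2):
    """Root -> shots in temporal order -> relevance-driven sub-segments."""
    root = Node(0, len(frame_feats))
    if not frame_feats:
        return root
    cuts = detect_shot_boundaries(frame_feats, threshold) + [len(frame_feats)]
    for start, end in zip(cuts, cuts[1:]):
        shot = Node(start, end)
        root.children.append(split_by_relevance(shot, relevance, min_rel, min_len))
    return root


if __name__ == "__main__":
    # Toy input: two shots of four frames; only the first is relevant.
    feats = [[0.0], [0.1], [0.2], [0.3], [5.0], [5.1], [5.2], [5.3]]
    rel = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1]
    tree = build_hybrid_tree(feats, rel)
    for shot in tree.children:
        print((shot.start, shot.end), [(c.start, c.end) for c in shot.children])
```

In this sketch, shots stay in their original order under the root, so the temporal layout is kept, while only question-relevant shots are subdivided further. A real system would likely replace the halving step with clustering over learned visual features.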
