MAVEN: AI Pipeline Generates Video Reasoning Training Data

ai-technology · 2026-05-23

MAVEN (Multi-stage Agentic Video Event aNnotation) is a pipeline that converts raw videos into multi-task training datasets for Vision Language Models (VLMs), with Chain-of-Thought (CoT) reasoning traces centered on a specific Event of Focus. At its core, MAVEN synthesizes a Multi-Scale Spatio-Temporal Event Description (MSTED) from three complementary caption levels, and the MSTED is the sole input for generating Q&A across multiple task formats. The pipeline supports agent-driven domain adaptation: given a new video dataset and sample target questions, the agent reconfigures all of the pipeline's prompts without manual re-engineering. A hierarchical refinement loop classifies the outputs. Together, these features address the need for high-quality structured annotations that capture what happened, when, where, why, and with what consequences, at a scale that manual labeling cannot reach.
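
The data flow described above (three caption levels combined into one MSTED, which is then the only input to Q&A generation) can be pictured with a small Python sketch. This is an illustration under stated assumptions, not MAVEN's actual code: the class names, the function names `synthesize_msted` and `build_qa_prompt`, the placeholder level labels, and the task format string are all hypothetical, and the summary does not say what the three caption levels are or how the prompts are worded.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Caption:
    """One caption at a given level (level labels are hypothetical placeholders)."""
    level: str
    text: str


@dataclass(frozen=True)
class MSTED:
    """Illustrative stand-in for a Multi-Scale Spatio-Temporal Event Description."""
    event_of_focus: str
    captions: tuple[Caption, ...]

    def render(self) -> str:
        # Flatten the description into plain text for use in a prompt.
        lines = [f"Event of Focus: {self.event_of_focus}"]
        lines += [f"[{c.level}] {c.text}" for c in self.captions]
        return "\n".join(lines)


def synthesize_msted(event_of_focus: str, captions: list[Caption]) -> MSTED:
    """Combine exactly three caption levels into one description (assumed rule)."""
    if len(captions) != 3:
        raise ValueError("expected exactly three caption levels")
    return MSTED(event_of_focus, tuple(captions))


def build_qa_prompt(msted: MSTED, task_format: str) -> str:
    """Build a Q&A generation prompt whose only video-derived input is the MSTED."""
    return (
        f"Task format: {task_format}\n"
        f"{msted.render()}\n"
        "Write one question, a step-by-step reasoning trace, and an answer."
    )


if __name__ == "__main__":
    captions = [
        Caption("level_1", "A car approaches an intersection."),
        Caption("level_2", "The traffic light turns red as the car enters."),
        Caption("level_3", "A pedestrian steps back onto the curb."),
    ]
    msted = synthesize_msted("car runs a red light", captions)
    print(build_qa_prompt(msted, "multiple_choice"))
```

Keeping the prompt builder's signature limited to the MSTED and a task format mirrors the summary's claim that the MSTED is the sole input for Q&A generation. In a real system a model call would replace the final prompt string.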

Key facts

MAVEN is a multi-stage agentic pipeline for video event annotation.
It generates Chain-of-Thought reasoning traces for VLMs.
The pipeline synthesizes a Multi-Scale Spatio-Temporal Event Description (MSTED).
MSTED uses three complementary caption levels.
MAVEN supports agent-driven domain adaptation without manual re-engineering.
It includes a hierarchical refinement loop for classification.
The system addresses the need for scalable structured annotations.
The paper is available on arXiv with ID 2605.21917.
