MAVEN: AI Pipeline Generates Video Reasoning Training Data
MAVEN (Multi-stage Agentic Video Event aNnotation) is a pipeline that converts raw videos into multi-task training datasets for Vision Language Models (VLMs). It produces Chain-of-Thought (CoT) reasoning traces centered on a specific Event of Focus. At its core, MAVEN synthesizes a Multi-Scale Spatio-Temporal Event Description (MSTED) from three complementary caption levels; the MSTED is then the sole input for generating Q&A across task formats. The pipeline supports agent-driven domain adaptation: given a new video dataset and sample target questions, the agent reconfigures all prompts top-down without manual re-engineering. A hierarchical refinement loop additionally classifies the outputs. Together, these features address the need for high-quality structured annotations that describe what happened, when, where, why, and with what consequences, at a scale that manual labeling cannot reach.
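The data flow described above can be illustrated with a minimal sketch. It is not MAVEN's actual implementation: every name here (MSTED, QAItem, synthesize_msted, generate_qa, toy_generator) is hypothetical, the three caption levels are treated as opaque strings, and simple string handling stands in for the model calls. The only property it tries to capture is that Q&A generation receives the MSTED alone, never the raw video or the individual captions.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class MSTED:
    """Hypothetical container for the event description (field names are assumptions)."""
    event_of_focus: str
    description: str


@dataclass(frozen=True)
class QAItem:
    """Hypothetical Q&A record carrying a chain-of-thought trace."""
    task: str
    question: str
    reasoning: str
    answer: str


def synthesize_msted(event_of_focus: str, captions: Tuple[str, str, str]) -> MSTED:
    # Stand-in for a model call: joins the three caption levels into one description.
    return MSTED(event_of_focus, " ".join(c.strip() for c in captions))


def generate_qa(
    msted: MSTED,
    tasks: List[str],
    generator: Callable[[str, MSTED], QAItem],
) -> List[QAItem]:
    # The generator is handed only the MSTED, mirroring the "sole input" constraint.
    return [generator(task, msted) for task in tasks]


def toy_generator(task: str, msted: MSTED) -> QAItem:
    # Placeholder for a prompt-driven generator; produces deterministic text.
    return QAItem(
        task=task,
        question=f"[{task}] What happened during '{msted.event_of_focus}'?",
        reasoning=f"Based on the description: {msted.description}",
        answer=msted.event_of_focus,
    )


if __name__ == "__main__":
    msted = synthesize_msted(
        "door opens",
        ("caption level one", "caption level two", "caption level three"),
    )
    for item in generate_qa(msted, ["multiple_choice", "open_ended"], toy_generator):
        print(item.task, "|", item.question)
```

Passing the generator in as a callable keeps the task formats and prompts swappable, which is one plausible way to leave room for the prompt reconfiguration that the domain-adaptation step performs.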
Key facts
- MAVEN is a multi-stage agentic pipeline for video event annotation.
- It generates Chain-of-Thought reasoning traces for VLMs.
- The pipeline synthesizes a Multi-Scale Spatio-Temporal Event Description (MSTED).
- MSTED uses three complementary caption levels.
- MAVEN supports agent-driven domain adaptation without manual re-engineering.
- It includes a hierarchical refinement loop for classification.
- The system addresses the need for scalable structured annotations.
- The paper is available on arXiv with ID 2605.21917.
Entities
Institutions
- arXiv