MAVEN Framework Enhances Cultural Fidelity in Text-to-Video AI

ai-technology · 2026-05-20

Researchers have developed MAVEN, a multi-agent prompt refinement framework that aims to improve cultural fidelity in text-to-video (T2V) generation. The system decomposes each prompt into three dimensions (person, action, and location), each handled by a specialized agent; the agents can run either in parallel or in sequence. The researchers also built a benchmark of 243 culturally relevant prompts and 972 associated videos, spanning three cultures (Chinese, American, Romanian), three action categories, and both mono-cultural and cross-cultural scenarios. Evaluations with CLIP-based metrics, VLM-as-judge assessments, and video quality measures indicate that multi-agent refinement improves cultural relevance, with parallel specialization performing best among the configurations tested.
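The difference between running the three agents in parallel and in sequence can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names `make_agent`, `refine_parallel`, and `refine_sequential` are hypothetical, the agents are placeholder functions rather than language-model calls, and the way outputs are merged is not taken from the MAVEN paper.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

Agent = Callable[[str], str]


def make_agent(dimension: str) -> Agent:
    """Build a placeholder agent for one dimension (hypothetical stand-in)."""
    def agent(prompt: str) -> str:
        # A real agent would query a language model here; this stub only
        # records which dimension it covers and which text it was shown.
        return f"[{dimension} details for: {prompt}]"
    return agent


# The three dimensions named in the summary: person, action, location.
AGENTS: Dict[str, Agent] = {
    dim: make_agent(dim) for dim in ("person", "action", "location")
}


def refine_parallel(prompt: str) -> str:
    """Every agent sees the original prompt; the notes are merged afterwards."""
    with ThreadPoolExecutor() as pool:
        futures = {dim: pool.submit(agent, prompt) for dim, agent in AGENTS.items()}
        notes = [futures[dim].result() for dim in AGENTS]
    return prompt + " " + " ".join(notes)


def refine_sequential(prompt: str) -> str:
    """Each agent sees the prompt as already refined by the agents before it."""
    current = prompt
    for agent in AGENTS.values():
        current = current + " " + agent(current)
    return current


if __name__ == "__main__":
    print(refine_parallel("a wedding dance"))
    print(refine_sequential("a wedding dance"))
```

In the parallel variant each agent works only from the original prompt, so the three refinements are independent of one another; in the sequential variant later agents can react to earlier refinements, at the cost of inheriting any errors those refinements introduce.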

Key facts

MAVEN is a multi-agent prompt refinement framework for T2V generation.
It decomposes prompts into person, action, and location dimensions.
A benchmark of 243 prompts and 972 videos was created.
Cultures covered: Chinese, American, Romanian.
Three action categories are included.
Both mono-cultural and cross-cultural scenarios are evaluated.
Parallel specialization outperforms other configurations.
Evaluations used CLIP-based metrics, VLM-as-judge, and video quality measures.
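
CLIP-based metrics commonly score how well a video matches a text by comparing embedding vectors with cosine similarity. The sketch below assumes that general form and averages the similarity over frames; the summary does not specify the exact metrics used for MAVEN, and the function names and toy vectors here are hypothetical rather than real CLIP outputs.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def clip_style_score(text_emb: Sequence[float],
                     frame_embs: Sequence[Sequence[float]]) -> float:
    """Average text-to-frame cosine similarity over all frames of a video."""
    if not frame_embs:
        raise ValueError("at least one frame embedding is required")
    return sum(cosine(text_emb, f) for f in frame_embs) / len(frame_embs)


if __name__ == "__main__":
    # Toy 2-D vectors standing in for embeddings of a text and two frames.
    print(clip_style_score([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))
```

In practice the embeddings would come from a CLIP text encoder and image encoder, and a higher score would indicate that the frames sit closer to the culturally specific description in embedding space.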

Sources

arXiv cs.AI — 2026-05-19