ARTFEED — Contemporary Art Intelligence

MAVEN Framework Enhances Cultural Fidelity in Text-to-Video AI

ai-technology · 2026-05-20

Researchers have developed MAVEN, a multi-agent prompt refinement framework intended to improve cultural fidelity in text-to-video (T2V) generation. The system decomposes each prompt into three dimensions (person, action, and location), each handled by a specialized agent; the agents can run either in parallel or in sequence. The researchers also built a benchmark of 243 culturally relevant prompts and 972 associated videos, covering three cultures (Chinese, American, Romanian), three action categories, and both mono-cultural and cross-cultural scenarios. Evaluations using CLIP-based metrics, VLM-as-judge assessments, and video quality indicators show that multi-agent refinement substantially improves cultural relevance, with parallel specialization performing best.
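To make the parallel versus sequential distinction concrete, here is a minimal Python sketch. It is not MAVEN's actual code: the `Prompt` class, the `make_agent` stub, and the idea of assigning one culture per dimension for cross-cultural prompts are all illustrative assumptions, and the stub agents simply append a tag where a real system would presumably call a language model.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, replace
from typing import Callable, Dict

# The three dimensions named in the article.
DIMENSIONS = ("person", "action", "location")


@dataclass(frozen=True)
class Prompt:
    """Hypothetical container for a decomposed T2V prompt."""
    person: str
    action: str
    location: str

    def render(self) -> str:
        # Reassemble the three parts into a single prompt string.
        return f"{self.person} {self.action} {self.location}"


def make_agent(dimension: str) -> Callable[[str, str], str]:
    """Build a stub agent (assumption: stands in for an LLM-backed refiner)."""
    def agent(text: str, culture: str) -> str:
        return f"{text} [{culture} {dimension} details]"
    return agent


AGENTS: Dict[str, Callable[[str, str], str]] = {d: make_agent(d) for d in DIMENSIONS}


def refine_parallel(prompt: Prompt, cultures: Dict[str, str]) -> Prompt:
    """Each agent refines its own dimension independently, all at once."""
    with ThreadPoolExecutor(max_workers=len(DIMENSIONS)) as pool:
        futures = {
            d: pool.submit(AGENTS[d], getattr(prompt, d), cultures[d])
            for d in DIMENSIONS
        }
        return Prompt(**{d: f.result() for d, f in futures.items()})


def refine_sequential(prompt: Prompt, cultures: Dict[str, str]) -> Prompt:
    """Agents run one after another on the progressively refined prompt."""
    current = prompt
    for d in DIMENSIONS:
        refined = AGENTS[d](getattr(current, d), cultures[d])
        current = replace(current, **{d: refined})
    return current


if __name__ == "__main__":
    base = Prompt("a woman", "preparing tea", "in a courtyard")
    # Assumed cross-cultural setup: a different culture per dimension.
    mixed = {"person": "Romanian", "action": "Chinese", "location": "American"}
    print(refine_parallel(base, mixed).render())
```

With these stubs both modes return the same result, because each stub looks only at its own dimension. In a real pipeline the sequential mode would let a later agent see earlier refinements, which is where the two configurations could diverge.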

Key facts

  • MAVEN is a multi-agent prompt refinement framework for T2V generation.
  • It decomposes prompts into person, action, and location dimensions.
  • A benchmark of 243 prompts and 972 videos was created.
  • Cultures covered: Chinese, American, Romanian.
  • Three action categories are included.
  • Both mono-cultural and cross-cultural scenarios are evaluated.
  • Parallel specialization outperforms other configurations.
  • Evaluations used CLIP-based metrics, VLM-as-judge, and video quality measures.
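The article does not say which CLIP-based metrics were used. As one common pattern, a text-to-video alignment score can be computed as the mean cosine similarity between a prompt embedding and per-frame embeddings. The sketch below assumes the embeddings have already been produced by some CLIP-style encoder; the function names and the toy vectors are hypothetical and are not taken from the MAVEN evaluation.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def clip_style_score(text_emb: Sequence[float],
                     frame_embs: Sequence[Sequence[float]]) -> float:
    """Average text-to-frame similarity over all sampled frames (assumed metric)."""
    if not frame_embs:
        raise ValueError("at least one frame embedding is required")
    return sum(cosine(text_emb, f) for f in frame_embs) / len(frame_embs)


if __name__ == "__main__":
    # Toy 2-D embeddings: one frame matches the text, one is orthogonal.
    text = [1.0, 0.0]
    frames = [[1.0, 0.0], [0.0, 1.0]]
    print(clip_style_score(text, frames))
```

Averaging over frames is only one design choice; taking the maximum or weighting frames differently are equally plausible alternatives.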
