MAVEN Framework Enhances Cultural Fidelity in Text-to-Video AI
Researchers have developed MAVEN, a multi-agent prompt refinement framework intended to improve cultural fidelity in text-to-video (T2V) generation. The system decomposes each prompt into three dimensions (person, action, and location), each handled by a specialized agent; the agents can run either in parallel or sequentially. The researchers also built a benchmark of 243 culturally relevant prompts and 972 associated videos. It covers three cultures (Chinese, American, Romanian) and three action categories, in both mono-cultural and cross-cultural scenarios. Evaluations using CLIP-based metrics, VLM-as-judge assessments, and video quality indicators show that multi-agent refinement substantially improves cultural relevance, particularly with parallel specialization.
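The difference between the parallel and sequential modes can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Prompt` class, the `refine` stub, and the bracketed output format are hypothetical stand-ins, and a real agent would presumably call a language model where `refine` simply tags the text. The sketch assumes that parallel agents each see only their own dimension, while sequential agents also receive the earlier agents' output.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

DIMENSIONS = ("person", "action", "location")


@dataclass
class Prompt:
    """A T2V prompt split into the three dimensions named in the article."""
    person: str
    action: str
    location: str


def refine(dimension, text, culture, context=()):
    """Hypothetical agent stub: tags the text instead of calling an LLM."""
    tag = f"{dimension}: {culture}"
    if context:
        tag += f", consistent with {len(context)} earlier refinement(s)"
    return f"{text} [{tag}]"


def refine_parallel(prompt, cultures):
    """Each agent refines its own dimension independently and concurrently."""
    with ThreadPoolExecutor(max_workers=len(DIMENSIONS)) as pool:
        futures = {
            d: pool.submit(refine, d, getattr(prompt, d), cultures[d])
            for d in DIMENSIONS
        }
        return Prompt(**{d: f.result() for d, f in futures.items()})


def refine_sequential(prompt, cultures):
    """Agents run one after another; each sees the earlier refinements."""
    refined = {}
    for d in DIMENSIONS:
        context = tuple(refined.values())
        refined[d] = refine(d, getattr(prompt, d), cultures[d], context)
    return Prompt(**refined)
```

Passing a per-dimension `cultures` mapping (for example, a Chinese person and action in a Romanian location) lets the same sketch cover cross-cultural prompts as well as mono-cultural ones.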
Key facts
- MAVEN is a multi-agent prompt refinement framework for T2V generation.
- It decomposes prompts into person, action, and location dimensions.
- A benchmark of 243 prompts and 972 videos was created.
- Cultures covered: Chinese, American, Romanian.
- Three action categories are included.
- Both mono-cultural and cross-cultural scenarios are evaluated.
- Parallel specialization yields the largest gains in cultural relevance among the multi-agent configurations.
- Evaluations used CLIP-based metrics, VLM-as-judge, and video quality measures.
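The mono-cultural versus cross-cultural distinction can be made concrete with a short sketch. The article does not say how the benchmark defines these scenarios, so the rule below is an assumption: each of the three dimensions is assigned one culture, and a prompt counts as mono-cultural only when all three match. The function names are hypothetical, and the enumeration is illustrative only; it is not meant to reproduce the benchmark's 243 prompts.

```python
from collections import Counter
from itertools import product

CULTURES = ("Chinese", "American", "Romanian")
DIMENSIONS = ("person", "action", "location")


def scenario_type(assignment):
    """Assumed rule: mono-cultural if every dimension shares one culture."""
    if len(set(assignment.values())) == 1:
        return "mono-cultural"
    return "cross-cultural"


def all_assignments():
    """Yield every way of assigning one culture to each dimension."""
    for combo in product(CULTURES, repeat=len(DIMENSIONS)):
        yield dict(zip(DIMENSIONS, combo))


def count_scenarios():
    """Count how many assignments fall under each scenario type."""
    return Counter(scenario_type(a) for a in all_assignments())
```

Under this assumed rule, three cultures across three dimensions give 27 assignments, of which 3 are mono-cultural and 24 are cross-cultural.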