AtelierEval: Benchmarking Prompting Proficiency in Text-to-Image Systems
Researchers have introduced AtelierEval, a benchmark that measures how well humans and multimodal large language models (MLLMs) write prompts for text-to-image (T2I) systems. Whereas existing benchmarks evaluate T2I models on fixed prompts, AtelierEval evaluates the prompter. It comprises 360 expert-crafted tasks, designed from a cognitive perspective and split into three categories that target real-world challenges, and it offers a dual interface for human users and MLLMs. To make evaluation scalable and reliable, the team developed AtelierJudge, a skill-based, memory-augmented agentic evaluator that assigns subjective and objective scores to prompt-image pairs and achieves a Spearman correlation of 0.79 with human expert judgments. The paper is available on arXiv under the identifier 2605.22645.
Key facts
- AtelierEval is presented as the first unified benchmark for prompting proficiency in T2I systems.
- It includes 360 expert-crafted tasks across three categories.
- AtelierJudge is a skill-based, memory-augmented agentic evaluator.
- AtelierJudge achieves a Spearman correlation of 0.79 with human experts.
- The benchmark has a dual interface for humans and MLLMs.
- Eight MLLMs were benchmarked in extensive experiments.
- The research is published on arXiv (2605.22645).
- Existing benchmarks evaluate only T2I models, not the prompters who use them.
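For readers unfamiliar with the agreement metric, Spearman correlation compares how two sets of scores rank the same items, rather than comparing the raw values. The following is a minimal sketch of that calculation in plain Python. It is not AtelierJudge's code: the function names and the five example scores are hypothetical, chosen only to illustrate how an automated judge's scores could be compared with human ratings.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length, at least 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for five prompt-image pairs (illustrative only).
human_scores = [4.5, 3.0, 2.0, 5.0, 1.5]
judge_scores = [4.0, 3.5, 1.0, 4.8, 2.0]
rho = spearman(human_scores, judge_scores)
print(f"Spearman correlation: {rho:.2f}")
```

A value of 1.0 means the two raters order every item identically, and -1.0 means the orderings are exactly reversed, so a reported 0.79 indicates strong but imperfect agreement in ranking.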
Entities
Institutions
- arXiv