AtelierEval: Benchmarking Prompting Proficiency in Text-to-Image Systems

ai-technology · 2026-05-23

Researchers have introduced AtelierEval, a benchmark that measures how well humans and multimodal large language models (MLLMs) write prompts for text-to-image (T2I) systems. Existing benchmarks evaluate T2I models on fixed prompts and leave the prompter unassessed. AtelierEval instead comprises 360 expert-crafted tasks, designed from a cognitive perspective and split into three categories that address real-world issues. It provides a dual interface for human users and MLLMs. To make evaluation scalable and reliable, the team developed AtelierJudge, an evaluator that assigns both subjective and objective scores to prompt-image pairs and achieves a Spearman correlation of 0.79 with human experts. The paper is available on arXiv under the identifier 2605.22645.
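For readers unfamiliar with the agreement metric, the sketch below shows how a Spearman rank correlation between an automated evaluator and human raters can be computed. It is a generic illustration, not the paper's evaluation code: the function names and the five toy scores are assumptions made up for this example and are not taken from AtelierEval.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy scores for five prompt-image pairs (illustrative only).
human_scores = [4.5, 3.0, 2.0, 5.0, 1.0]
judge_scores = [4.0, 3.5, 1.5, 4.8, 2.0]

print(round(spearman(human_scores, judge_scores), 2))
```

Because the metric depends only on rank order, an evaluator can score on a different scale than the human raters and still correlate highly, as long as it orders the prompt-image pairs similarly.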

Key facts

AtelierEval is the first unified benchmark for prompting proficiency in T2I systems.
It includes 360 expert-crafted tasks across three categories.
AtelierJudge is a skill-based, memory-augmented agentic evaluator.
AtelierJudge achieves a Spearman correlation of 0.79 with human experts.
The benchmark has a dual interface for humans and MLLMs.
Eight MLLMs were benchmarked in extensive experiments.
The research is published on arXiv (2605.22645).
Current benchmarks only evaluate T2I models, not prompters.
