ARTFEED — Contemporary Art Intelligence

AtelierEval: Benchmarking Prompting Proficiency in Text-to-Image Systems

ai-technology · 2026-05-23

Researchers have introduced AtelierEval, a benchmark that measures how proficiently both humans and multimodal large language models (MLLMs) write prompts for text-to-image (T2I) systems. Unlike earlier benchmarks, which evaluate T2I models on fixed prompts rather than the prompters themselves, AtelierEval comprises 360 expert-crafted tasks, designed from a cognitive perspective and split into three categories that address real-world issues. It provides a dual interface for human users and MLLMs. To improve scalability and reliability, the team developed AtelierJudge, an evaluator that assigns both subjective and objective scores to prompt-image pairs and achieves a Spearman correlation of 0.79 with human experts. The research paper is available on arXiv under the identifier 2605.22645.
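For readers unfamiliar with the agreement metric, the sketch below shows how a Spearman rank correlation between an automated judge and human raters can be computed: both score lists are converted to ranks (tied values share the average rank), and the Pearson correlation of those ranks is returned. The function names and the score lists are hypothetical illustrations, not data or code from the paper, and this is not AtelierJudge's actual scoring pipeline.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values equal to the one at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for five prompt-image pairs (illustration only).
human_scores = [1, 2, 3, 4, 5]
judge_scores = [2, 1, 3, 5, 4]
print(round(spearman(human_scores, judge_scores), 2))  # 0.8
```

A value near 1 means the judge orders the items almost exactly as the human raters do; a reported 0.79 therefore indicates strong, though not perfect, agreement in ranking.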

Key facts

  • AtelierEval is the first unified benchmark for prompting proficiency in T2I systems.
  • It includes 360 expert-crafted tasks across three categories.
  • AtelierJudge is a skill-based, memory-augmented agentic evaluator.
  • AtelierJudge achieves a Spearman correlation of 0.79 with human experts.
  • The benchmark has a dual interface for humans and MLLMs.
  • Eight MLLMs were benchmarked in extensive experiments.
  • The research paper is available on arXiv (2605.22645).
  • Current benchmarks only evaluate T2I models, not prompters.

Entities

Institutions

  • arXiv

Sources