ARTFEED — Contemporary Art Intelligence

New Benchmark Tests AI's Aesthetic Judgment Against Human Experts

publication · 2026-05-14

Researchers have introduced the Visual Aesthetic Benchmark (VAB), a dataset for testing whether multimodal large language models (MLLMs) can make aesthetic judgments that match those of human experts. The study, posted on arXiv, challenges the common practice of collapsing aesthetic assessment into a single score per image: in a controlled experiment with eight expert annotators, rankings derived from scores aligned poorly with the same experts' direct comparisons, while direct ranking produced significantly higher inter-annotator agreement on the best and worst images. VAB comprises 400 tasks over 1,195 images spanning fine art, photography, and illustration, with labels derived from the consensus of ten independent expert annotators, with the aim of grounding AI aesthetic evaluation in expert judgment.
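The mismatch described above can be made concrete with a small sketch: derive a ranking from per-image scores, compare it against a direct side-by-side ranking, and measure agreement with a rank correlation. The annotator data and the choice of Kendall's tau here are illustrative assumptions, not details taken from the paper:

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall rank correlation between two rankings of the same items.

    rank_a, rank_b: dicts mapping item -> rank position (1 = best).
    Returns a value in [-1, 1]; 1.0 means identical orderings.
    """
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        a = rank_a[x] - rank_a[y]
        b = rank_b[x] - rank_b[y]
        if a * b > 0:
            concordant += 1
        elif a * b < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical annotator data for five images (A-E).
# Scores assigned one image at a time, in isolation:
scores = {"A": 7, "B": 6, "C": 8, "D": 5, "E": 6}

# Ranking implied by those scores (ties broken by stable sort order):
score_rank = {img: pos + 1
              for pos, (img, _) in enumerate(
                  sorted(scores.items(), key=lambda kv: -kv[1]))}

# The same annotator's direct side-by-side ranking:
direct_rank = {"A": 1, "B": 4, "C": 2, "D": 5, "E": 3}

print(kendall_tau(score_rank, direct_rank))  # 0.6 for these made-up rankings
```

A tau well below 1.0 means the score-derived ordering disagrees with the direct comparison, which is the kind of divergence the VAB experiment reports; the paper's actual agreement metric may differ.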

Key facts

  • Visual Aesthetic Benchmark (VAB) introduced
  • Evaluates MLLMs on aesthetic judgment
  • Study published on arXiv
  • Eight expert annotators participated
  • Score-derived rankings align poorly with direct comparisons
  • Direct ranking yields higher inter-annotator agreement
  • VAB contains 400 tasks and 1,195 images
  • Images span fine art, photography, and illustration
  • Labels derived from consensus of ten expert annotators

Entities

Institutions

  • arXiv

Sources