ARTFEED — Contemporary Art Intelligence

New Benchmark Tests AI's Aesthetic Judgment Against Human Experts

publication · 2026-05-14

Researchers have introduced the Visual Aesthetic Benchmark (VAB), a dataset for testing whether multimodal large language models (MLLMs) can make aesthetic judgments that match those of human experts. The study, posted on arXiv, challenges the common practice of collapsing aesthetic assessment into a single score per image: in a controlled experiment with eight expert annotators, rankings derived from scores aligned poorly with the same experts' direct comparisons, while direct ranking produced significantly higher inter-annotator agreement on the best and worst images. VAB comprises 400 tasks over 1,195 images spanning fine art, photography, and illustration, with labels derived from the consensus of ten independent expert annotators, with the aim of grounding AI aesthetic evaluation in expert judgment.
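The mismatch described above can be made concrete with a small sketch: derive a ranking from per-image scores, compare it against a direct side-by-side ranking, and measure agreement with a rank correlation. The annotator data and the choice of Kendall's tau here are illustrative assumptions, not details taken from the paper:

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall rank correlation between two rankings of the same items.

    rank_a, rank_b: dicts mapping item -> rank position (1 = best).
    Returns a value in [-1, 1]; 1.0 means identical orderings.
    """
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        a = rank_a[x] - rank_a[y]
        b = rank_b[x] - rank_b[y]
        if a * b > 0:
            concordant += 1
        elif a * b < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical annotator data for five images (A-E).
# Scores assigned one image at a time, in isolation:
scores = {"A": 7, "B": 6, "C": 8, "D": 5, "E": 6}

# Ranking implied by those scores (ties broken by stable sort order):
score_rank = {img: pos + 1
              for pos, (img, _) in enumerate(
                  sorted(scores.items(), key=lambda kv: -kv[1]))}

# The same annotator's direct side-by-side ranking:
direct_rank = {"A": 1, "B": 4, "C": 2, "D": 5, "E": 3}

print(kendall_tau(score_rank, direct_rank))  # 0.6 for these made-up rankings
```

A tau well below 1.0 means the score-derived ordering disagrees with the direct comparison, which is the kind of divergence the VAB experiment reports; the paper's actual agreement metric may differ.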

Key facts

  • Visual Aesthetic Benchmark (VAB) introduced
  • Evaluates MLLMs on aesthetic judgment
  • Study published on arXiv
  • Eight expert annotators participated
  • Score-derived rankings align poorly with direct comparisons
  • Direct ranking yields higher inter-annotator agreement
  • VAB contains 400 tasks and 1,195 images
  • Images span fine art, photography, and illustration
  • Labels derived from consensus of ten expert annotators

Entities

Institutions

  • arXiv

Sources