ARTFEED — Contemporary Art Intelligence

XpertBench Introduces Rubric-Based Evaluation for LLMs on Expert-Level Tasks

ai-technology · 2026-04-22

XpertBench is a new benchmark designed to evaluate large language models on complex, open-ended tasks that reflect genuine expert-level cognition. Built to address limitations of existing frameworks, such as narrow domain coverage, reliance on generalist tasks, and self-evaluation bias, it comprises 1,346 carefully curated tasks across 80 categories spanning finance, healthcare, legal services, education, and dual-track research in STEM and the humanities.

The tasks were distilled from more than 1,000 submissions by domain experts, including researchers from elite institutions and practitioners with clinical or industrial experience, a sourcing process intended to give the benchmark strong ecological validity. Each task is scored against a detailed rubric, most with 15-40 weighted checkpoints, to support high-fidelity assessment of open-ended responses; a sketch of how such weighted-checkpoint scoring might work appears below.

The benchmark responds to plateauing LLM performance on conventional benchmarks by focusing on authentic professional domains. It was announced in arXiv:2604.02368v4, whose revised abstract outlines its technical foundation. The initiative closes a gap in evaluating LLM proficiency, emphasizing real-world applicability and expert-driven task design, and underscores ongoing efforts to push AI evaluation beyond simplistic metrics.
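
To make the rubric mechanism concrete, here is a minimal Python sketch of weighted-checkpoint scoring. The announcement does not publish XpertBench's actual aggregation formula or schema; the Checkpoint structure, the weights, and the normalized weighted-average below are illustrative assumptions, not the benchmark's implementation.

    from dataclasses import dataclass

    @dataclass
    class Checkpoint:
        """One weighted rubric criterion (hypothetical schema, not XpertBench's)."""
        description: str
        weight: float     # relative importance of this criterion
        satisfied: float  # grader judgment in [0, 1] (0 = unmet, 1 = fully met)

    def rubric_score(checkpoints: list[Checkpoint]) -> float:
        """Aggregate weighted checkpoints into a normalized score in [0, 1].

        Assumption: each XpertBench task carries roughly 15-40 such
        checkpoints; a weighted average is one plausible way to combine them.
        """
        total_weight = sum(c.weight for c in checkpoints)
        if total_weight == 0:
            raise ValueError("rubric has no positive checkpoint weights")
        return sum(c.weight * c.satisfied for c in checkpoints) / total_weight

    # Example: a toy three-checkpoint rubric for a legal-analysis task.
    rubric = [
        Checkpoint("Identifies the controlling statute", weight=3.0, satisfied=1.0),
        Checkpoint("Applies the statute to the facts", weight=5.0, satisfied=0.5),
        Checkpoint("Notes the key counterargument", weight=2.0, satisfied=0.0),
    ]
    print(f"score = {rubric_score(rubric):.2f}")  # score = 0.55

Normalizing by total weight keeps scores comparable across tasks whose rubrics differ in checkpoint count, which matters when per-task rubric sizes vary as widely as 15 to 40 items.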

Key facts

  • XpertBench is a benchmark for evaluating LLMs on expert-level tasks
  • It includes 1,346 tasks across 80 categories
  • Categories cover finance, healthcare, legal services, education, and STEM/humanities research
  • Tasks are derived from over 1,000 submissions by domain experts
  • Experts include researchers from elite institutions and practitioners with clinical/industrial experience
  • Each task uses a detailed rubric, most with 15-40 weighted checkpoints
  • It addresses plateauing LLM performance on conventional benchmarks
  • Announced in a revised abstract for arXiv:2604.02368v4
