ARTFEED — Contemporary Art Intelligence

XpertBench Introduces Rubric-Based Evaluation for LLMs on Expert-Level Tasks

ai-technology · 2026-04-22

XpertBench is a new benchmark designed to evaluate large language models on complex, open-ended tasks that reflect genuine expert-level cognition. Built to address limitations of existing frameworks, such as narrow domain coverage, reliance on generalist tasks, and self-evaluation bias, it comprises 1,346 carefully curated tasks across 80 categories spanning finance, healthcare, legal services, education, and dual-track research in STEM and the humanities.

The tasks were distilled from more than 1,000 submissions by domain experts, including researchers from elite institutions and practitioners with clinical or industrial experience, a sourcing process intended to give the benchmark strong ecological validity. Each task is scored against a detailed rubric, most with 15-40 weighted checkpoints, to support high-fidelity assessment of open-ended responses; a sketch of how such weighted-checkpoint scoring might work appears below.

The benchmark responds to plateauing LLM performance on conventional benchmarks by focusing on authentic professional domains. It was announced in arXiv:2604.02368v4, whose revised abstract outlines its technical foundation. The initiative closes a gap in evaluating LLM proficiency, emphasizing real-world applicability and expert-driven task design, and underscores ongoing efforts to push AI evaluation beyond simplistic metrics.
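
To make the rubric mechanism concrete, here is a minimal Python sketch of weighted-checkpoint scoring. The announcement does not publish XpertBench's actual aggregation formula or schema; the Checkpoint structure, the weights, and the normalized weighted-average below are illustrative assumptions, not the benchmark's implementation.

    from dataclasses import dataclass

    @dataclass
    class Checkpoint:
        """One weighted rubric criterion (hypothetical schema, not XpertBench's)."""
        description: str
        weight: float     # relative importance of this criterion
        satisfied: float  # grader judgment in [0, 1] (0 = unmet, 1 = fully met)

    def rubric_score(checkpoints: list[Checkpoint]) -> float:
        """Aggregate weighted checkpoints into a normalized score in [0, 1].

        Assumption: each XpertBench task carries roughly 15-40 such
        checkpoints; a weighted average is one plausible way to combine them.
        """
        total_weight = sum(c.weight for c in checkpoints)
        if total_weight == 0:
            raise ValueError("rubric has no positive checkpoint weights")
        return sum(c.weight * c.satisfied for c in checkpoints) / total_weight

    # Example: a toy three-checkpoint rubric for a legal-analysis task.
    rubric = [
        Checkpoint("Identifies the controlling statute", weight=3.0, satisfied=1.0),
        Checkpoint("Applies the statute to the facts", weight=5.0, satisfied=0.5),
        Checkpoint("Notes the key counterargument", weight=2.0, satisfied=0.0),
    ]
    print(f"score = {rubric_score(rubric):.2f}")  # score = 0.55

Normalizing by total weight keeps scores comparable across tasks whose rubrics differ in checkpoint count, which matters when per-task rubric sizes vary as widely as 15 to 40 items.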

Key facts

  • XpertBench is a benchmark for evaluating LLMs on expert-level tasks
  • It includes 1,346 tasks across 80 categories
  • Categories cover finance, healthcare, legal services, education, and STEM/humanities research
  • Tasks are derived from over 1,000 submissions by domain experts
  • Experts include researchers from elite institutions and practitioners with clinical/industrial experience
  • Each task uses a detailed rubric, most with 15-40 weighted checkpoints
  • It addresses plateauing LLM performance on conventional benchmarks
  • Announced in a revised abstract for arXiv:2604.02368v4
