New Benchmark Evaluates Deep Research Agents on Expert Consulting Tasks

ai-technology · 2026-05-20

A recent benchmark study published on arXiv assesses cutting-edge deep research agents (DRAs) on the structured analytical tasks that management consultants routinely perform. It tests three systems, Claude Opus 4.6, OpenAI's o3-deep-research, and Google Gemini 3.1 Pro deep-research, on 42 prompts authored by subject matter experts (SMEs), yielding 126 responses in total. Each response is checked against deterministic ground-truth verifiers, an average of 13.8 per task, and rated by SMEs on a five-criterion rubric scored from 0 to 3. The two evaluations are then combined into a Verifier-Rubric Score (VRS) on a scale from 0 to 100. The benchmark targets multi-document, decision-grade tasks that existing benchmarks overlook, and embeds cognitive traps in the prompts to discourage superficial pattern matching.
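The summary does not say how the verifier results and the rubric ratings are weighted when they are combined. As a rough illustration only, the sketch below assumes an equal-weight blend of the verifier pass rate and the normalized rubric total. The names `Response` and `vrs` and the `verifier_weight` parameter are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass

RUBRIC_CRITERIA = 5  # five-criterion rubric (from the article)
RUBRIC_MAX = 3       # each criterion is rated 0-3 (from the article)


@dataclass
class Response:
    verifier_results: list[bool]  # one pass/fail per deterministic verifier
    rubric_scores: list[int]      # one 0-3 rating per rubric criterion


def vrs(response: Response, verifier_weight: float = 0.5) -> float:
    """Blend verifier pass rate and rubric rating into a 0-100 score.

    ASSUMPTION: the 50/50 weighting is illustrative; the paper's actual
    combination rule is not described in the article.
    """
    if not response.verifier_results:
        raise ValueError("at least one verifier result is required")
    if len(response.rubric_scores) != RUBRIC_CRITERIA:
        raise ValueError(f"expected {RUBRIC_CRITERIA} rubric scores")
    if any(not 0 <= s <= RUBRIC_MAX for s in response.rubric_scores):
        raise ValueError(f"rubric scores must be between 0 and {RUBRIC_MAX}")

    # Fraction of deterministic checks the response passed.
    pass_rate = sum(response.verifier_results) / len(response.verifier_results)
    # Rubric total as a fraction of the maximum possible (5 criteria x 3 points).
    rubric_fraction = sum(response.rubric_scores) / (RUBRIC_CRITERIA * RUBRIC_MAX)

    return 100 * (verifier_weight * pass_rate
                  + (1 - verifier_weight) * rubric_fraction)


# 42 prompts answered by 3 systems gives the 126 responses reported.
total_responses = 42 * 3

# Hypothetical response: half of 14 verifiers pass, rubric is all 3s.
example = Response([True] * 7 + [False] * 7, [3, 3, 3, 3, 3])
print(total_responses, vrs(example))
```

Under this assumed weighting, a response that fails half of its verifiers cannot score above 75 even with top rubric marks. That is one way a combined score can keep fluent but factually wrong answers from ranking highly.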

Key facts

arXiv paper ID: 2605.17554
Evaluates Claude Opus 4.6, OpenAI o3-deep-research, and Google Gemini 3.1 Pro deep-research
42 SME-authored prompts used
126 total responses scored
Average 13.8 deterministic verifiers per task
Five-criterion 0-3 SME rubric applied
Verifier-Rubric Score (VRS) on 0-100 scale
Cognitive traps embedded in prompts
