New Benchmark Evaluates Deep Research Agents on Expert Consulting Tasks
A recent benchmark study published on arXiv assesses cutting-edge deep research agents (DRAs) in the context of structured analytical tasks commonly performed by management consultants. The evaluation assigns a score of 4.6 to Claude Opus, while OpenAI's o3-deep-research and Google Gemini 3.1 Pro are also tested across 42 prompts authored by experts, yielding a total of 126 responses. Each response is evaluated using deterministic ground-truth verifiers, averaging 13.8 per task, alongside a five-criterion rubric rated from 0 to 3 by subject matter experts. These evaluations are then combined into a Verifier-Rubric Score (VRS) on a scale from 0 to 100. The benchmark aims to address multi-document, decision-grade tasks overlooked by existing benchmarks, incorporating cognitive traps to discourage superficial pattern matching.
Key facts
- arXiv paper ID: 2605.17554
- Evaluates Claude Opus 4.6, OpenAI o3-deep-research, and Google Gemini 3.1 Pro deep-research
- 42 SME-authored prompts used
- 126 total responses scored
- Average 13.8 deterministic verifiers per task
- Five-criterion 0-3 SME rubric applied
- Verifier-Rubric Score (VRS) on 0-100 scale
- Cognitive traps embedded in prompts
Entities
Institutions
- arXiv