ARTFEED — Contemporary Art Intelligence

New Benchmark Evaluates Deep Research Agents on Expert Consulting Tasks

ai-technology · 2026-05-20

A recent benchmark study published on arXiv assesses cutting-edge deep research agents (DRAs) in the context of structured analytical tasks commonly performed by management consultants. The evaluation assigns a score of 4.6 to Claude Opus, while OpenAI's o3-deep-research and Google Gemini 3.1 Pro are also tested across 42 prompts authored by experts, yielding a total of 126 responses. Each response is evaluated using deterministic ground-truth verifiers, averaging 13.8 per task, alongside a five-criterion rubric rated from 0 to 3 by subject matter experts. These evaluations are then combined into a Verifier-Rubric Score (VRS) on a scale from 0 to 100. The benchmark aims to address multi-document, decision-grade tasks overlooked by existing benchmarks, incorporating cognitive traps to discourage superficial pattern matching.

Key facts

  • arXiv paper ID: 2605.17554
  • Evaluates Claude Opus 4.6, OpenAI o3-deep-research, and Google Gemini 3.1 Pro deep-research
  • 42 SME-authored prompts used
  • 126 total responses scored
  • Average 13.8 deterministic verifiers per task
  • Five-criterion 0-3 SME rubric applied
  • Verifier-Rubric Score (VRS) on 0-100 scale
  • Cognitive traps embedded in prompts

Entities

Institutions

  • arXiv

Sources