DeepWeb-Bench: New Benchmark Challenges AI Deep Research Capabilities
Researchers have introduced DeepWeb-Bench, a benchmark for evaluating the deep research capabilities of frontier language models. Its tasks require collecting large amounts of evidence from multiple web sources, reconciling that evidence across sources, and carrying out long-horizon, multi-step derivations. Difficulty is organized into four capability families: Retrieval, Derivation, Reasoning, and Calibration. Each reference answer includes a source-provenance record with four disclosure levels and, where possible, cross-source checks. The benchmark is designed to be substantially harder than existing ones, so that it can distinguish among current frontier models where existing evaluations fall short.
Key facts
- DeepWeb-Bench is a deep research benchmark for frontier language models.
- It requires massive cross-source evidence collection.
- Tasks involve long-horizon multi-step derivation.
- Four capability families: Retrieval, Derivation, Reasoning, Calibration.
- Reference answers include source-provenance records with four disclosure levels.
- Cross-source checks are available where possible.
- Designed to be substantially harder than existing benchmarks.
- Aims to distinguish capabilities of current frontier models.
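As a rough illustration of the structure described above, a reference answer could be modeled along the following lines. This is only a sketch under stated assumptions, not the benchmark's actual schema: the class and field names (`ProvenanceRecord`, `ReferenceAnswer`, `sources`, `disclosure_level`, `cross_source_checked`) are hypothetical, and because the four disclosure levels are not named here, they are represented by the placeholder integers 1 to 4.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class CapabilityFamily(Enum):
    """The four capability families listed in the key facts."""
    RETRIEVAL = "Retrieval"
    DERIVATION = "Derivation"
    REASONING = "Reasoning"
    CALIBRATION = "Calibration"


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical source-provenance record (field names are assumptions)."""
    sources: Tuple[str, ...]            # where the evidence came from
    disclosure_level: int               # placeholder 1..4; real level names unknown
    cross_source_checked: bool = False  # checks exist only where possible

    def __post_init__(self) -> None:
        # Enforce the "four disclosure levels" constraint on the placeholder scale.
        if self.disclosure_level not in (1, 2, 3, 4):
            raise ValueError("disclosure_level must be one of 1, 2, 3, 4")


@dataclass(frozen=True)
class ReferenceAnswer:
    """Hypothetical reference answer carrying its provenance record."""
    task_id: str
    family: CapabilityFamily
    answer: str
    provenance: ProvenanceRecord


# Usage: a made-up task with two made-up sources.
record = ProvenanceRecord(
    sources=("https://example.com/a", "https://example.com/b"),
    disclosure_level=2,
    cross_source_checked=True,
)
example = ReferenceAnswer("task-001", CapabilityFamily.DERIVATION, "42", record)
```

Keeping the provenance record as a separate type makes it easy to filter tasks by whether a cross-source check is available, which matters because such checks are not guaranteed for every answer.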