AutoResearchBench: A Benchmark for AI-Driven Scientific Literature Discovery
AutoResearchBench is a new benchmark for evaluating how well AI agents can autonomously discover scientific literature. It comprises two task types: Deep Research, in which the agent tracks down a specific target paper through iterative, multi-step probing, and Wide Research, in which it collects a comprehensive set of papers that meet given criteria. Unlike prior agentic web-browsing benchmarks, AutoResearchBench is research-oriented: its tasks demand in-depth comprehension of scientific concepts. The benchmark aims to advance autonomous scientific research by testing how well AI agents navigate complex literature landscapes.
Key facts
- AutoResearchBench is a benchmark for autonomous scientific literature discovery.
- It includes Deep Research and Wide Research tasks.
- Deep Research requires tracking down a specific target paper via multi-step probing.
- Wide Research involves collecting a set of papers satisfying given conditions.
- The benchmark is research-oriented, focusing on in-depth comprehension of scientific concepts.
- It differs from previous agentic web-browsing benchmarks in its research-oriented focus.
- The goal is to advance autonomous scientific research.
- The benchmark assesses AI agents' capability in scientific literature discovery.
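The difference between the two task types can be made concrete with a small sketch. Everything in it is an assumption for illustration, not the benchmark's actual data format or scoring: the class names, the field names, the paper identifiers, and the choice of exact match for Deep Research and precision/recall for Wide Research are all hypothetical. The point is only that a Deep Research answer is a single paper, while a Wide Research answer is a set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeepResearchTask:
    """Hypothetical record: a query whose answer is one target paper."""
    question: str
    target_paper_id: str


@dataclass(frozen=True)
class WideResearchTask:
    """Hypothetical record: criteria whose answer is a set of papers."""
    criteria: str
    relevant_paper_ids: frozenset


def score_deep(task, predicted_paper_id):
    """Assumed metric: 1.0 if the agent named the target paper, else 0.0."""
    return 1.0 if predicted_paper_id == task.target_paper_id else 0.0


def score_wide(task, predicted_paper_ids):
    """Assumed metric: (precision, recall) of the collected set."""
    predicted = set(predicted_paper_ids)
    if not predicted or not task.relevant_paper_ids:
        return 0.0, 0.0
    hits = len(predicted & task.relevant_paper_ids)
    precision = hits / len(predicted)
    recall = hits / len(task.relevant_paper_ids)
    return precision, recall


# Usage with made-up identifiers.
deep = DeepResearchTask("Which paper introduced method X?", "paper-123")
wide = WideResearchTask("Papers applying X to domain Y",
                        frozenset({"a", "b", "c", "d"}))

deep_score = score_deep(deep, "paper-123")
wide_precision, wide_recall = score_wide(wide, ["a", "b", "x"])
```

In this sketch a Deep Research answer is all-or-nothing, whereas a Wide Research answer is penalized both for irrelevant papers (lower precision) and for missed ones (lower recall), which mirrors the "comprehensive set" requirement described above.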