Auto-Discovery-Bench: Benchmark for Structured State Tracking in Oracle-Guided Discovery

other · 2026-06-01

Auto-Discovery-Bench is a new benchmark for assessing how well agents maintain and revise structured beliefs during interactive discovery. It uses a deterministic oracle-guided framework in which agents uncover hidden structures through repeated cycles of hypothesis, intervention, and feedback. It covers three discovery abstractions: directed graph, undirected relational, and symbolic equation discovery. Results indicate that performance degrades as the number of variables grows, trajectories lengthen, and distractors are added. A trajectory-tracking diagnostic shows that failures persist even when intervention selection and hypothesis generation are removed, pointing to limitations in maintaining and integrating long-range structured state.
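To make the hypothesis-intervention-feedback cycle concrete, the following is a minimal sketch for the directed graph case. It is not the benchmark's actual interface: the class `DirectedGraphOracle`, its methods `intervene` and `check`, the function `discover`, and the feedback format (the direct children of the intervened node) are all assumptions made for illustration.

```python
class DirectedGraphOracle:
    """Hypothetical deterministic oracle that hides a directed graph."""

    def __init__(self, edges):
        self._edges = frozenset(edges)

    def intervene(self, node):
        # Assumed feedback format: the sorted direct children of the node.
        # The same intervention always yields the same answer (deterministic).
        return sorted(dst for src, dst in self._edges if src == node)

    def check(self, hypothesis):
        # True only if the hypothesized edge set matches the hidden one exactly.
        return frozenset(hypothesis) == self._edges


def discover(oracle, nodes):
    """Naive agent: intervene on every node once and accumulate a belief."""
    belief = set()      # structured state: edges believed to exist so far
    trajectory = []     # record of (intervention, feedback) pairs
    for node in nodes:
        children = oracle.intervene(node)
        trajectory.append((node, children))
        # Revise the belief by integrating the new feedback.
        belief |= {(node, child) for child in children}
    return belief, trajectory


oracle = DirectedGraphOracle({("A", "B"), ("B", "C"), ("A", "C")})
belief, trajectory = discover(oracle, ["A", "B", "C"])
solved = oracle.check(belief)
```

In this toy setting the belief is trivially easy to maintain. The degradation described above would correspond to scaling up the number of nodes, the length of `trajectory`, and the amount of irrelevant feedback the agent has to filter.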

Key facts

Auto-Discovery-Bench is a deterministic oracle-guided diagnostic benchmark.
It involves repeated hypothesis-intervention-feedback cycles.
Three discovery abstractions: directed graph, undirected relational, symbolic equation.
Performance degrades with more variables, longer trajectories, and more distractors.
A trajectory-tracking diagnostic isolates state tracking from other capabilities.
Failures persist even without intervention selection and hypothesis generation.
These failures point to limitations in maintaining and integrating long-range structured state.
The paper is on arXiv with ID 2502.15224.
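
One way to picture a diagnostic that isolates state tracking is to hand the agent a fixed trajectory of updates and score only whether its reported state matches the ground truth after each step. The sketch below assumes a simple update format of `("add" | "remove", edge)` pairs and a per-step exact-match score. Both the format and the helper names `replay_state` and `tracking_accuracy` are hypothetical and are not taken from the paper.

```python
def replay_state(trajectory):
    """Ground-truth edge set after each step of a fixed trajectory."""
    state = set()
    snapshots = []
    for op, edge in trajectory:
        if op == "add":
            state.add(edge)
        elif op == "remove":
            state.discard(edge)
        else:
            raise ValueError(f"unknown operation: {op!r}")
        snapshots.append(frozenset(state))
    return snapshots


def tracking_accuracy(trajectory, reported):
    """Fraction of steps where the reported state equals the true state."""
    truth = replay_state(trajectory)
    if not truth:
        return 0.0
    # Steps missing from `reported` count as incorrect, since zip stops early.
    correct = sum(1 for t, r in zip(truth, reported) if t == frozenset(r))
    return correct / len(truth)


trajectory = [
    ("add", ("A", "B")),
    ("add", ("B", "C")),
    ("remove", ("A", "B")),
]
# A hypothetical agent that misses the final removal.
reported = [
    {("A", "B")},
    {("A", "B"), ("B", "C")},
    {("A", "B"), ("B", "C")},
]
accuracy = tracking_accuracy(trajectory, reported)
```

Because the trajectory is given, the agent chooses no interventions and proposes no hypotheses, so any error in `reported` reflects state tracking alone.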
