Auto-Discovery-Bench: Benchmark for Structured State Tracking in Oracle-Guided Discovery
Auto-Discovery-Bench is a diagnostic benchmark that assesses how well agents maintain and revise structured beliefs during interactive discovery. It uses a deterministic oracle-guided framework in which agents uncover hidden structure through repeated cycles of hypothesis, intervention, and feedback. It covers three discovery abstractions: directed graph, undirected relational, and symbolic equation discovery. Results indicate that performance declines as the number of variables, the trajectory length, and the number of distractors increase. A trajectory-tracking diagnostic shows that failures persist even when intervention selection and hypothesis generation are removed from the task, which points to limitations in maintaining and integrating long-range structured state.
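To make the hypothesis-intervention-feedback cycle concrete, here is a minimal sketch for the directed graph setting. It is not the benchmark's actual interface: the `GraphOracle` class, its `intervene` method, the `discover` function, and the choice of "set of downstream variables" as the feedback signal are all assumptions made for illustration.

```python
from typing import Dict, FrozenSet, Set, Tuple

Edge = Tuple[str, str]


class GraphOracle:
    """Hypothetical deterministic oracle over a hidden directed graph."""

    def __init__(self, edges: Set[Edge]) -> None:
        self._children: Dict[str, Set[str]] = {}
        for src, dst in edges:
            self._children.setdefault(src, set()).add(dst)
            self._children.setdefault(dst, set())

    @property
    def variables(self) -> FrozenSet[str]:
        return frozenset(self._children)

    def intervene(self, node: str) -> FrozenSet[str]:
        """Return every variable downstream of `node`.

        The answer depends only on the hidden graph, so repeating the
        same intervention always yields the same feedback.
        """
        seen: Set[str] = set()
        stack = list(self._children[node])
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self._children[current])
        return frozenset(seen)


def discover(oracle: GraphOracle) -> Dict[str, FrozenSet[str]]:
    """Naive agent: intervene on each variable once and record the feedback.

    The returned mapping is the agent's structured belief state, which a
    real agent would have to carry and revise across the whole trajectory.
    """
    belief: Dict[str, FrozenSet[str]] = {}
    for node in sorted(oracle.variables):
        belief[node] = oracle.intervene(node)
    return belief


# Assumed toy instance: a -> b -> c and d -> c.
oracle = GraphOracle({("a", "b"), ("b", "c"), ("d", "c")})
belief = discover(oracle)
```

In this toy instance the belief state has four entries. The difficulty the summary describes comes from keeping such a state consistent as the variable count, trajectory length, and number of distractors grow.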
Key facts
- Auto-Discovery-Bench is a deterministic oracle-guided diagnostic benchmark.
- It involves repeated hypothesis-intervention-feedback cycles.
- Three discovery abstractions: directed graph, undirected relational, symbolic equation.
- Performance degrades with more variables, longer trajectories, and more distractors.
- A trajectory-tracking diagnostic isolates state tracking from other capabilities.
- Failures persist even when intervention selection and hypothesis generation are removed from the task.
- The results point to limitations in maintaining and integrating long-range structured state.
- The paper is on arXiv with ID 2502.15224.
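The idea of isolating state tracking can be sketched as follows: the trajectory of interventions and feedback is fixed in advance, so the only remaining task is to integrate it into a final state. This is an illustrative assumption about how such a diagnostic could be scored, not the paper's protocol. The names `replay`, `keep_latest_only`, and `tracking_accuracy` are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

# One step of a fixed trajectory: (intervened variable, observed affected set).
Step = Tuple[str, FrozenSet[str]]
State = Dict[str, FrozenSet[str]]


def replay(trajectory: List[Step]) -> State:
    """Reference tracker: accumulate all feedback seen for each variable."""
    state: State = {}
    for intervened, affected in trajectory:
        state[intervened] = state.get(intervened, frozenset()) | affected
    return state


def keep_latest_only(trajectory: List[Step]) -> State:
    """Deliberately flawed tracker: overwrites earlier feedback."""
    state: State = {}
    for intervened, affected in trajectory:
        state[intervened] = affected
    return state


def tracking_accuracy(predicted: State, reference: State) -> float:
    """Fraction of variables whose tracked set matches the reference exactly."""
    keys = set(predicted) | set(reference)
    if not keys:
        return 1.0
    correct = sum(predicted.get(k) == reference.get(k) for k in keys)
    return correct / len(keys)


# Assumed toy trajectory: variable "a" is intervened on twice.
trajectory: List[Step] = [
    ("a", frozenset({"b"})),
    ("b", frozenset({"c"})),
    ("a", frozenset({"c"})),
]
reference = replay(trajectory)
score = tracking_accuracy(keep_latest_only(trajectory), reference)
```

Because the interventions are given, a low score here cannot be attributed to poor intervention choice or hypothesis generation. It reflects only how well earlier feedback was retained and combined.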
Entities
Institutions
- arXiv