Dr-CiK Benchmark Tests AI Agents on Real-World Forecasting
Researchers have introduced Dr-CiK, a new benchmark designed to evaluate whether AI agents can autonomously retrieve and use supporting context for time series forecasting. Existing benchmarks assume the context is already provided. Dr-CiK instead requires agents to search a document corpus, filter out irrelevant information, distill useful evidence, and then generate forecasts. Experiments pairing state-of-the-art deep research agents with forecasting methods show that high-quality context substantially improves forecasting performance, yet most agents recover only a small fraction of the ground-truth context. The benchmark aims to bridge the gap between controlled forecasting tasks and real-world scenarios, where context must be actively discovered in noisy, heterogeneous sources.
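One way to read the claim that agents recover "only a small fraction" of the ground-truth context is as a recall measure. The following minimal sketch assumes context is scored at the level of whole documents and computes set recall. The function name `context_recall`, the document IDs, and the document-level granularity are illustrative assumptions, not the benchmark's actual metric.

```python
def context_recall(retrieved_ids, ground_truth_ids):
    """Fraction of ground-truth context units that the agent retrieved.

    Assumption: context is identified by document IDs; Dr-CiK may score
    context differently (e.g. at passage level or with soft matching).
    """
    gt = set(ground_truth_ids)
    if not gt:
        raise ValueError("ground-truth context must be non-empty")
    # Count only ground-truth documents that also appear in the agent's retrievals.
    return len(gt & set(retrieved_ids)) / len(gt)


# Hypothetical example: the agent surfaces 5 documents, only 1 of which is relevant.
retrieved = ["doc_03", "doc_17", "doc_22", "doc_41", "doc_58"]
ground_truth = ["doc_17", "doc_64", "doc_80", "doc_91"]
print(context_recall(retrieved, ground_truth))  # 0.25
```

Recall alone does not penalize the irrelevant documents an agent pulls in. A benchmark that also rewards filtering would typically pair it with a precision-style measure.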
Key facts
- Dr-CiK is a benchmark for evaluating context-aided forecasting agents.
- It requires agents to retrieve, filter, distill, and use context from a document corpus.
- Existing benchmarks assume supporting context is already provided.
- High-quality context substantially improves forecasting performance in Dr-CiK.
- Most deep research agents recover only a small fraction of ground-truth context.
- The benchmark addresses real-world forecasting where context must be actively discovered.
- State-of-the-art deep research and forecasting methods were evaluated.
- The study is available on arXiv under ID 2605.27904.
Entities
Institutions
- arXiv