Bench to the Future 2: New Benchmark for Forecasting Agent Reasoning

ai-technology · 2026-04-30

Bench to the Future 2 (BTF-2) is a new benchmark for evaluating strategic reasoning in forecasting agents. It pairs 1,417 pastcasting questions with a frozen research corpus of 15 million documents, so agents can research and forecast offline, reproducibly, and with full reasoning traces. BTF-2 can detect accuracy differences as small as 0.004 in Brier score (where lower is better) and can separate an agent's strength in research from its strength in judgment. The researchers built a forecaster that scores 0.011 Brier better than any single frontier agent and used it to assess strategic reasoning without hindsight bias. This stronger forecaster differs mainly in its pre-mortem analysis of blind spots and its consideration of black swan events. Expert human forecasters found that the dominant strategic reasoning failures of frontier agents lie in assessing the incentives of political and business leaders and in judging how likely those leaders are to follow through on stated commitments.
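To make the Brier figures above concrete, here is a minimal sketch of how a Brier score gap between two forecasters can be computed. It assumes binary questions scored with the standard mean squared error between forecast probability and outcome; the function name `brier_score` and the toy forecasts are illustrative assumptions, not BTF-2's actual scoring code or data.

```python
from statistics import mean


def brier_score(probabilities, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    Lower is better: 0.0 is a perfect forecaster, and always answering 0.5
    scores 0.25 on binary questions.
    """
    if len(probabilities) != len(outcomes):
        raise ValueError("need exactly one forecast per resolved question")
    return mean((p - o) ** 2 for p, o in zip(probabilities, outcomes))


# Hypothetical toy data: four resolved questions and two agents' forecasts.
outcomes = [1, 0, 1, 0]
agent_a = [0.8, 0.3, 0.6, 0.1]
agent_b = [0.7, 0.4, 0.6, 0.2]

score_a = brier_score(agent_a, outcomes)
score_b = brier_score(agent_b, outcomes)

# A positive gap means agent_a is the more accurate forecaster.
gap = score_b - score_a
print(f"agent_a={score_a:.4f} agent_b={score_b:.4f} gap={gap:.4f}")
```

On this scale, gaps such as 0.004 or 0.011 are small, and four questions are far too few to measure them; telling differences that size apart from noise takes a large question set.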

Key facts

BTF-2 includes 1,417 pastcasting questions.
The research corpus is frozen at 15 million documents.
BTF-2 detects accuracy differences as small as 0.004 Brier score.
The best forecaster is 0.011 Brier more accurate than any single frontier agent.
The best forecaster stands out in its pre-mortem analysis of blind spots and its consideration of black swan events.
Frontier agents most often fail at assessing leaders' incentives and their likelihood of following through on stated commitments.
The benchmark allows offline, reproducible research and forecasting.
BTF-2 distinguishes between research and judgment strengths.
