Inverse Scaling: More Capable LLMs Produce Worse Forecasts on Superlinear Growth

ai-technology · 2026-05-23

A recent study posted on arXiv (2605.22672) reports inverse scaling in forecasting: more capable language models produce worse distributional forecasts on problems with superlinear growth and possible regime shifts, a combination common in finance and epidemiology. The authors introduce ForecastBench-Sim (FBSim), a contamination-free simulation benchmark, and demonstrate the effect on synthetic SIR epidemic models against a matched linear control. The failure concentrates at the upper tail, which more capable models raise to accommodate aggressive projections, while the lower tail remains stable. The pattern replicates on real-world datasets covering COVID-19, measles, housing markets, and hyperinflation. An analysis of Llama-3.1 finds that both model scale and post-training contribute to the inverse scaling, and that domain knowledge does not reliably improve calibration.
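To make the contrast between a superlinear scenario and its linear control concrete, here is a minimal sketch of the kind of synthetic series involved. It is not FBSim's generator: the function names, the daily Euler step, and every parameter value (`beta`, `gamma`, `population`, `slope`) are assumptions chosen for illustration.

```python
def simulate_sir(beta=0.3, gamma=0.1, population=10_000,
                 initial_infected=10, days=60):
    """Discrete-time SIR model (one Euler step per day).

    Returns cumulative infections for day 0..days. All defaults are
    illustrative assumptions, not values from the paper.
    """
    susceptible = float(population - initial_infected)
    infected = float(initial_infected)
    cumulative = [infected]
    for _ in range(days):
        # New infections scale with the product S * I, which is what
        # makes early growth superlinear.
        new_infections = beta * susceptible * infected / population
        recoveries = gamma * infected
        susceptible -= new_infections
        infected += new_infections - recoveries
        cumulative.append(cumulative[-1] + new_infections)
    return cumulative


def simulate_linear(slope=10.0, start=10.0, days=60):
    """Linear control: the same quantity growing by a fixed amount per day."""
    return [start + slope * t for t in range(days + 1)]


if __name__ == "__main__":
    sir = simulate_sir()
    linear = simulate_linear()
    for day in (0, 10, 20, 30):
        print(f"day {day:2d}  SIR cumulative={sir[day]:9.1f}  "
              f"linear={linear[day]:7.1f}")
```

In the SIR series the daily increment itself grows during the early phase, whereas the control adds the same amount every day. A forecaster that extrapolates early SIR growth aggressively will push its upper quantiles far above what the eventual saturation supports.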

Key facts

Inverse scaling in LLMs on forecasting problems with superlinear growth and tail risk
ForecastBench-Sim (FBSim) released as a contamination-free benchmark
Failure concentrates at the upper tail of distributional forecasts
Replicates on COVID-19, measles, housing markets, and hyperinflation datasets
Llama-3.1 study shows scale and post-training both contribute to the effect
Domain knowledge does not reliably rescue calibration
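
One way to see how an inflated upper tail hurts a distributional forecast is to score each quantile separately. The sketch below uses the pinball (quantile) loss, a standard scoring rule for quantile forecasts; the source does not say which metric the paper uses, so this choice, the helper name `pinball_loss`, and the example numbers are all assumptions.

```python
def pinball_loss(quantile, forecast, outcome):
    """Pinball (quantile) loss for a single quantile forecast.

    Lower is better. Over-prediction at a high quantile is penalised
    with weight (1 - quantile), under-prediction with weight quantile.
    """
    error = outcome - forecast
    return max(quantile * error, (quantile - 1.0) * error)


if __name__ == "__main__":
    outcome = 1_000.0  # hypothetical realised value

    # Two hypothetical forecasts that agree on the lower tail and the
    # median but differ in how far they raise the 95th percentile.
    moderate = {0.05: 800.0, 0.50: 1_000.0, 0.95: 1_300.0}
    aggressive = {0.05: 800.0, 0.50: 1_000.0, 0.95: 5_000.0}

    for q in (0.05, 0.50, 0.95):
        print(f"q={q:.2f}  "
              f"moderate={pinball_loss(q, moderate[q], outcome):7.2f}  "
              f"aggressive={pinball_loss(q, aggressive[q], outcome):7.2f}")
```

With these made-up numbers the two forecasts score identically at the 5% and 50% quantiles, and the entire difference comes from the 95% quantile. That mirrors the pattern described above, where the failure sits in the upper tail and the lower tail stays stable.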
