NumLeak: Public Benchmarks as Latent Labels in Foundation Models

ai-technology · 2026-06-01

A recent study introduces NumLeak, a measurement framework for assessing whether leading frontier LLMs recall numerical benchmarks memorized from their pretraining data rather than demonstrating genuine out-of-sample capability. The framework combines API-boundary probes of deployed models with white-box validation on an open causal LM. The models recall the Fama-French market excess return with a pooled Pearson r of 0.97 to 0.99, and stay within 0.15 on the within-25bps measure across five sibling factors. Comparable fidelity appears for U.S. unemployment, CPI inflation, and NOAA temperature. On a recent-release holdout, the parse rate collapses to 21-57%, yet r remains around 0.99 on the months the models do answer, which is the refuse-or-recall asymmetry that a memorized channel predicts. The white-box experiment reproduces the dose-response, and logprob ranking detects memorization that open-ended generation misses, suggesting that closed-API black-box probes underestimate this channel. The study concludes that when public numeric benchmarks appear in pretraining data, date-conditioned evaluations measure memorized recall rather than actual skill.
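The three quantities reported above (parse rate, Pearson r on answered months, and a within-25bps rate) can be computed from raw model replies roughly as follows. This is a minimal sketch, not the study's code: the function names, the naive number parser, the reading of "within-25bps" as the share of answered months whose error is at most 25 basis points, and every value in the example are assumptions made for illustration.

```python
import math
import re


def parse_number(reply):
    """Return the first signed decimal in a reply, or None if there is none.

    Naive on purpose: a refusal that happens to contain digits (such as a
    date) would be misread as an answer.
    """
    match = re.search(r"[-+]?\d+(?:\.\d+)?", reply)
    return float(match.group()) if match else None


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def score_probe(replies, truth_pct, tol_bps=25.0):
    """Score replies against true values given in percent."""
    parsed = [(parse_number(r), t) for r, t in zip(replies, truth_pct)]
    answered = [(p, t) for p, t in parsed if p is not None]
    preds = [p for p, _ in answered]
    actual = [t for _, t in answered]
    tol_pct = tol_bps / 100.0  # 1 basis point = 0.01 percentage points
    return {
        # Share of prompts that produced a usable number.
        "parse_rate": len(answered) / len(replies),
        # Correlation is computed on answered months only.
        "pearson_r": pearson_r(preds, actual),
        # Assumed meaning of "within-25bps": share of answers within tolerance.
        "within_tol": sum(abs(p - t) <= tol_pct for p, t in answered) / len(answered),
    }


# Hypothetical replies and true values, invented for this example.
replies = ["1.25", "I don't know.", "-0.80", "about 2.10 percent", "I cannot answer that."]
truth_pct = [1.30, 0.45, -0.75, 2.60, -1.10]
result = score_probe(replies, truth_pct)
print(result)
```

Because the correlation is computed only over months that parse, a model can refuse most prompts and still score a very high r on the remainder, which is why parse rate and r need to be read together.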

Key facts

NumLeak is a measurement framework for detecting memorization in LLMs.
Top-tier frontier LLMs recall the Fama-French market excess return with pooled Pearson r = 0.97-0.99.
Models stay within 0.15 on the within-25bps measure across five sibling factors.
Comparable fidelity appears on U.S. unemployment, CPI inflation, and NOAA temperature.
On a recent-release holdout, parse rate collapses to 21-57% but r remains ~0.99 on answered months.
This refuse-or-recall asymmetry is what a memorized channel predicts.
A white-box experiment reproduces the dose-response.
Logprob ranking detects memorization that open-ended generation misses.
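
One way to picture the last point is to compare what a model generates with how it ranks candidate numbers by log-probability. The sketch below is illustrative only and is not the study's procedure: `rank_of_truth`, `toy_logprob`, the prompt wording, and all scores are hypothetical, and a real probe would replace `toy_logprob` with per-completion log-probabilities from an actual model.

```python
from typing import Callable, Sequence


def rank_of_truth(prompt: str, truth: str, decoys: Sequence[str],
                  logprob_fn: Callable[[str, str], float]) -> int:
    """Rank of the true value among candidates by log-probability (1 = most likely)."""
    truth_score = logprob_fn(prompt, truth)
    # Count the decoys the model finds strictly more likely than the truth.
    return 1 + sum(logprob_fn(prompt, d) > truth_score for d in decoys)


# Hypothetical scores standing in for a real model's log-probabilities.
TOY_SCORES = {
    "I don't know.": -1.0,  # an abstention is the single most likely reply
    "1.25": -2.0,           # the true value, in this made-up example
    "1.10": -6.5,
    "1.40": -7.0,
    "0.95": -8.0,
}


def toy_logprob(prompt: str, completion: str) -> float:
    """Toy scorer: ignores the prompt and looks up a fixed score."""
    return TOY_SCORES[completion]


prompt = "Market excess return for the given month, in percent:"

# Open-ended (greedy) generation would emit the most likely reply overall.
greedy_reply = max(TOY_SCORES, key=TOY_SCORES.get)

# Ranking only the numeric candidates exposes the preference for the truth.
rank = rank_of_truth(prompt, "1.25", ["1.10", "1.40", "0.95"], toy_logprob)

print(greedy_reply, rank)
```

In this toy setup the generated reply is an abstention, so a black-box probe would record no answer, while the ranking still places the true value first among the numeric candidates.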