ARTFEED — Contemporary Art Intelligence

NanoKnow Benchmark Tracks LLM Knowledge Sources

ai-technology · 2026-05-01

Researchers have introduced NanoKnow, a benchmark dataset for investigating where large language models (LLMs) get their knowledge. It partitions questions from Natural Questions and SQuAD according to whether their answers appear in the pre-training corpus of nanochat, a family of small LLMs with fully open pre-training data. That transparency lets researchers distinguish parametric knowledge (acquired during pre-training) from information picked up through other channels. Experiments across eight nanochat checkpoints show that closed-book accuracy is strongly influenced by how frequently an answer appears in the pre-training data. The work addresses the long-standing problem of understanding how LLMs store knowledge, with potential payoffs for model reliability and interpretability.
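The core partitioning step described above can be sketched as a simple string-matching pass over an open corpus. This is an illustrative approximation, not the actual NanoKnow pipeline; all function names and the normalization scheme are assumptions.

```python
# Hypothetical sketch: split QA pairs by whether the answer string
# appears anywhere in an open pre-training corpus. The real benchmark
# likely uses more careful matching; this shows only the idea.

def normalize(text):
    """Lowercase and collapse whitespace for a naive substring match."""
    return " ".join(text.lower().split())

def answer_frequency(corpus_docs, answer):
    """Count how many corpus documents contain the (normalized) answer."""
    needle = normalize(answer)
    return sum(needle in normalize(doc) for doc in corpus_docs)

def partition_questions(qa_pairs, corpus_docs):
    """Split QA pairs into in-corpus and out-of-corpus buckets."""
    seen, unseen = [], []
    for question, answer in qa_pairs:
        bucket = seen if answer_frequency(corpus_docs, answer) > 0 else unseen
        bucket.append((question, answer))
    return seen, unseen

corpus = ["Paris is the capital of France.",
          "Water boils at 100 degrees Celsius."]
qa = [("What is the capital of France?", "Paris"),
      ("Who wrote Hamlet?", "Shakespeare")]
seen, unseen = partition_questions(qa, corpus)
print(len(seen), len(unseen))  # 1 1
```

Closed-book accuracy can then be measured separately on each bucket, which is what makes the frequency effect visible.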

Key facts

  • NanoKnow is a benchmark dataset from arXiv:2602.20122.
  • It partitions questions from Natural Questions and SQuAD.
  • Splits are based on answer presence in nanochat's pre-training corpus.
  • Nanochat is a family of small LLMs with fully open pre-training data.
  • Closed-book accuracy is strongly influenced by answer frequency.
  • Experiments used eight nanochat checkpoints.
  • The research aims to understand how knowledge is encoded by LLMs.
  • Pre-training data is often a black box.
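The frequency finding in the list above suggests an analysis of the following shape: bucket questions by how often their answer occurs in the corpus, then compute closed-book accuracy per bucket. This is a hedged sketch; the bucket scheme and the sample data are illustrative, not taken from the paper.

```python
# Hypothetical sketch of the frequency-vs-accuracy analysis.
# results: list of (answer_frequency_in_corpus, model_answered_correctly).
import math

def accuracy_by_frequency(results):
    """Group results into log2-scale frequency buckets and report
    closed-book accuracy per bucket. Frequency 0 gets its own bucket."""
    buckets = {}
    for freq, correct in results:
        key = "unseen" if freq == 0 else 2 ** int(math.log2(freq))
        total, hits = buckets.get(key, (0, 0))
        buckets[key] = (total + 1, hits + int(correct))
    return {k: hits / total for k, (total, hits) in buckets.items()}

# Illustrative data only: accuracy rising with answer frequency.
results = [(0, False), (0, False),
           (3, True), (3, False),
           (40, True), (40, True)]
print(accuracy_by_frequency(results))  # {'unseen': 0.0, 2: 0.5, 32: 1.0}
```

Running the same analysis across the eight checkpoints would show how the frequency effect evolves over the course of pre-training.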

Entities

Institutions

  • arXiv

Sources