Data Ordering in LLM Pre-Training Affects Temporal Knowledge

ai-technology · 2026-05-23

A new preprint on arXiv (2605.22769) investigates how the order of training data affects the temporal grounding of large language models (LLMs). The researchers introduce a benchmark of over 7,000 temporally grounded questions to evaluate whether models correctly associate facts with their time periods. They pre-trained 6B-parameter models on temporally ordered Common Crawl snapshots and compared them to models trained with standard shuffled pre-training. The sequentially trained models matched the shuffled baselines on general language understanding while exhibiting more up-to-date and temporally precise knowledge. The work highlights the importance of data temporality in LLM pre-training.
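The difference between the two data orderings can be illustrated with a minimal sketch. This is not the study's pipeline: the snapshot labels, document names, and the choice to shuffle within each snapshot are all assumptions made for illustration.

```python
import random

# Hypothetical snapshot labels and documents; not the study's actual data.
snapshots = {
    "2019": ["doc-2019-a", "doc-2019-b"],
    "2021": ["doc-2021-a", "doc-2021-b"],
    "2023": ["doc-2023-a", "doc-2023-b"],
}

def sequential_order(snapshots, seed=0):
    """Visit snapshots oldest to newest, shuffling only within each snapshot.

    The within-snapshot shuffle is an assumption, not a detail from the paper.
    """
    rng = random.Random(seed)
    ordered = []
    for label in sorted(snapshots):
        docs = list(snapshots[label])
        rng.shuffle(docs)
        ordered.extend(docs)
    return ordered

def shuffled_order(snapshots, seed=0):
    """Pool every snapshot and shuffle globally, as in standard pre-training."""
    rng = random.Random(seed)
    docs = [doc for label in sorted(snapshots) for doc in snapshots[label]]
    rng.shuffle(docs)
    return docs

print(sequential_order(snapshots))
print(shuffled_order(snapshots))
```

In the sequential stream the model sees the most recent snapshot last, while the shuffled stream mixes all periods throughout training.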

Key facts

arXiv paper 2605.22769 studies the impact of data temporality on LLM pre-training.
A benchmark of over 7,000 temporally grounded questions was created.
6B-parameter models pre-trained on temporally ordered Common Crawl snapshots.
Sequentially trained models matched shuffled baselines on general language understanding.
Sequential training led to more up-to-date and temporally precise knowledge.
Standard shuffled pre-training freezes knowledge at training time.
Temporal grounding of LLMs remains poorly understood.
Evaluation protocol enables analysis of fact-time period associations.
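
One way to picture an analysis of fact-time period associations is to score answers separately for each period. The sketch below is only an assumption about how such scoring could look; the question format, the invented example questions, and the `accuracy_by_period` helper are hypothetical and do not come from the paper's benchmark.

```python
from collections import defaultdict

def accuracy_by_period(questions, answer_fn):
    """Return exact-match accuracy for each time period (hypothetical helper)."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for q in questions:
        totals[q["period"]] += 1
        predicted = answer_fn(q["prompt"]).strip().lower()
        if predicted == q["answer"].strip().lower():
            correct[q["period"]] += 1
    return {period: correct[period] / totals[period] for period in sorted(totals)}

# Invented questions for illustration only; not taken from the benchmark.
questions = [
    {"period": "2019", "prompt": "Who led Team X in 2019?", "answer": "Alice"},
    {"period": "2023", "prompt": "Who led Team X in 2023?", "answer": "Bob"},
]

def stale_model(prompt):
    """Stand-in for a model whose knowledge stopped at the earlier period."""
    return "Alice"

print(accuracy_by_period(questions, stale_model))
```

A model with stale knowledge scores well on the older period and poorly on the newer one, which is the kind of gap a per-period breakdown makes visible.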
