Data Ordering in LLM Pre-Training Affects Temporal Knowledge
A new preprint posted to arXiv (2605.22769) investigates how the order of training data affects the temporal grounding of large language models (LLMs). The researchers introduced a benchmark of over 7,000 temporally grounded questions to evaluate whether models correctly associate facts with their time periods. They pre-trained 6B-parameter models on Common Crawl snapshots presented in chronological order and compared them with models trained under standard shuffled pre-training. The sequentially trained models matched the shuffled baselines on general language understanding while exhibiting more up-to-date and temporally precise knowledge. The work highlights the importance of data temporality in LLM pre-training.
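The difference between the two data orderings can be illustrated with a minimal sketch. This is not the paper's pipeline: the `Document` type, the snapshot labels, the function names, and the choice to shuffle documents within each snapshot are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    snapshot: str  # hypothetical crawl label in a sortable "YYYY-WW" form
    text: str


def shuffled_order(docs, seed=0):
    """Baseline: mix documents from all snapshots uniformly at random."""
    out = list(docs)
    random.Random(seed).shuffle(out)
    return out


def sequential_order(docs, seed=0):
    """Oldest snapshot first. Shuffling within a snapshot is an assumption."""
    rng = random.Random(seed)
    by_snapshot = {}
    for doc in docs:
        by_snapshot.setdefault(doc.snapshot, []).append(doc)
    out = []
    for snapshot in sorted(by_snapshot):  # lexical sort equals time order here
        group = by_snapshot[snapshot]
        rng.shuffle(group)
        out.extend(group)
    return out


# Toy documents, invented purely for illustration.
docs = [
    Document("2021-10", "a"),
    Document("2019-35", "b"),
    Document("2023-06", "c"),
    Document("2019-35", "d"),
]
print([d.snapshot for d in sequential_order(docs)])
# ['2019-35', '2019-35', '2021-10', '2023-06']
```

Under the sequential ordering the model sees the most recent snapshot last, whereas the shuffled baseline interleaves all periods throughout training.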
Key facts
- arXiv paper 2605.22769 studies how data temporality affects LLM pre-training.
- Benchmark of over 7,000 temporally grounded questions created.
- 6B-parameter models pre-trained on temporally ordered Common Crawl snapshots.
- Sequentially trained models matched shuffled baselines on general language understanding.
- Sequential training led to more up-to-date and temporally precise knowledge.
- Standard shuffled pre-training freezes knowledge at train time.
- Temporal grounding of LLMs remains poorly understood.
- Evaluation protocol enables analysis of fact-time period associations.
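One way to picture an analysis of fact-time period associations is per-period accuracy over time-stamped questions. The sketch below is hypothetical: the item schema (`period`, `question`, `answer`), the exact-match scoring, and the `stub_model` stand-in are assumptions, not the benchmark's actual format or protocol.

```python
from collections import defaultdict


def accuracy_by_period(items, answer_fn):
    """Return {period: fraction of questions answered correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        period = item["period"]
        total[period] += 1
        prediction = answer_fn(item["question"])
        # Case-insensitive exact match; a real protocol may score differently.
        if prediction.strip().lower() == item["answer"].strip().lower():
            correct[period] += 1
    return {p: correct[p] / total[p] for p in sorted(total)}


# Toy items, invented purely for illustration.
items = [
    {"period": 2019, "question": "Q1 (as of 2019)", "answer": "alpha"},
    {"period": 2019, "question": "Q2 (as of 2019)", "answer": "beta"},
    {"period": 2023, "question": "Q1 (as of 2023)", "answer": "gamma"},
]


def stub_model(question):
    # Stand-in for a real model call; always answers "alpha".
    return "alpha"


print(accuracy_by_period(items, stub_model))  # {2019: 0.5, 2023: 0.0}
```

Breaking accuracy down by period, rather than reporting a single score, is what would expose a model whose knowledge is accurate for older periods but stale for recent ones.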
Entities
Institutions
- arXiv
- Common Crawl