ARTFEED — Contemporary Art Intelligence

Data Ordering in LLM Pre-Training Affects Temporal Knowledge

ai-technology · 2026-05-23

A new arXiv preprint (2605.22769) investigates how the order of training data affects the temporal grounding of large language models (LLMs). The researchers introduced a benchmark of over 7,000 temporally grounded questions to evaluate whether models correctly associate facts with their time periods. They pre-trained 6B-parameter models on temporally ordered Common Crawl snapshots and compared them with models trained using standard shuffled pre-training. The sequentially trained models matched the shuffled baselines on general language understanding while exhibiting more up-to-date and temporally precise knowledge. The work highlights the importance of data temporality in LLM pre-training.
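
The difference between the two data orderings can be illustrated with a small sketch. This is not the paper's pipeline: the snapshot keys, document strings, and function names are hypothetical, and shuffling within each snapshot is an assumption made here for illustration.

```python
import random
from typing import Dict, List


def shuffled_order(snapshots: Dict[str, List[str]], seed: int = 0) -> List[str]:
    """Pool documents from every snapshot and shuffle them together
    (the standard baseline: temporal order is discarded)."""
    pooled = [doc for key in sorted(snapshots) for doc in snapshots[key]]
    random.Random(seed).shuffle(pooled)
    return pooled


def sequential_order(snapshots: Dict[str, List[str]], seed: int = 0) -> List[str]:
    """Visit snapshots from oldest to newest. Documents are shuffled only
    within a snapshot (an assumption, not a detail from the paper)."""
    rng = random.Random(seed)
    ordered: List[str] = []
    for key in sorted(snapshots):  # "YYYY-MM" keys sort chronologically
        docs = list(snapshots[key])
        rng.shuffle(docs)
        ordered.extend(docs)
    return ordered


# Hypothetical toy snapshots keyed by crawl month.
snapshots = {
    "2021-04": ["a1", "a2"],
    "2019-10": ["b1", "b2"],
    "2023-06": ["c1", "c2"],
}

sequential = sequential_order(snapshots)
shuffled = shuffled_order(snapshots)
```

In the sequential stream, every 2019 document precedes every 2021 document, which in turn precedes every 2023 document; the shuffled stream contains the same documents with no such guarantee.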

Key facts

  • arXiv paper 2605.22769 studies the impact of data temporality on LLM pre-training.
  • A benchmark of over 7,000 temporally grounded questions was created.
  • 6B-parameter models pre-trained on temporally ordered Common Crawl snapshots.
  • Sequentially trained models matched shuffled baselines on general language understanding.
  • Sequential training led to more up-to-date and temporally precise knowledge.
  • Standard shuffled pre-training freezes a model's knowledge at training time.
  • Temporal grounding of LLMs remains poorly understood.
  • Evaluation protocol enables analysis of fact-time period associations.
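
One way to picture an evaluation of fact-time period associations is to score answers separately for each time period. The sketch below is a minimal illustration under stated assumptions: the `TemporalQuestion` fields, the `temporal_accuracy` function, the placeholder questions, and the toy "stale" model are all hypothetical and do not reproduce the paper's benchmark or protocol.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class TemporalQuestion:
    prompt: str   # question text that names the time period
    year: int     # period the fact belongs to
    answer: str   # answer that was correct in that period


def temporal_accuracy(
    questions: List[TemporalQuestion],
    answer_fn: Callable[[str], str],
) -> Dict[int, float]:
    """Return exact-match accuracy per year, so stale or imprecise
    knowledge shows up as a drop for specific periods."""
    correct: Dict[int, int] = defaultdict(int)
    total: Dict[int, int] = defaultdict(int)
    for q in questions:
        total[q.year] += 1
        if answer_fn(q.prompt).strip().lower() == q.answer.strip().lower():
            correct[q.year] += 1
    return {year: correct[year] / total[year] for year in sorted(total)}


# Placeholder questions about a fictional product; not real benchmark items.
questions = [
    TemporalQuestion("Latest release of product X in 2019?", 2019, "v1"),
    TemporalQuestion("Latest release of product X in 2023?", 2023, "v3"),
]


def stale_model(prompt: str) -> str:
    """Toy stand-in for a model whose knowledge stopped at 2019."""
    return "v1"


scores = temporal_accuracy(questions, stale_model)
```

Here the toy model answers the 2019 question correctly and the 2023 question incorrectly, so the per-year breakdown exposes the outdated knowledge that a single aggregate score would blur.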

Entities

Institutions

  • arXiv
  • Common Crawl

Sources