Fill-in-the-Middle Pretraining and Verbatim Memorization Dynamics

other · 2026-05-25

A research paper on arXiv (2605.22981) examines how fill-in-the-middle (FIM) pretraining affects verbatim memorization in causal language models. The study pretrained matched Llama 3.2 models with FIM and traditional left-to-right (LTR) objectives on a FineWeb-Gutenberg corpus containing repeated Gutenberg excerpts. In prefix-based probes, FIM more often recovers short or partially matching spans, whereas LTR more often assigns high confidence to long, exact continuations. Verbatim extraction under FIM grows approximately linearly with the number of repetitions. Probes in the native FIM format show that suffix context alone is insufficient for recall, which depends primarily on prefix context. Evaluating only one span length or probing format can therefore miss important dynamics.
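
For readers unfamiliar with the two objectives and probe formats, the following is a minimal sketch of how a FIM training example and the two kinds of probe might be built. It assumes the common prefix-suffix-middle (PSM) ordering and character-level split points; the sentinel strings and all function names are hypothetical placeholders, not the paper's actual tokens or code.

```python
import random

# Hypothetical sentinel strings (assumption): the paper's real special tokens are not given here.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"


def split_document(text, rng):
    """Cut text at two random character positions into (prefix, middle, suffix)."""
    i, j = sorted(rng.randint(0, len(text)) for _ in range(2))
    return text[:i], text[i:j], text[j:]


def to_fim(prefix, middle, suffix):
    """Arrange the pieces in PSM order so the middle is predicted last."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


def prefix_probe(prefix):
    """Prefix-based probe: the model sees only the left context and continues it."""
    return prefix


def fim_probe(prefix, suffix):
    """Native FIM-format probe: the model sees both contexts and fills the middle."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


if __name__ == "__main__":
    rng = random.Random(0)
    prefix, middle, suffix = split_document("It was the best of times.", rng)
    print(to_fim(prefix, middle, suffix))
    print(fim_probe(prefix, suffix))
```

An LTR model would be trained on the unmodified text, so only the prefix-based probe applies to both objectives. Passing an empty prefix to `fim_probe` gives one way to test recall from suffix context alone under these assumptions.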

Key facts

Study examines memorization dynamics of fill-in-the-middle (FIM) pretraining
Matched Llama 3.2 models pretrained with FIM and left-to-right (LTR) objectives
Corpus: FineWeb-Gutenberg with repeated Gutenberg excerpts
FIM more often recovers short or partially matching spans in prefix-based probes
LTR more often assigns high confidence to long exact continuations
Verbatim extraction under FIM grows approximately linearly with repetitions
Native FIM-format probes show suffix context alone is insufficient for recall; recall depends primarily on prefix context
Evaluating only one span length or probing format can miss important dynamics
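
The last point can be illustrated with a small sketch of a span-length-aware extraction metric. It assumes that a match is measured as the number of leading tokens a generated continuation shares with the reference; the function names and toy token IDs are hypothetical and are not taken from the paper.

```python
def matched_prefix_len(generated, reference):
    """Count how many leading tokens of the generation equal the reference."""
    n = 0
    for g, r in zip(generated, reference):
        if g != r:
            break
        n += 1
    return n


def extraction_rates(pairs, span_lengths):
    """For each span length k, the fraction of (generated, reference) pairs
    whose leading match is at least k tokens long."""
    lengths = [matched_prefix_len(g, r) for g, r in pairs]
    return {k: sum(n >= k for n in lengths) / len(lengths) for k in span_lengths}


if __name__ == "__main__":
    # Toy token IDs (made up): one exact continuation, one partial match.
    pairs = [
        ([1, 2, 3, 4], [1, 2, 3, 4]),
        ([1, 2, 9, 9], [1, 2, 3, 4]),
    ]
    print(extraction_rates(pairs, span_lengths=[2, 4]))
```

On this toy input, both generations count as extracted at span length 2 but only one at span length 4, so a report at a single length would hide the difference between partial and exact recall.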
