Fill-in-the-Middle Pretraining and Verbatim Memorization Dynamics
A research paper on arXiv (2605.22981) examines how fill-in-the-middle (FIM) pretraining affects verbatim memorization in causal language models. The authors pretrained matched Llama 3.2 models with FIM and with traditional left-to-right (LTR) objectives on a FineWeb-Gutenberg corpus containing repeated Gutenberg excerpts. Under prefix-based probes, FIM more often recovers short or partially matching spans, whereas LTR more often assigns high confidence to long, exact continuations. Verbatim extraction under FIM grows approximately linearly with the number of repetitions. Probes in the native FIM format show that suffix context alone is insufficient for recall, which depends primarily on prefix context. Evaluating only a single span length or probing format can therefore miss important dynamics.
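For readers unfamiliar with the FIM objective, the following is a minimal sketch of how a document can be rearranged for FIM training, assuming the common prefix-suffix-middle ordering. The sentinel strings, the character-level split, and the function names `to_fim_example` and `from_fim_example` are illustrative assumptions; the summary does not state which tokens or split strategy the paper uses.

```python
import random

# Hypothetical sentinel strings (assumption); the paper's actual
# special tokens are not given in this summary.
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def to_fim_example(text, rng):
    """Cut text at two random points and reorder it as prefix, suffix, middle.

    A causal model trained on the result still predicts left to right,
    but the middle span is predicted with both prefix and suffix in context.
    """
    i, j = sorted(rng.randrange(len(text) + 1) for _ in range(2))
    prefix, middle, suffix = text[:i], text[i:j], text[j:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


def from_fim_example(example):
    """Invert to_fim_example and recover the original left-to-right text.

    Assumes the text itself does not contain any sentinel string.
    """
    rest = example[len(FIM_PREFIX):]
    prefix, rest = rest.split(FIM_SUFFIX, 1)
    suffix, middle = rest.split(FIM_MIDDLE, 1)
    return prefix + middle + suffix
```

An LTR model would see the text unchanged, so a prefix-based probe matches its training format directly, while a FIM model has mostly seen the rearranged form. This difference in training format is one reason the probing format matters when comparing the two objectives.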
Key facts
- Study examines memorization dynamics of fill-in-the-middle (FIM) pretraining
- Matched Llama 3.2 models pretrained with FIM and left-to-right (LTR) objectives
- Corpus: FineWeb-Gutenberg with repeated Gutenberg excerpts
- FIM more often recovers short or partially matching spans in prefix-based probes
- LTR more often assigns high confidence to long exact continuations
- Verbatim extraction under FIM grows approximately linearly with repetitions
- Native FIM-format probes show suffix context alone is insufficient for recall; recall depends mainly on prefix context
- Evaluating only one span length or probing format can miss important dynamics
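The last point can be illustrated with a small scoring sketch for prefix-based probes. It assumes each probe yields a generated continuation and a reference continuation as token lists, and counts a probe as a verbatim extraction at span length `span_len` when the first `span_len` tokens match exactly. The function names and this definition of extraction are assumptions for illustration, not the paper's evaluation code.

```python
def matched_prefix_len(generated, reference):
    """Number of leading tokens on which generated and reference agree."""
    n = 0
    for g, r in zip(generated, reference):
        if g != r:
            break
        n += 1
    return n


def extraction_rate(pairs, span_len):
    """Fraction of (generated, reference) pairs that match exactly
    on their first span_len tokens."""
    if not pairs:
        return 0.0
    hits = sum(
        1
        for generated, reference in pairs
        if len(reference) >= span_len
        and matched_prefix_len(generated, reference) >= span_len
    )
    return hits / len(pairs)


def rates_by_span(pairs, span_lens):
    """Extraction rate at each span length, to expose length-dependent behavior."""
    return {k: extraction_rate(pairs, k) for k in span_lens}
```

Reporting `rates_by_span` over several lengths, rather than a single `extraction_rate`, shows whether a model recovers many short spans but few long ones, or the reverse. A single threshold would collapse that distinction.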
Entities
Institutions
- arXiv