ARTFEED — Contemporary Art Intelligence

Unified Survey on Pretraining Data Exposure in LLMs

ai-technology · 2026-05-27

A new survey paper on arXiv provides the first unified framework for studying Pretraining Data Exposure (PDE) in Large Language Models (LLMs), that is, whether specific data appeared in a model's pretraining corpus. It brings together two previously separate research areas: data contamination and membership inference. The paper formalizes PDE across different exposure levels, reviews attack and defense methods, synthesizes empirical findings, and outlines open challenges and future directions. As LLMs become the dominant paradigm in NLP and training datasets grow in scale and opacity, understanding PDE is critical for ensuring evaluation integrity and protecting privacy. The survey was submitted to the Computation and Language category on arXiv.

Key facts

  • First unified survey of data contamination and membership inference under the PDE framework
  • PDE concerns whether specific data appeared in an LLM's pretraining corpus
  • Formalizes PDE across exposure levels
  • Reviews attack and defense methods
  • Synthesizes empirical findings
  • Highlights open challenges and future research directions
  • Submitted to arXiv in the Computation and Language category
  • Addresses concerns about evaluation integrity and privacy
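
To make the data contamination side of PDE concrete, here is a minimal sketch of one common style of check: measuring how many of a benchmark sample's word n-grams also occur in a reference corpus. It is not taken from the survey. The function names (`ngrams`, `overlap_ratio`), the whitespace tokenization, and the default `n=3` are all assumptions chosen for illustration, and the sketch presumes the corpus text is available, which is often not the case for real pretraining data.

```python
def ngrams(text, n):
    """Return the set of lowercase word n-grams in text (whitespace tokens)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(sample, corpus_docs, n=3):
    """Fraction of the sample's n-grams that also appear in the corpus.

    Hypothetical helper for illustration: 1.0 means every n-gram of the
    sample was found in the corpus, 0.0 means none was (or the sample is
    shorter than n tokens).
    """
    sample_grams = ngrams(sample, n)
    if not sample_grams:
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(sample_grams & corpus_grams) / len(sample_grams)


if __name__ == "__main__":
    corpus = ["the quick brown fox jumps over the lazy dog"]
    print(overlap_ratio("quick brown fox jumps", corpus))
    print(overlap_ratio("the quick brown cat", corpus))
```

A high ratio suggests the sample may have been exposed, but the threshold for calling it contaminated is a judgment call. When the corpus is not accessible, membership inference methods instead work from the model's own behavior.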

Entities

Institutions

  • arXiv
