Metamorphic Testing Reveals Memorization in LLM-Based Program Repair

ai-technology · 2026-04-25

A new arXiv preprint (2604.21579) investigates data leakage in LLM-based automated program repair (APR). The researchers combine metamorphic testing (MT) with negative log-likelihood (NLL) to diagnose memorization. They construct variant benchmarks by applying semantics-preserving transformations to the Defects4J and GitBug-Java datasets. Evaluating seven LLMs on both the original and the transformed versions, they find that all of the state-of-the-art models show substantial drops in patch success rate, indicating that memorization inflates performance estimates.
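To make the NLL signal concrete, here is a minimal Python sketch using the standard definition of average negative log-likelihood per token. The summary does not give the paper's exact formulation, so this is an assumption. The function name `mean_nll` and the log-probability values are hypothetical and made up for illustration; they are not results from the study.

```python
def mean_nll(token_logprobs):
    """Average negative log-likelihood per token (natural log).

    Lower values mean the model finds the text more predictable.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / len(token_logprobs)


# Hypothetical per-token log-probabilities a model might assign to the
# same bug-fix code in its original form and in a semantics-preserving
# variant. These numbers are invented for illustration only.
original_logprobs = [-0.05, -0.10, -0.02, -0.08]
variant_logprobs = [-0.90, -1.40, -0.70, -1.10]

nll_original = mean_nll(original_logprobs)
nll_variant = mean_nll(variant_logprobs)

# A large positive gap means the original is far more predictable than
# an equivalent rewrite, which is one possible hint of memorization.
nll_gap = nll_variant - nll_original
print(f"original={nll_original:.4f} variant={nll_variant:.4f} gap={nll_gap:.4f}")
```

In practice the log-probabilities would come from the model under evaluation, and a gap on its own is only suggestive. It becomes more informative when read alongside the change in patch success rate.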

Key facts

arXiv paper 2604.21579 investigates memorization in LLM-based APR
Combines metamorphic testing with negative log-likelihood
Uses Defects4J and GitBug-Java datasets
Applies semantics-preserving transformations to create variant benchmarks
Evaluates seven LLMs on original and transformed versions
All evaluated LLMs show substantial drops in patch success rates
Data leakage inflates performance estimates in APR
Metamorphic testing helps reveal memorization
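As a rough illustration of the metamorphic idea in the facts above, the Python sketch below applies one simple semantics-preserving transformation (renaming a local variable) to a Java snippet and compares patch success rates before and after. The summary does not list the transformations the study actually uses, so the rename is an assumed example. The helper names `rename_identifier` and `success_rate` and the outcome lists are hypothetical. The regex rename is deliberately naive: it would also rewrite matches inside strings or comments, so real tooling would more likely work on a parsed syntax tree.

```python
import re


def rename_identifier(source: str, old: str, new: str) -> str:
    """Rename whole-word occurrences of an identifier in source text."""
    return re.sub(rf"\b{re.escape(old)}\b", new, source)


def success_rate(outcomes):
    """Fraction of bugs for which a generated patch passed the tests."""
    return sum(outcomes) / len(outcomes)


java_snippet = "int total = 0; for (int i = 0; i < n; i++) { total += values[i]; }"

# The variant computes the same result, only the variable name differs.
variant = rename_identifier(java_snippet, "total", "acc")
print(variant)

# Hypothetical per-bug repair outcomes (True = plausible patch found).
# Invented for illustration; these are not the paper's numbers.
outcomes_original = [True, True, True, False]
outcomes_variant = [True, False, False, False]

# A model that truly understands the bug should score about the same on
# both versions; a large drop points toward memorized benchmark content.
drop = success_rate(outcomes_original) - success_rate(outcomes_variant)
print(f"success-rate drop: {drop:.2f}")
```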
