Metamorphic Testing Reveals Memorization in LLM-Based Program Repair
A new preprint posted on arXiv (2604.21579) investigates data leakage in LLM-based automated program repair (APR). The researchers combine metamorphic testing (MT) with negative log-likelihood (NLL) to diagnose memorization. They construct variant benchmarks by applying semantics-preserving transformations to the Defects4J and GitBug-Java datasets. Evaluating seven LLMs on both the original and transformed versions, they find that every evaluated model shows a substantial drop in patch success rate on the transformed versions, which indicates that memorization inflates performance estimates.
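To make the idea of a semantics-preserving transformation concrete, here is a minimal sketch in Python. This summary does not list the transformations the paper actually applies, so identifier renaming is only an assumed, illustrative example, and the helper names `rename_identifier` and `success_rate_drop` are hypothetical. A real pipeline would most likely operate on the Java AST rather than on raw text.

```python
import re


def rename_identifier(source: str, old: str, new: str) -> str:
    """Rename whole-word occurrences of `old` to `new` in `source`.

    Toy stand-in for a semantics-preserving transformation (assumption:
    the paper's real transformations are not described in this summary).
    A text-level rename ignores strings, comments, and scoping, so it is
    only safe on small snippets like the one below.
    """
    # Refuse to rename onto a name that is already in use, which could
    # change behavior instead of preserving it.
    if re.search(rf"\b{re.escape(new)}\b", source):
        raise ValueError(f"{new!r} already appears in the source")
    return re.sub(rf"\b{re.escape(old)}\b", new, source)


def success_rate_drop(fixed_original: int, fixed_variant: int, total: int) -> float:
    """Absolute drop in patch success rate, in percentage points."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * (fixed_original - fixed_variant) / total


# Hypothetical Java snippet, not taken from Defects4J or GitBug-Java.
buggy = (
    "int sum(int[] values) { int total = 0; "
    "for (int v : values) total += v; return total; }"
)
variant = rename_identifier(buggy, "total", "acc")
print(variant)
```

The variant behaves identically to the original, so a model that genuinely reasons about the bug should repair both equally well. A large gap between the two success rates is the signal the study attributes to memorization.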
Key facts
- arXiv paper 2604.21579 investigates memorization in LLM-based APR
- Combines metamorphic testing with negative log-likelihood
- Uses Defects4J and GitBug-Java datasets
- Applies semantics-preserving transformations to create variant benchmarks
- Evaluates seven LLMs on original and transformed versions
- All evaluated LLMs show substantial drops in patch success rates
- Data leakage inflates performance estimates in APR
- Metamorphic testing helps reveal memorization
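The NLL signal mentioned above can be sketched as follows. This uses the standard definition of mean negative log-likelihood over per-token log-probabilities; how the paper computes or aggregates NLL is not stated in this summary, so the function name `mean_nll` and the numbers below are purely illustrative assumptions.

```python
def mean_nll(token_logprobs):
    """Mean negative log-likelihood of a token sequence.

    `token_logprobs` holds the natural-log probability a model assigned
    to each token of a reference text (for example, a benchmark patch).
    Lower values mean the model finds the text more predictable.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / len(token_logprobs)


# Made-up log-probabilities, for illustration only (not results from the paper).
original_nll = mean_nll([-0.1, -0.2, -0.3])  # benchmark code as published
variant_nll = mean_nll([-1.0, -2.0, -3.0])   # transformed, equivalent code

print(f"original: {original_nll:.2f}, variant: {variant_nll:.2f}")
```

A markedly lower NLL on the original benchmark text than on an equivalent transformed version is consistent with the model having seen the original during training, which is why NLL can complement the drop in patch success rate as a memorization indicator.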
Entities
Institutions
- arXiv