Training Data Quality Issues Cause Code Generation Failures in LLMs
A comprehensive literature review of 114 primary studies examines how training data quality problems propagate into code generation failures in large language models. The review introduces a unified taxonomy that classifies generated code quality issues across nine dimensions and training data quality issues into code and non-code attributes. It presents a causal framework mapping 18 common propagation mechanisms and synthesizes state-of-the-art detection and mitigation techniques. The findings trace logical errors and security flaws back to imperfections in training datasets, challenging the assumption that these issues stem solely from model-level deficiencies.
Key facts
- arXiv:2605.05267v1
- 114 primary studies reviewed
- Generated code quality issues categorized across nine dimensions
- Training data quality issues categorized into code and non-code attributes
- Causal framework mapping 18 propagation mechanisms
- Root causes traced to training corpus imperfections
- Detection and mitigation strategies synthesized
- Logical bugs and security vulnerabilities linked to training data quality
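The detection and mitigation strategies the review synthesizes typically operate on the training corpus itself. As an illustration only (not a method taken from the paper), a minimal corpus filter that combines exact deduplication with a syntax-validity check, two common code-attribute quality heuristics, might look like:

```python
import ast
import hashlib

def is_syntactically_valid(snippet: str) -> bool:
    """Check that a Python snippet parses; unparseable samples
    are one kind of code-attribute quality issue."""
    try:
        ast.parse(snippet)
        return True
    except SyntaxError:
        return False

def filter_corpus(snippets):
    """Drop exact duplicates and syntactically invalid snippets."""
    seen = set()
    kept = []
    for s in snippets:
        digest = hashlib.sha256(s.encode()).hexdigest()
        if digest in seen or not is_syntactically_valid(s):
            continue
        seen.add(digest)
        kept.append(s)
    return kept

corpus = [
    "def add(a, b):\n    return a + b",
    "def add(a, b):\n    return a + b",  # exact duplicate, dropped
    "def broken(:\n    pass",            # syntax error, dropped
]
print(len(filter_corpus(corpus)))  # → 1
```

Real pipelines use far stronger signals (near-duplicate detection, static security analysis, license filtering); this sketch only shows the general shape of corpus-level filtering.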