Test Syntax Structure Affects AI Code Generation Quality
An empirical study using the SEGA three-dimensional evaluation framework analyzed more than 830 generated files across 12 models from 3 providers. Inline test syntax (Python doctests) achieved near-perfect preservation (100%) and correctness of 92-100% in AI code generation. Separated test syntax (Rust #[test] blocks), by contrast, exposed large gaps between models, with correctness ranging from 0% to 100%, and preservation did not correlate with correctness. The study (arXiv 2604.19826) compares inline and separated test formats on a d-ary heap implementation and notes that performance varies across model generations, with one model failing outright to suppress tests.
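To illustrate the inline style the study favors, here is a hypothetical Python function (not taken from the paper) whose test lives directly in the docstring as a doctest, so the test travels with the code the model must preserve:

```python
def heap_parent(i, d):
    """Return the parent index of node i in a d-ary heap.

    Inline tests: the doctest runner executes these examples
    and compares the printed output against the expected lines.

    >>> heap_parent(5, 2)
    2
    >>> heap_parent(9, 3)
    2
    """
    return (i - 1) // d


if __name__ == "__main__":
    # Running the module executes every doctest in it.
    import doctest
    doctest.testmod()
```

A separated-test format would instead place equivalent assertions in a distinct test function or file (e.g. a Rust `#[test]` block), physically apart from the implementation.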
Key facts
- arXiv paper 2604.19826
- 830+ generated files analyzed
- 12 models tested
- 3 providers involved
- SEGA three-dimensional evaluation framework used
- Inline tests (Python doctests) vs separated tests (Rust #[test] blocks)
- d-ary heap implementation used as benchmark
- Inline tests: 100% preservation, 92-100% correctness
- Separated tests: 0-100% correctness across models
- One model broke test suppression
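The benchmark task named above, a d-ary heap, can be sketched minimally. This is a hypothetical illustration of the data structure, not the paper's actual benchmark code; it uses the standard index arithmetic where the children of node i sit at positions d*i+1 through d*i+d:

```python
class DaryHeap:
    """Minimal d-ary min-heap (hypothetical sketch of the benchmark task).

    >>> h = DaryHeap(3)
    >>> for x in [5, 1, 4, 2, 3]:
    ...     h.push(x)
    >>> [h.pop() for _ in range(5)]
    [1, 2, 3, 4, 5]
    """

    def __init__(self, d):
        self.d = d        # branching factor: each node has up to d children
        self.data = []

    def push(self, item):
        self.data.append(item)
        i = len(self.data) - 1
        # Sift up: swap with the parent while the new item is smaller.
        while i > 0:
            p = (i - 1) // self.d
            if self.data[i] < self.data[p]:
                self.data[i], self.data[p] = self.data[p], self.data[i]
                i = p
            else:
                break

    def pop(self):
        top = self.data[0]
        last = self.data.pop()
        if self.data:
            self.data[0] = last
            i = 0
            # Sift down: swap with the smallest child while it is smaller.
            while True:
                first = self.d * i + 1
                if first >= len(self.data):
                    break
                c = min(range(first, min(first + self.d, len(self.data))),
                        key=self.data.__getitem__)
                if self.data[c] < self.data[i]:
                    self.data[i], self.data[c] = self.data[c], self.data[i]
                    i = c
                else:
                    break
        return top
```

The class carries its own inline doctest in the style the study measures; a separated-test variant would move that example into an external test function instead.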
Entities
Institutions
- arXiv