Test Syntax Structure Affects AI Code Generation Quality
An empirical study using the SEGA three-dimensional evaluation framework analyzed more than 830 generated files across 12 models from 3 providers. Inline test syntax (Python doctests) achieved near-perfect preservation (100%) and correctness of 92-100% in AI code generation. Separated test syntax (Rust #[test] blocks), by contrast, exposed large gaps between models, with correctness ranging from 0% to 100%, and preservation did not correlate with correctness. The study (arXiv 2604.19826) compares inline and separated test formats on a d-ary heap implementation and notes that performance varies across model generations, with one model failing outright to suppress tests.
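To illustrate the inline style the study favors, here is a hypothetical Python function (not taken from the paper) whose test lives directly in the docstring as a doctest, so the test travels with the code the model must preserve:

```python
def heap_parent(i, d):
    """Return the parent index of node i in a d-ary heap.

    Inline tests: the doctest runner executes these examples
    and compares the printed output against the expected lines.

    >>> heap_parent(5, 2)
    2
    >>> heap_parent(9, 3)
    2
    """
    return (i - 1) // d


if __name__ == "__main__":
    # Running the module executes every doctest in it.
    import doctest
    doctest.testmod()
```

A separated-test format would instead place equivalent assertions in a distinct test function or file (e.g. a Rust `#[test]` block), physically apart from the implementation.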
Key facts
- arXiv paper 2604.19826
- 830+ generated files analyzed
- 12 models tested
- 3 providers involved
- SEGA three-dimensional evaluation framework used
- Inline tests (Python doctests) vs separated tests (Rust #[test] blocks)
- d-ary heap implementation used as benchmark
- Inline tests: 100% preservation, 92-100% correctness
- Separated tests: 0-100% correctness across models
- One model broke test suppression
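The benchmark task named above, a d-ary heap, can be sketched minimally. This is a hypothetical illustration of the data structure, not the paper's actual benchmark code; it uses the standard index arithmetic where the children of node i sit at positions d*i+1 through d*i+d:

```python
class DaryHeap:
    """Minimal d-ary min-heap (hypothetical sketch of the benchmark task).

    >>> h = DaryHeap(3)
    >>> for x in [5, 1, 4, 2, 3]:
    ...     h.push(x)
    >>> [h.pop() for _ in range(5)]
    [1, 2, 3, 4, 5]
    """

    def __init__(self, d):
        self.d = d        # branching factor: each node has up to d children
        self.data = []

    def push(self, item):
        self.data.append(item)
        i = len(self.data) - 1
        # Sift up: swap with the parent while the new item is smaller.
        while i > 0:
            p = (i - 1) // self.d
            if self.data[i] < self.data[p]:
                self.data[i], self.data[p] = self.data[p], self.data[i]
                i = p
            else:
                break

    def pop(self):
        top = self.data[0]
        last = self.data.pop()
        if self.data:
            self.data[0] = last
            i = 0
            # Sift down: swap with the smallest child while it is smaller.
            while True:
                first = self.d * i + 1
                if first >= len(self.data):
                    break
                c = min(range(first, min(first + self.d, len(self.data))),
                        key=self.data.__getitem__)
                if self.data[c] < self.data[i]:
                    self.data[i], self.data[c] = self.data[c], self.data[i]
                    i = c
                else:
                    break
        return top
```

The class carries its own inline doctest in the style the study measures; a separated-test variant would move that example into an external test function instead.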
Entities
Institutions
- arXiv