ARTFEED — Contemporary Art Intelligence

Test Syntax Structure Affects AI Code Generation Quality

publication · 2026-04-24

An empirical study using the SEGA three-dimensional evaluation framework analyzed more than 830 generated files across 12 models from 3 providers. Inline test syntax (Python doctests) achieved near-perfect results: 100% preservation and 92-100% correctness. Separated test syntax (Rust #[test] blocks) exposed large gaps between models, with correctness ranging from 0% to 100%, showing that preservation does not correlate with correctness. The paper, available on arXiv (2604.19826), benchmarks both formats on a d-ary heap implementation, notes that model performance varies across repeated generations, and reports that one model conspicuously failed to suppress tests.
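The inline format the paper evaluates is Python's doctest: expected outputs are embedded directly in the docstring, so the tests travel with the code they document and survive regeneration more easily. A minimal sketch of the style, assuming a d-ary heap helper (the `heap_parent` function and its values are illustrative, not taken from the paper's benchmark):

```python
def heap_parent(i: int, d: int = 3) -> int:
    """Return the parent index of node i in a d-ary heap.

    Inline doctest examples double as executable tests:

    >>> heap_parent(4, d=3)
    1
    >>> heap_parent(0, d=3)
    0
    """
    # Root (index 0) is its own parent; every other node's
    # parent is (i - 1) // d.
    return max((i - 1) // d, 0)


if __name__ == "__main__":
    # Running the module executes the docstring examples as tests.
    import doctest
    doctest.testmod()
```

Because the examples live inside the function body, a model asked to reproduce the code tends to carry them along, which is one plausible reason for the near-perfect preservation the study reports.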

Key facts

  • arXiv paper 2604.19826
  • 830+ generated files analyzed
  • 12 models tested
  • 3 providers involved
  • SEGA three-dimensional evaluation framework used
  • Inline tests (Python doctests) vs separated tests (Rust #[test] blocks)
  • d-ary heap implementation used as benchmark
  • Inline tests: 100% preservation, 92-100% correctness
  • Separated tests: 0-100% correctness across models
  • One model broke test suppression
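The separated format, by contrast, keeps tests in a distinct block apart from the implementation. The paper uses Rust's #[test] attribute for this; the same separation can be sketched in Python with a standalone test function (all names here are illustrative, not from the paper):

```python
# Separated test style: the test lives apart from the code under test,
# analogous to a Rust #[test] block in a tests module.

def heap_child(i: int, k: int, d: int = 3) -> int:
    """Return the index of the k-th child of node i in a d-ary heap."""
    return d * i + k + 1


def test_heap_child() -> None:
    # The test block can be kept, dropped, or rewritten independently
    # of the implementation, which is where models diverge.
    assert heap_child(0, 0) == 1
    assert heap_child(1, 2, d=3) == 6


if __name__ == "__main__":
    test_heap_child()
```

Because the test block is structurally detachable, a model may preserve the implementation while silently dropping, mangling, or failing to suppress the tests, consistent with the 0-100% correctness spread the study observed.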

Entities

Institutions

  • arXiv

Sources