Execution Feedback Boosts Small Code Models More Than Pipeline Topology
A new study on arXiv (2604.21950) examines how small language models, those in the 1-3 billion parameter range, can improve at code generation. The authors ran a NEAT-inspired evolutionary search over pipeline topologies and compared the evolved structures against a basic refinement loop, evaluating on HumanEval (164 problems) and sanitized MBPP (427 problems) with all inference on a single laptop. Adding self-refinement with execution feedback improved performance by more than 4 standard deviations on both benchmarks. Refinement reliably fixed runtime errors such as NameError and SyntaxError but rarely repaired logic errors that surface as AssertionError. Notably, a 1.5 billion parameter generator outperformed larger ones when paired with a capable refiner, suggesting that for small models execution feedback matters more than pipeline structure. The core generate-execute-refine loop is sketched below.
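To make the mechanism concrete, here is a minimal sketch of such a loop, assuming a Python harness in which a candidate solution and its tests run in a subprocess and any traceback is fed back to a refiner. The `generate` and `refine` callables are hypothetical stand-ins for the generator and refiner models; the paper's actual prompts and interfaces are not reproduced here.

```python
import subprocess
import sys
import tempfile

def run_candidate(code: str, tests: str, timeout: float = 5.0):
    """Execute candidate code plus its tests in a fresh subprocess.

    Returns None on success, otherwise the captured stderr/traceback.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + tests)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "TimeoutError: execution exceeded the time limit"
    return None if result.returncode == 0 else result.stderr

def refinement_loop(problem: str, tests: str, generate, refine,
                    max_rounds: int = 3) -> str:
    """Generate a solution, then feed execution errors back to a refiner.

    `generate(problem)` and `refine(problem, code, error)` are hypothetical
    stand-ins for the generator and refiner model calls.
    """
    code = generate(problem)
    for _ in range(max_rounds):
        error = run_candidate(code, tests)
        if error is None:
            return code  # all tests passed
        # Runtime failures (NameError, SyntaxError) arrive with an
        # informative traceback; logic failures surface only as bare
        # AssertionErrors, which leaves the refiner little to go on.
        code = refine(problem, code, error)
    return code
```

Running candidates in a separate process is a deliberate choice here: it keeps crashes and infinite loops in generated code from taking down the harness, which matters when every refinement round re-executes untrusted output.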
Key facts
- Study on arXiv (2604.21950) examines code generation pipelines built from 1-3B models with execution feedback.
- Uses a NEAT-inspired evolutionary search to test pipeline structures against a simple refinement loop (see the sketch after this list).
- Evaluated on HumanEval (164 problems) and sanitized MBPP (427 problems) with local inference on a single laptop.
- Self-refinement with execution feedback improves code generation by more than 4 standard deviations on both benchmarks.
- Refinement fixes many runtime errors (NameError, SyntaxError) but rarely fixes logic errors (AssertionError).
- Generator identity mattered less than refiner capability: a 1.5B generator paired with a capable refiner outperformed larger generators.
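For intuition about the search side, below is a minimal sketch of a NEAT-inspired evolutionary loop over pipeline topologies: each genome is a sequence of stages, mutated structurally (add, remove, or swap a stage) and selected by benchmark fitness. The stage vocabulary and the `score` callable are illustrative assumptions, not the paper's actual operators or stage set.

```python
import random

# Hypothetical stage vocabulary; the paper's actual stage set may differ.
STAGES = ["plan", "generate", "execute", "refine", "vote"]

def mutate(genome: list[str]) -> list[str]:
    """NEAT-style structural mutation: add, remove, or swap a stage."""
    child = list(genome)
    op = random.choice(["add", "remove", "swap"])
    if op == "add":
        child.insert(random.randrange(len(child) + 1), random.choice(STAGES))
    elif op == "remove" and len(child) > 1:
        child.pop(random.randrange(len(child)))
    elif op == "swap" and len(child) > 1:
        i, j = random.sample(range(len(child)), 2)
        child[i], child[j] = child[j], child[i]
    return child

def evolve(score, population_size: int = 8, generations: int = 10) -> list[str]:
    """Evolve pipeline topologies, keeping the fittest half each generation.

    `score(genome)` is a stand-in for running the pipeline on a benchmark
    (e.g., HumanEval pass rate) and returning a fitness value.
    """
    population = [["generate"], ["generate", "execute", "refine"]]
    while len(population) < population_size:
        population.append(mutate(random.choice(population)))
    for _ in range(generations):
        ranked = sorted(population, key=score, reverse=True)
        survivors = ranked[: population_size // 2]
        population = survivors + [
            mutate(random.choice(survivors))
            for _ in range(population_size - len(survivors))
        ]
    return max(population, key=score)
```

This mirrors the paper's takeaway: the gains came from the execution-feedback signal inside the refinement loop rather than from the pipeline topology the search discovers.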
Entities
Institutions
- arXiv