VeriScale: Scaling Test Suites for Verifiable Code Generation

ai-technology · 2026-05-23

Researchers propose VeriScale, a framework for strengthening the benchmarks used to evaluate how well LLMs generate verifiable code. Existing benchmarks contain too few test cases and therefore overestimate model performance. VeriScale first expands test suites adversarially, then distills them into compact, discriminative sets. Applied to Verina, it yields VerinaPlus (an expansion of more than 83x) and VerinaLite (a lightweight 14x variant). Experiments across eight state-of-the-art models show improved evaluation accuracy.

Key facts

VeriScale is a framework for scaling test suites in verifiable code generation.
It addresses limitations in quantity and quality of test cases in existing benchmarks.
The framework has two stages: test-suite expansion and reduction.
VeriScale is instantiated on Verina to create VerinaPlus and VerinaLite.
VerinaPlus expands original test suites by over 83 times.
VerinaLite is a lightweight 14x variant.
Experiments were conducted across eight state-of-the-art models.
The work is published on arXiv with ID 2605.22368.
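
The two-stage idea (expand, then reduce) can be illustrated with a toy sketch. This is not VeriScale's actual algorithm, which the summary does not describe in detail. It assumes a trusted reference implementation, a handful of hypothetical buggy candidates, and a greedy set-cover heuristic for the reduction step. All names (`expand_suite`, `kill_sets`, `reduce_suite`) are made up for illustration.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Test = Tuple[int, int]  # (input, expected output)
Impl = Callable[[int], int]


def expand_suite(seed: List[Test], reference: Impl, inputs: Iterable[int]) -> List[Test]:
    """Stage 1 (assumed): add tests for new inputs, labelled by a trusted reference."""
    seen = {x for x, _ in seed}
    extra = [(x, reference(x)) for x in inputs if x not in seen]
    return seed + extra


def kill_sets(tests: List[Test], candidates: Dict[str, Impl]) -> Dict[int, Set[str]]:
    """Map each test index to the names of the candidates that test rejects."""
    return {
        i: {name for name, impl in candidates.items() if impl(x) != y}
        for i, (x, y) in enumerate(tests)
    }


def reduce_suite(tests: List[Test], candidates: Dict[str, Impl]) -> List[Test]:
    """Stage 2 (assumed): greedy set cover that keeps every rejection of the full suite."""
    kills = kill_sets(tests, candidates)
    uncovered = set().union(*kills.values())
    chosen: List[int] = []
    while uncovered:
        # Pick the test that rejects the most still-unrejected candidates.
        best = max(kills, key=lambda i: len(kills[i] & uncovered))
        chosen.append(best)
        uncovered -= kills[best]
    return [tests[i] for i in sorted(chosen)]


# Toy task: absolute value, with two buggy candidates and one correct one.
candidates: Dict[str, Impl] = {
    "identity": lambda x: x,   # wrong for negative inputs
    "negate": lambda x: -x,    # wrong for positive inputs
    "correct": abs,
}
seed_tests: List[Test] = [(0, 0)]  # too weak: every candidate passes
full_suite = expand_suite(seed_tests, abs, range(-3, 4))
lite_suite = reduce_suite(full_suite, candidates)
```

In this toy setup the single seed test accepts both buggy candidates, the expanded suite rejects them, and the reduced suite keeps that rejecting power with far fewer tests. That is the trade-off the full and lightweight benchmark variants aim for.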
