LLMs and Dafny: A Dataset for Verified Code Generation

ai-technology · 2026-04-27

Researchers have introduced the NaturalLanguage2VerifiedCode (NL2VC)-60 dataset, a set of 60 complex algorithmic problems designed to bridge the gap between natural-language problem descriptions and formally verified code. The study evaluates seven open-weight LLMs under a tiered prompting strategy (contextless, signature, and self-healing prompts), with the Dafny verifier providing iterative feedback. The work targets vacuous verification, in which a model satisfies the verifier with a trivial implementation, by enforcing rigorous formal specifications. The aim is to improve the reliability of LLM-generated code: models must synthesize both the implementation logic and provable specifications, so the benchmark exercises the transition from an informal problem description to a precise formal specification, a critical step in automated software engineering.
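To make the vacuous-verification problem concrete, here is a minimal Python sketch. It is not Dafny and is not taken from the paper: it models a specification as a runtime predicate over an input and an output, whereas a verifier such as Dafny proves properties statically for all inputs. The sorting task and every name in it (weak_spec, strong_spec, trivial_sort, real_sort, satisfies, SAMPLES) are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Callable, List

# A "specification" here is just a predicate over (input, output).
Spec = Callable[[List[int], List[int]], bool]
Impl = Callable[[List[int]], List[int]]


def is_sorted(xs: List[int]) -> bool:
    """True if xs is in non-decreasing order."""
    return all(a <= b for a, b in zip(xs, xs[1:]))


def weak_spec(inp: List[int], out: List[int]) -> bool:
    # Under-specified: only demands that the output is sorted.
    return is_sorted(out)


def strong_spec(inp: List[int], out: List[int]) -> bool:
    # Also demands that the output is a permutation of the input.
    return is_sorted(out) and Counter(inp) == Counter(out)


def trivial_sort(xs: List[int]) -> List[int]:
    # A trivial implementation: the empty list is always sorted.
    return []


def real_sort(xs: List[int]) -> List[int]:
    return sorted(xs)


def satisfies(impl: Impl, spec: Spec, samples: List[List[int]]) -> bool:
    """Check the spec on sample inputs (a stand-in for a real proof)."""
    return all(spec(s, impl(list(s))) for s in samples)


# Hypothetical sample inputs.
SAMPLES = [[3, 1, 2], [5, 5, 0], []]
```

Under these assumptions, trivial_sort passes weak_spec but fails strong_spec, while real_sort passes both. The weak specification is satisfied vacuously; tightening the specification is what rules the trivial implementation out.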

Key facts

NL2VC-60 dataset contains 60 complex algorithmic problems.
Seven open-weight LLMs were evaluated.
Tiered prompting strategy includes contextless, signature, and self-healing prompts.
Dafny verifier provides iterative feedback.
Work addresses vacuous verification in LLM code generation.
The benchmark requires LLMs to synthesize both implementation logic and provable formal specifications.
Transition from natural language to formal specification is a key challenge.
Published on arXiv under identifier 2604.22601.
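
The iterative verifier feedback listed above can be pictured as a retry loop. The sketch below is an assumption about the general shape of such a loop, not the paper's implementation: the generate and verify callables, the prompt layout, the round limit, and both stub functions are hypothetical. In practice verify would invoke the Dafny verifier and return its error messages; here it is replaced by a stub so the example runs on its own.

```python
from typing import Callable, List, Optional, Tuple

# One history entry: (generated code, verifier error messages).
Attempt = Tuple[str, List[str]]


def self_heal(
    task: str,
    generate: Callable[[str], str],
    verify: Callable[[str], List[str]],
    max_rounds: int = 3,
) -> Tuple[Optional[str], List[Attempt]]:
    """Regenerate code until the verifier reports no errors or rounds run out."""
    history: List[Attempt] = []
    prompt = task
    for _ in range(max_rounds):
        code = generate(prompt)
        errors = verify(code)
        history.append((code, errors))
        if not errors:
            return code, history
        # Assumed prompt layout: original task, last attempt, verifier output.
        prompt = (
            task
            + "\n\nPrevious attempt:\n" + code
            + "\n\nVerifier errors:\n" + "\n".join(errors)
        )
    return None, history


def stub_generate(prompt: str) -> str:
    # Hypothetical model: "repairs" its output once it sees feedback.
    return "fixed" if "Verifier errors:" in prompt else "broken"


def stub_verify(code: str) -> List[str]:
    # Hypothetical verifier: placeholder message, not real Dafny output.
    return [] if code == "fixed" else ["placeholder: proof obligation failed"]
```

With these stubs, self_heal("some task", stub_generate, stub_verify) fails on the first round, sees the feedback, and succeeds on the second. With max_rounds=1 it gives up and returns None as the code.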
