LiveFMBench: Benchmarking LLM-Agent Spec Generation for C Programs

other · 2026-05-06

A new study introduces LiveFMBench, a benchmark of 630 ACSL-annotated C programs for evaluating LLM- and agent-based formal specification generation. Of these, 360 are newly collected cases intended to mitigate data leakage. Experiments cover direct prompting, reasoning-enabled inference, and agentic pipelines, and show that naive evaluation substantially overestimates performance.

Key facts

LiveFMBench is a continuously evolving benchmark for formal specification generation.
It contains 630 ACSL-annotated C programs.
360 cases are newly collected to mitigate data leakage.
The study evaluates direct prompting, reasoning-enabled inference, and agentic pipelines.
Naive evaluation substantially overestimates performance.
The benchmark is contamination-aware.
The study focuses on C programs.
ACSL stands for ANSI/ISO C Specification Language.
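
For readers unfamiliar with ACSL, the sketch below shows roughly what an ACSL-annotated C program looks like. It is an illustrative example only and is not taken from LiveFMBench; the function name max_of_array and the smoke_test helper are hypothetical. The contract and loop annotations live in special /*@ ... */ comments, so an ordinary C compiler ignores them, while a verifier such as Frama-C can attempt to prove them. Whether a given prover discharges every annotation here is an assumption, not something the study reports.

```c
#include <assert.h>
#include <stddef.h>

/*@ requires n > 0;
    requires \valid_read(a + (0 .. n-1));
    assigns \nothing;
    ensures \forall integer i; 0 <= i < n ==> \result >= a[i];
    ensures \exists integer i; 0 <= i < n && \result == a[i];
*/
int max_of_array(const int *a, size_t n) {
    int best = a[0];
    /*@ loop invariant 1 <= i <= n;
        loop invariant \forall integer k; 0 <= k < i ==> best >= a[k];
        loop invariant \exists integer k; 0 <= k < i && best == a[k];
        loop assigns i, best;
        loop variant n - i;
    */
    for (size_t i = 1; i < n; i++) {
        if (a[i] > best) {
            best = a[i];
        }
    }
    return best;
}

/* Hypothetical runtime smoke test: ordinary testing, not a proof. */
void smoke_test(void) {
    int values[] = {3, 9, 4};
    assert(max_of_array(values, 3) == 9);
}
```

In a specification-generation task of the kind the benchmark targets, the model would be given the C code and asked to produce annotations like the requires, ensures, and loop invariant clauses above.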
