LiveFMBench: Benchmarking LLM-Agent Spec Generation for C Programs
A new study introduces LiveFMBench, a benchmark of 630 ACSL-annotated C programs for evaluating LLM- and agent-based formal specification generation. Of these, 360 cases are newly collected to mitigate data leakage. The experiments cover direct prompting, reasoning-enabled inference, and agentic pipelines, and show that naive evaluation substantially overestimates performance.
Key facts
- LiveFMBench is a continuously evolving benchmark for formal specification generation.
- It contains 630 ACSL-annotated C programs.
- 360 cases are newly collected to mitigate data leakage.
- The study evaluates direct prompting, reasoning-enabled inference, and agentic pipelines.
- Naive evaluation substantially overestimates performance.
- The benchmark is contamination-aware.
- The study focuses on C programs.
- ACSL stands for ANSI/ISO C Specification Language, the annotation language used by the Frama-C verification platform.
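
For readers unfamiliar with ACSL, the sketch below shows roughly what an ACSL-annotated C program looks like. It is an illustrative example only and is not taken from LiveFMBench: the function `array_max` and its contract are assumptions made up for this illustration. The annotations live in `/*@ ... */` comments, so an ordinary C compiler ignores them, while a verifier such as Frama-C would attempt to prove them.

```c
/* Hypothetical example, not a benchmark case: returns the largest
 * element of a non-empty array. The ACSL contract states what callers
 * must guarantee (requires) and what the function promises (ensures). */

/*@ requires n > 0;
  @ requires \valid_read(a + (0 .. n - 1));
  @ assigns \nothing;
  @ ensures \forall integer k; 0 <= k < n ==> \result >= a[k];
  @ ensures \exists integer k; 0 <= k < n && \result == a[k];
  @*/
int array_max(const int *a, int n)
{
    int best = a[0];

    /* Loop annotations: the invariants describe the state after each
     * iteration, and the variant gives a termination argument. */
    /*@ loop invariant 1 <= i <= n;
      @ loop invariant \forall integer k; 0 <= k < i ==> best >= a[k];
      @ loop invariant \exists integer k; 0 <= k < i && best == a[k];
      @ loop assigns i, best;
      @ loop variant n - i;
      @*/
    for (int i = 1; i < n; i++) {
        if (a[i] > best)
            best = a[i];
    }
    return best;
}
```

In a specification-generation task of the kind described above, the C code would be the input and annotations like these would be the expected output of the model or agent. Which annotation kinds the benchmark actually requires is not stated here.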
Entities
Institutions
- arXiv