ARTFEED — Contemporary Art Intelligence

LiveFMBench: Benchmarking LLM-Agent Spec Generation for C Programs

other · 2026-05-06

A new study introduces LiveFMBench, a benchmark of 630 ACSL-annotated C programs for evaluating how well LLMs and LLM-based agents generate formal specifications. To mitigate data leakage, 360 of the programs are newly collected. Experiments compare direct prompting, reasoning-enabled inference, and agentic pipelines, and show that naive evaluation substantially overestimates performance.

Key facts

  • LiveFMBench is a continuously evolving benchmark for formal specification generation.
  • It contains 630 ACSL-annotated C programs.
  • 360 cases are newly collected to mitigate data leakage.
  • The study evaluates direct prompting, reasoning-enabled inference, and agentic pipelines.
  • Naive evaluation substantially overestimates performance.
  • The benchmark is contamination-aware.
  • The study focuses on C programs.
  • ACSL stands for ANSI/ISO C Specification Language, a notation for writing function contracts and other formal annotations inside C source comments.
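
For readers unfamiliar with ACSL, the sketch below shows roughly what an ACSL-annotated C program looks like. It is an illustrative example only and is not taken from LiveFMBench; the function names max_of and max_of_spot_check are hypothetical. The annotations live in /*@ ... */ comments, so an ordinary C compiler ignores them, while a verifier such as Frama-C is meant to check them. The contract here has not been run through a verifier.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example, not from the benchmark.
 * The contract says: given a readable, non-empty array, the function
 * modifies nothing and returns a value that is an element of the array
 * and is greater than or equal to every element. */
/*@ requires n > 0;
  @ requires \valid_read(a + (0 .. n-1));
  @ assigns \nothing;
  @ ensures \forall integer i; 0 <= i < n ==> \result >= a[i];
  @ ensures \exists integer i; 0 <= i < n && \result == a[i];
  @*/
int max_of(const int *a, size_t n)
{
    int best = a[0];
    size_t i;

    /* Loop annotations describe what holds on every iteration;
     * a deductive verifier typically needs them to prove the contract. */
    /*@ loop invariant 1 <= i <= n;
      @ loop invariant \forall integer k; 0 <= k < i ==> best >= a[k];
      @ loop invariant \exists integer k; 0 <= k < i && best == a[k];
      @ loop assigns i, best;
      @ loop variant n - i;
      @*/
    for (i = 1; i < n; i++) {
        if (a[i] > best) {
            best = a[i];
        }
    }
    return best;
}

/* Runtime spot check on one input; a test, not a proof of the contract. */
void max_of_spot_check(void)
{
    const int sample[] = {3, 9, 2};
    assert(max_of(sample, 3) == 9);
}
```

In a specification-generation task of the kind the benchmark targets, a model would presumably receive the plain C function and be asked to produce annotations like the ones in the comments above.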

Entities

Institutions

  • arXiv
