ARTFEED — Contemporary Art Intelligence

RESTestBench: New Benchmark Evaluates LLM-Generated REST API Tests from NL Requirements

other · 2026-04-30

Researchers have unveiled RESTestBench, a benchmark for assessing how well LLM-generated REST API test cases validate functional behavior derived from natural language requirements. Traditional testing tools rely on code coverage and crash-based fault metrics, which are weak proxies for requirement-based validation. RESTestBench comprises three REST services with manually verified natural language requirements, each provided in both a precise and a vague variant, enabling controlled and reproducible evaluation. It also introduces a requirements-based mutation testing metric that measures fault-detection effectiveness per requirement, extending earlier work by Bartocci et al. Using the benchmark, the authors evaluated two approaches across several state-of-the-art LLMs: non-refinement-based generation and refinement-based generation, addressing the open need to check whether generated tests actually validate intended behavior.
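To make the distinction concrete, a requirement-derived test asserts the behavior stated in the requirement text rather than merely exercising code paths. The sketch below is hypothetical (the endpoint, requirement wording, and handler are invented, not taken from the benchmark); the handler is a toy in-process stand-in for a REST service so the example is self-contained.

```python
# Hypothetical illustration of a test case derived from an NL requirement.
# The handler below is a toy stand-in for POST /users on a REST service;
# names and the requirement text are invented for this sketch.

def create_user(payload):
    """Toy stand-in for a POST /users endpoint handler."""
    email = payload.get("email", "")
    if "@" not in email:
        return {"status": 400, "body": {"error": "invalid email"}}
    return {"status": 201, "body": {"email": email}}

# Requirement (precise variant): "POST /users returns 201 and echoes the
# email for a valid address, and 400 for a missing or malformed email."
def test_create_user_requirement():
    ok = create_user({"email": "ada@example.org"})
    assert ok["status"] == 201
    assert ok["body"]["email"] == "ada@example.org"
    assert create_user({})["status"] == 400
    assert create_user({"email": "not-an-email"})["status"] == 400

test_create_user_requirement()
print("requirement-derived test passed")
```

Note that such a test can achieve the same code coverage as a coverage-driven test yet carries more information: each assertion is traceable to a clause of the requirement, which is what the benchmark's metric rewards.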

Key facts

  • RESTestBench is a new benchmark for evaluating LLM-generated REST API test cases from NL requirements.
  • Existing metrics like code coverage and crash-based faults are weak proxies for requirement-based validation.
  • The benchmark comprises three REST services with manually verified NL requirements in precise and vague variants.
  • It introduces a requirements-based mutation testing metric extending the approach of Bartocci et al.
  • Two approaches were evaluated across multiple state-of-the-art LLMs: non-refinement-based and refinement-based generation.
  • The benchmark enables controlled and reproducible evaluation of requirement-based test generation.
  • The work addresses the gap in assessing whether generated tests validate intended functional behavior.
  • The paper is available on arXiv with ID 2604.25862.
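The requirements-based mutation metric mentioned above can be pictured as follows: faults are seeded into the service, and each requirement is scored by the fraction of those fault variants (mutants) its linked tests detect. This is a minimal hypothetical sketch of that idea, not the paper's actual implementation; the requirement, handler, and mutants are invented.

```python
# Minimal sketch (hypothetical) of a per-requirement mutation score:
# a requirement's score is the fraction of seeded faults (mutants)
# that its linked tests detect, i.e. "kill".

def correct_discount(total):
    # Requirement R1 (invented): orders of 100 or more get a 10% discount.
    return total * 0.9 if total >= 100 else total

# Hand-written mutants standing in for automatically seeded faults.
mutants = [
    lambda total: total * 0.9 if total > 100 else total,   # boundary fault
    lambda total: total * 0.8 if total >= 100 else total,  # wrong rate
    lambda total: total,                                   # discount dropped
    lambda total: round(total * 0.9, 2) if total >= 100 else total,  # behavior-preserving: survives
]

def r1_tests(handler):
    """Tests linked to requirement R1; True iff every check passes."""
    return (handler(100) == 90.0
            and handler(50) == 50
            and handler(200) == 180.0)

def requirement_mutation_score(tests, mutants):
    killed = sum(1 for m in mutants if not tests(m))
    return killed / len(mutants)

assert r1_tests(correct_discount)  # sanity: tests pass on the correct code
print(f"R1 mutation score: {requirement_mutation_score(r1_tests, mutants):.2f}")
```

The surviving fourth mutant shows why the score is informative: it measures how discriminating the requirement's tests are, independently of how much code they cover.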

Entities

Institutions

  • arXiv

Sources