RESTestBench: New Benchmark Evaluates LLM-Generated REST API Tests from NL Requirements
Researchers have introduced RESTestBench, a benchmark for assessing how effectively LLM-generated REST API test cases validate the functional behavior specified in natural language (NL) requirements. Traditional testing tools rely on code coverage and crash-based fault metrics, which are weak proxies for requirement-based validation. RESTestBench comprises three REST services with manually verified NL requirements, each provided in both a precise and a vague variant, enabling controlled and reproducible evaluation. It also introduces a requirements-based mutation testing metric that measures fault-detection effectiveness per requirement, extending prior work by Bartocci et al. The benchmark was used to evaluate two approaches across several state-of-the-art LLMs: non-refinement-based generation and refinement-based generation, addressing the open question of whether generated tests actually validate the intended behavior.
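To make the two approaches concrete, here is a minimal sketch of how non-refinement-based and refinement-based generation might differ, assuming refinement means feeding execution feedback back to the model. The `llm_complete` helper, prompt wording, round budget, and pytest-based feedback loop are illustrative assumptions, not the paper's actual protocol.

```python
import subprocess
import tempfile
from pathlib import Path

MAX_REFINEMENT_ROUNDS = 3  # assumed budget, not from the paper


def llm_complete(prompt: str) -> str:
    """Placeholder for a call to an LLM API (hypothetical)."""
    raise NotImplementedError("wire up an LLM client here")


def run_tests(test_code: str) -> tuple[bool, str]:
    """Write the generated tests to a temp file, run pytest,
    and return (all_passed, captured_output)."""
    with tempfile.TemporaryDirectory() as tmp:
        test_file = Path(tmp) / "test_generated.py"
        test_file.write_text(test_code)
        proc = subprocess.run(
            ["pytest", str(test_file), "-q"],
            capture_output=True, text=True,
        )
        return proc.returncode == 0, proc.stdout + proc.stderr


def generate_tests(requirement: str, refine: bool) -> str:
    """Generate REST API tests for one NL requirement.

    refine=False: single-shot generation (non-refinement-based).
    refine=True: feed execution output back to the model and regenerate.
    """
    prompt = f"Write pytest tests for this REST API requirement:\n{requirement}"
    test_code = llm_complete(prompt)
    if not refine:
        return test_code
    for _ in range(MAX_REFINEMENT_ROUNDS):
        ok, output = run_tests(test_code)
        if ok:
            break
        # Refinement step: show the model its previous attempt and its output.
        test_code = llm_complete(
            f"{prompt}\n\nPrevious attempt:\n{test_code}\n"
            f"Execution output:\n{output}\nFix the tests."
        )
    return test_code
```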
Key facts
- RESTestBench is a new benchmark for evaluating LLM-generated REST API test cases from NL requirements.
- Existing metrics like code coverage and crash-based faults are weak proxies for requirement-based validation.
- The benchmark comprises three REST services with manually verified NL requirements in precise and vague variants.
- It introduces a requirements-based mutation testing metric extending the approach of Bartocci et al.; a sketch of the per-requirement score follows this list.
- Two approaches were evaluated across multiple state-of-the-art LLMs: non-refinement-based and refinement-based generation.
- The benchmark enables controlled and reproducible evaluation of requirement-based test generation.
- The work addresses the gap in assessing whether generated tests validate intended functional behavior.
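The following sketch shows one plausible reading of a per-requirement mutation score: for each requirement, seed faults (mutants) into the code implementing it and measure the fraction of mutants the generated tests detect. The `mutants` and `run_test_suite` interfaces are assumptions for illustration, not the benchmark's actual API.

```python
from dataclasses import dataclass


@dataclass
class MutationResult:
    requirement_id: str
    mutants_total: int
    mutants_killed: int

    @property
    def score(self) -> float:
        """Per-requirement mutation score: fraction of mutants killed."""
        return self.mutants_killed / self.mutants_total if self.mutants_total else 0.0


def requirement_mutation_score(requirement_id, mutants, run_test_suite):
    """Compute the mutation score for one requirement.

    `mutants` is a list of service variants, each seeded with a fault in the
    code implementing this requirement; `run_test_suite(mutant)` returns True
    if the requirement's generated tests fail on the mutant (mutant killed).
    Both are hypothetical interfaces used only for this sketch.
    """
    killed = sum(1 for mutant in mutants if run_test_suite(mutant))
    return MutationResult(requirement_id, len(mutants), killed)
```

A higher score for a requirement means the generated tests are more sensitive to faults in the behavior that requirement describes, which is the requirement-level signal that coverage and crash metrics miss.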
- The paper is available on arXiv with ID 2604.25862.