RESTestBench: New Benchmark Evaluates LLM-Generated REST API Tests from NL Requirements
Researchers have introduced RESTestBench, a benchmark for assessing how effectively LLM-generated REST API test cases validate the functional behavior specified in natural language (NL) requirements. Traditional testing tools rely on code coverage and crash-based fault metrics, which are weak proxies for requirement-based validation. RESTestBench comprises three REST services with manually verified NL requirements, each provided in both a precise and a vague variant, enabling controlled and reproducible evaluation. It also introduces a requirements-based mutation testing metric that measures fault-detection effectiveness per requirement, extending prior work by Bartocci et al. The benchmark was used to evaluate two approaches across several state-of-the-art LLMs: non-refinement-based generation and refinement-based generation, addressing the open question of whether generated tests actually validate the intended behavior.
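To make the two approaches concrete, here is a minimal sketch of how non-refinement-based and refinement-based generation might differ, assuming refinement means feeding execution feedback back to the model. The `llm_complete` helper, prompt wording, round budget, and pytest-based feedback loop are illustrative assumptions, not the paper's actual protocol.

```python
import subprocess
import tempfile
from pathlib import Path

MAX_REFINEMENT_ROUNDS = 3  # assumed budget, not from the paper


def llm_complete(prompt: str) -> str:
    """Placeholder for a call to an LLM API (hypothetical)."""
    raise NotImplementedError("wire up an LLM client here")


def run_tests(test_code: str) -> tuple[bool, str]:
    """Write the generated tests to a temp file, run pytest,
    and return (all_passed, captured_output)."""
    with tempfile.TemporaryDirectory() as tmp:
        test_file = Path(tmp) / "test_generated.py"
        test_file.write_text(test_code)
        proc = subprocess.run(
            ["pytest", str(test_file), "-q"],
            capture_output=True, text=True,
        )
        return proc.returncode == 0, proc.stdout + proc.stderr


def generate_tests(requirement: str, refine: bool) -> str:
    """Generate REST API tests for one NL requirement.

    refine=False: single-shot generation (non-refinement-based).
    refine=True: feed execution output back to the model and regenerate.
    """
    prompt = f"Write pytest tests for this REST API requirement:\n{requirement}"
    test_code = llm_complete(prompt)
    if not refine:
        return test_code
    for _ in range(MAX_REFINEMENT_ROUNDS):
        ok, output = run_tests(test_code)
        if ok:
            break
        # Refinement step: show the model its previous attempt and its output.
        test_code = llm_complete(
            f"{prompt}\n\nPrevious attempt:\n{test_code}\n"
            f"Execution output:\n{output}\nFix the tests."
        )
    return test_code
```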
Key facts
- RESTestBench is a new benchmark for evaluating LLM-generated REST API test cases from NL requirements.
- Existing metrics like code coverage and crash-based faults are weak proxies for requirement-based validation.
- The benchmark comprises three REST services with manually verified NL requirements in precise and vague variants.
- It introduces a requirements-based mutation testing metric extending the approach of Bartocci et al.; a sketch of the per-requirement score follows this list.
- Two approaches were evaluated across multiple state-of-the-art LLMs: non-refinement-based and refinement-based generation.
- The benchmark enables controlled and reproducible evaluation of requirement-based test generation.
- The work addresses the gap in assessing whether generated tests validate intended functional behavior.
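The following sketch shows one plausible reading of a per-requirement mutation score: for each requirement, seed faults (mutants) into the code implementing it and measure the fraction of mutants the generated tests detect. The `mutants` and `run_test_suite` interfaces are assumptions for illustration, not the benchmark's actual API.

```python
from dataclasses import dataclass


@dataclass
class MutationResult:
    requirement_id: str
    mutants_total: int
    mutants_killed: int

    @property
    def score(self) -> float:
        """Per-requirement mutation score: fraction of mutants killed."""
        return self.mutants_killed / self.mutants_total if self.mutants_total else 0.0


def requirement_mutation_score(requirement_id, mutants, run_test_suite):
    """Compute the mutation score for one requirement.

    `mutants` is a list of service variants, each seeded with a fault in the
    code implementing this requirement; `run_test_suite(mutant)` returns True
    if the requirement's generated tests fail on the mutant (mutant killed).
    Both are hypothetical interfaces used only for this sketch.
    """
    killed = sum(1 for mutant in mutants if run_test_suite(mutant))
    return MutationResult(requirement_id, len(mutants), killed)
```

A higher score for a requirement means the generated tests are more sensitive to faults in the behavior that requirement describes, which is the requirement-level signal that coverage and crash metrics miss.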
- The paper is available on arXiv with ID 2604.25862.