PBT-Bench Benchmark Tests AI Agents on Property-Based Testing

ai-technology · 2026-05-18

Researchers have introduced PBT-Bench, a benchmark of 100 curated property-based testing problems drawn from 40 real Python libraries. Each problem injects one or more semantic bugs, 365 in total (a mean of 3.65 per problem), designed so that default-strategy random inputs almost never trigger them. To succeed, an agent must read the library's documentation, identify the relevant invariant, and write a Hypothesis @given strategy that targets the trigger region. Bugs are stratified into three difficulty levels (L1-L3), from single-constraint boundary bugs at L1 to stateful, cross-function invariants at L3. Existing benchmarks assess whether an agent can generate tests for known bugs or produce fixes; PBT-Bench instead isolates the skill specific to property-based testing: extracting a semantic invariant from documentation and devising a precise input-generation strategy. The paper is available on arXiv under identifier 2605.15229.

Key facts

PBT-Bench includes 100 curated property-based testing problems across 40 real Python libraries.
Each problem injects one or more semantic bugs, totaling 365 bugs with a mean of 3.65 per problem.
Bugs are designed so default-strategy random inputs almost never trigger them.
Agents must read library documentation, identify invariants, and specify Hypothesis @given strategies.
Bugs are stratified across three difficulty levels (L1-L3).
L1 covers single-constraint boundary bugs; L3 covers stateful, cross-function invariants.
The benchmark isolates the skill of property-based testing from other code generation tasks.
The paper is available on arXiv with identifier 2605.15229.
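
The claim that untargeted inputs almost never reach the trigger can be illustrated with a rough standard-library sketch. The trigger condition (an interval whose two endpoints are equal) and the input range are assumptions made for this example. The untargeted generator below is plain uniform sampling, not Hypothesis's own default generator, which is biased toward small and boundary values.

```python
import random

DOMAIN = 10_000  # hypothetical range for each interval endpoint


def triggers_bug(start: int, end: int) -> bool:
    """Hypothetical trigger: the injected bug fires only on an empty interval."""
    return start == end


def uniform_case(rng: random.Random) -> tuple[int, int]:
    # Untargeted: both endpoints are drawn independently.
    return rng.randrange(DOMAIN), rng.randrange(DOMAIN)


def targeted_case(rng: random.Random) -> tuple[int, int]:
    # Targeted: the relationship start == end is built into the generator.
    start = rng.randrange(DOMAIN)
    return start, start


def hit_count(make_case, trials: int, seed: int = 0) -> int:
    """Count how many generated cases satisfy the trigger condition."""
    rng = random.Random(seed)
    return sum(triggers_bug(*make_case(rng)) for _ in range(trials))


TRIALS = 100
uniform_hits = hit_count(uniform_case, TRIALS)
targeted_hits = hit_count(targeted_case, TRIALS)

# Exact chance per uniform draw: DOMAIN matching pairs out of DOMAIN**2 pairs.
exact_uniform_rate = 1 / DOMAIN

print(f"uniform hits: {uniform_hits}/{TRIALS}, targeted hits: {targeted_hits}/{TRIALS}")
print(f"exact per-draw trigger rate under uniform sampling: {exact_uniform_rate}")
```

Under these assumptions, 100 uniform draws are expected to hit the trigger about 0.01 times, while every targeted draw hits it.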
