TASTE: Automated Method Generates Challenging AI Agent Benchmarks

ai-technology · 2026-05-28

A new paper on arXiv introduces TASTE (Task Synthesis from Tool Sequence Evolution), an automatic method for generating challenging benchmarks for AI agents. As existing benchmarks like τ²-Bench become saturated, TASTE reverses the traditional task construction process by first evolving tool sequences and then instantiating them into tasks. It uses an Adaptive Contrastive n-gram model trained on LLM-judged validity signals to sample valid tool sequences with broad coverage. Representative sequences are selected via clustering, then refined into complete benchmark tasks. The method addresses the high cost and complexity of manual benchmark creation and expands the range of tool-use patterns tested.
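To make the sampling idea concrete, here is a minimal sketch of a contrastive bigram sampler. It is an illustration under stated assumptions, not the paper's model: the tool names, the smoothing constant `alpha`, the exponent `beta`, and the scoring rule (smoothed valid probability divided by smoothed invalid probability) are all hypothetical, and the "adaptive" component is not modeled. The valid and invalid example sequences stand in for labels that an LLM judge would provide.

```python
import random
from collections import Counter, defaultdict

START, END = "<s>", "</s>"


def bigram_counts(sequences):
    """Count next-tool transitions, including start and end markers."""
    counts = defaultdict(Counter)
    for seq in sequences:
        tokens = [START, *seq, END]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def contrastive_weights(valid, invalid, vocab, prev, alpha=1.0, beta=1.0):
    """Weight each candidate next token by P_valid / P_invalid**beta.

    Both probabilities use add-alpha smoothing. This scoring rule is an
    assumption made for illustration.
    """
    candidates = [*vocab, END]
    v_total = sum(valid[prev].values()) + alpha * len(candidates)
    i_total = sum(invalid[prev].values()) + alpha * len(candidates)
    weights = {}
    for tok in candidates:
        p_valid = (valid[prev][tok] + alpha) / v_total
        p_invalid = (invalid[prev][tok] + alpha) / i_total
        weights[tok] = p_valid / (p_invalid ** beta)
    return weights


def sample_sequence(valid, invalid, vocab, rng, max_len=6):
    """Sample one tool sequence token by token from the contrastive weights."""
    seq, prev = [], START
    while len(seq) < max_len:
        weights = contrastive_weights(valid, invalid, vocab, prev)
        if not seq:
            weights.pop(END)  # never emit an empty sequence
        tokens = list(weights)
        nxt = rng.choices(tokens, weights=[weights[t] for t in tokens])[0]
        if nxt == END:
            break
        seq.append(nxt)
        prev = nxt
    return seq


# Hypothetical tools and judge labels, for illustration only.
vocab = ["search", "book", "cancel"]
valid = bigram_counts([["search", "book"], ["search", "book", "cancel"]])
invalid = bigram_counts([["book", "search"], ["cancel", "book"]])

after_search = contrastive_weights(valid, invalid, vocab, "search")
sampled = sample_sequence(valid, invalid, vocab, random.Random(0))
```

In this toy setup, transitions seen in judged-valid sequences (such as `search` followed by `book`) receive more weight than transitions seen only in invalid ones, while smoothing keeps unseen transitions possible so sampling can still explore new tool-use patterns.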

Key facts

TASTE stands for Task Synthesis from Tool Sequence Evolution
Paper is on arXiv with ID 2605.28556
Existing benchmark τ²-Bench is becoming saturated
TASTE reverses the standard task construction process
Uses an Adaptive Contrastive n-gram model
Model is trained on LLM-judged validity signals
Tool sequences are sampled for broad coverage
Representative sequences are selected via clustering
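The selection step can be sketched in a similar spirit. The summary does not say which clustering algorithm TASTE uses, so the example below swaps in a simple stand-in: greedy farthest-point (k-center) selection over Jaccard distance between the sets of tools in each sequence. The function names, the distance measure, and the example sequences are assumptions made for illustration.

```python
def jaccard_distance(a, b):
    """1 minus the overlap ratio between the tool sets of two sequences."""
    sa, sb = set(a), set(b)
    union = sa | sb
    if not union:
        return 0.0
    return 1.0 - len(sa & sb) / len(union)


def select_representatives(sequences, k):
    """Pick up to k sequences that are far apart from one another.

    Starts from the first sequence, then repeatedly adds the sequence
    farthest from everything chosen so far (greedy k-center).
    """
    if not sequences or k <= 0:
        return []
    chosen = [0]
    # Distance from each sequence to its nearest chosen representative.
    nearest = [jaccard_distance(s, sequences[0]) for s in sequences]
    while len(chosen) < min(k, len(sequences)):
        idx = max(range(len(sequences)), key=nearest.__getitem__)
        if nearest[idx] == 0.0:
            break  # every remaining sequence duplicates a chosen one
        chosen.append(idx)
        nearest = [
            min(d, jaccard_distance(s, sequences[idx]))
            for d, s in zip(nearest, sequences)
        ]
    return [sequences[i] for i in chosen]


# Hypothetical sampled tool sequences, for illustration only.
candidates = [
    ["search", "book"],
    ["search", "book", "pay"],
    ["cancel", "refund"],
    ["cancel", "refund", "notify"],
]
representatives = select_representatives(candidates, k=2)
```

With these four candidates and `k=2`, the sketch keeps one booking-style sequence and one cancellation-style sequence, which is the kind of coverage a representative subset is meant to preserve.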
