ARTFEED — Contemporary Art Intelligence

TASTE: Automated Method Generates Challenging AI Agent Benchmarks

ai-technology · 2026-05-28

A new paper on arXiv introduces TASTE (Task Synthesis from Tool Sequence Evolution), an automatic method for generating challenging benchmarks for AI agents. As existing benchmarks like τ²-Bench become saturated, TASTE reverses the traditional task construction process by first evolving tool sequences and then instantiating them into tasks. It uses an Adaptive Contrastive n-gram model trained on LLM-judged validity signals to sample valid tool sequences with broad coverage. Representative sequences are selected via clustering, then refined into complete benchmark tasks. The method addresses the high cost and complexity of manual benchmark creation and expands the range of tool-use patterns tested.
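To make the sampling step concrete, here is a minimal sketch of how a contrastive n-gram sampler could work. This summary does not give the details of the paper's Adaptive Contrastive n-gram model, so everything below is an assumption for illustration: the bigram order, the smoothing rule, the `ContrastiveBigram` class, and the tool names are all hypothetical, and the `is_valid` flag stands in for an LLM judge's verdict.

```python
import random
from collections import defaultdict

START, END = "<s>", "</s>"


class ContrastiveBigram:
    """Toy bigram model with separate counts for valid and invalid sequences.

    Illustrative only; not the model described in the TASTE paper.
    """

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing
        self.valid = defaultdict(lambda: defaultdict(float))
        self.invalid = defaultdict(lambda: defaultdict(float))
        self.vocab = set()

    def update(self, sequence, is_valid):
        """Record one judged sequence (is_valid is a stand-in for the LLM judge)."""
        table = self.valid if is_valid else self.invalid
        tokens = [START, *sequence, END]
        for prev, nxt in zip(tokens, tokens[1:]):
            table[prev][nxt] += 1.0
            self.vocab.add(nxt)

    def weight(self, prev, nxt):
        """Smoothed share of (prev, nxt) observations that came from valid sequences.

        Unseen transitions get 0.5, so they remain reachable when sampling.
        """
        v = self.valid[prev].get(nxt, 0.0)
        i = self.invalid[prev].get(nxt, 0.0)
        return (v + self.smoothing) / (v + i + 2 * self.smoothing)

    def sample(self, rng, max_len=6):
        """Sample a tool sequence one step at a time, stopping at END or max_len."""
        seq, prev = [], START
        candidates = sorted(self.vocab)
        while len(seq) < max_len:
            weights = [self.weight(prev, c) for c in candidates]
            nxt = rng.choices(candidates, weights=weights, k=1)[0]
            if nxt == END:
                break
            seq.append(nxt)
            prev = nxt
        return seq


# Hypothetical judged examples; the tool names are invented for this sketch.
model = ContrastiveBigram()
model.update(["lookup_user", "search_flights", "book_flight"], is_valid=True)
model.update(["lookup_user", "get_reservation", "cancel_reservation"], is_valid=True)
model.update(["book_flight", "search_flights"], is_valid=False)
model.update(["cancel_reservation", "lookup_user"], is_valid=False)

rng = random.Random(0)
samples = [model.sample(rng) for _ in range(20)]
```

In this sketch, transitions seen in valid sequences are favored, transitions seen in invalid ones are penalized, and unseen transitions stay reachable, which is one simple way to trade validity against coverage. Calling `update` again on newly judged samples would be one way to make the model adapt over rounds, though whether the paper works this way is not stated here.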

Key facts

  • TASTE stands for Task Synthesis from Tool Sequence Evolution
  • Paper is on arXiv with ID 2605.28556
  • Existing benchmarks such as τ²-Bench are becoming saturated
  • TASTE reverses the standard task construction process
  • Uses an Adaptive Contrastive n-gram model
  • Model is trained on LLM-judged validity signals
  • Tool sequences are sampled for broad coverage
  • Representative sequences are selected via clustering
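
The selection of representative sequences can be illustrated with a small sketch. The clustering algorithm and distance measure used by TASTE are not named in this summary, so the code below swaps in a greedy farthest-point selection over Jaccard distance between bigram sets as a simple stand-in. The function names and tool names are hypothetical.

```python
def bigrams(seq):
    """Set of adjacent tool pairs in a sequence."""
    return set(zip(seq, seq[1:]))


def jaccard_distance(a, b):
    """1 minus the Jaccard similarity of the two sequences' bigram sets."""
    ga, gb = bigrams(a), bigrams(b)
    union = ga | gb
    if not union:
        return 0.0
    return 1.0 - len(ga & gb) / len(union)


def pick_representatives(sequences, k):
    """Greedy farthest-point selection: a stand-in for clustering.

    Starts from the first sequence, then repeatedly adds the sequence
    that is farthest from everything chosen so far.
    """
    if not sequences or k <= 0:
        return []
    chosen = [sequences[0]]
    while len(chosen) < min(k, len(sequences)):
        def gap(s):
            return min(jaccard_distance(s, c) for c in chosen)
        best = max(sequences, key=gap)
        if gap(best) == 0.0:
            break  # everything left duplicates a chosen sequence
        chosen.append(best)
    return chosen


# Hypothetical candidate tool sequences.
candidates = [
    ["lookup_user", "search_flights", "book_flight"],
    ["lookup_user", "search_flights", "book_flight", "send_receipt"],
    ["lookup_user", "get_reservation", "cancel_reservation"],
    ["lookup_user", "get_reservation", "update_reservation"],
]
representatives = pick_representatives(candidates, k=2)
```

With these inputs, the second pick comes from the reservation-handling group rather than the near-duplicate booking sequence, which is the kind of spread across tool-use patterns that a selection step of this sort aims for.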

Entities

Institutions

  • arXiv
