VibeSearchBench: New Benchmark for Proactive Search Agents

ai-technology · 2026-05-28

Researchers have developed VibeSearchBench, a benchmark for evaluating LLM-based agents on long-horizon proactive search tasks in practical settings. It aims to close the gap between benchmark scores and user experience: agents can perform well on traditional benchmarks yet still leave users dissatisfied. VibeSearchBench comprises 200 bilingual tasks (Chinese and English) across 20 domains, split into VibeSearch-Pro (professional) and VibeSearch-Daily (daily-life) subsets. Each task pairs a user persona with a schema-free ground-truth knowledge graph, and agents are evaluated with a progressive-disclosure user simulator and a graph-matching evaluation framework. The study tests seven leading models within the ReAct framework, alongside the OpenClaw agent.

Key facts

VibeSearchBench is a benchmark for long-horizon proactive search.
It consists of 200 bilingual tasks across 20 domains.
Tasks are split into professional and daily-life subsets.
Evaluation uses a progressive-disclosure user simulator and graph-matching.
Seven frontier models are benchmarked under the ReAct framework, alongside the OpenClaw agent.
The benchmark addresses the evaluation-experience gap in search agents.
Existing benchmarks rely on over-specified queries and single-turn interactions.
VibeSearchBench uses multi-turn dialogue to refine vague intent.
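
To make the two evaluation ideas above more concrete, here is a minimal Python sketch. It is not the benchmark's implementation: the summary does not describe how the simulator or the graph matching actually work, so everything here is an assumption for illustration. The names SimulatedUser, graph_match_f1, and normalize are hypothetical, the knowledge graph is reduced to a set of (subject, relation, object) triples, and matching is plain exact match on normalized strings.

```python
from dataclasses import dataclass

# Hypothetical representation: a knowledge graph as (subject, relation, object) triples.
Triple = tuple[str, str, str]


def normalize(triple: Triple) -> Triple:
    """Lowercase and trim each part so trivially different strings still match."""
    return tuple(part.strip().lower() for part in triple)


def graph_match_f1(predicted: set[Triple], truth: set[Triple]) -> float:
    """Toy graph-matching score: F1 over exactly matching normalized triples.

    Assumption: the real benchmark likely matches more flexibly than this.
    """
    pred = {normalize(t) for t in predicted}
    gold = {normalize(t) for t in truth}
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


@dataclass
class SimulatedUser:
    """Toy progressive-disclosure user: starts vague, reveals one detail per question."""

    opening_request: str
    hidden_details: list[str]
    revealed: int = 0

    def answer_clarifying_question(self) -> str:
        # Reveal the next hidden detail, or signal that nothing is left.
        if self.revealed >= len(self.hidden_details):
            return "I have nothing more to add."
        detail = self.hidden_details[self.revealed]
        self.revealed += 1
        return detail
```

In this sketch, an agent that never asks a clarifying question only ever sees opening_request, so it has less to work with when building the triples that graph_match_f1 scores. That is the rough intuition behind pairing multi-turn dialogue with a graph-based score.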
