π-Bench: Benchmarking Proactive AI Assistants in Long-Horizon Workflows
π-Bench is a newly introduced benchmark for evaluating proactive personal-assistant agents in long-horizon workflows. It comprises 100 multi-turn tasks spanning 5 user personas and features hidden user intents, inter-task dependencies, and continuity across sessions. The rise of agents such as OpenClaw showcases the capabilities of large language models, yet users frequently open with vague, underspecified requests that leave their real needs unstated. π-Bench therefore measures how effectively agents anticipate and meet those needs over prolonged interactions, jointly evaluating proactivity and task completion, and it fills a gap in existing evaluations, which rarely examine proactive support in extended multi-turn settings.
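To make the setup concrete, the minimal sketch below shows one plausible shape for a π-Bench-style task record and a toy scorer that grades completion and proactivity separately. The schema (field names such as `persona`, `hidden_intent`, `depends_on`) and the string-matching scorer are illustrative assumptions, not the benchmark's actual format, which the summary does not specify.

```python
from dataclasses import dataclass, field

# Hypothetical schema -- the summary does not give pi-Bench's real task format.
@dataclass
class Task:
    task_id: str
    persona: str                  # one of the 5 user personas
    turns: list[str]              # explicit multi-turn user requests
    hidden_intent: str            # need the user never states outright
    depends_on: list[str] = field(default_factory=list)  # inter-task dependencies
    session_id: str = "s1"        # groups tasks for cross-session continuity

def score(agent_actions: list[str], task: Task) -> dict[str, float]:
    """Toy scorer: completion checks that each explicit request was addressed;
    proactivity checks whether the hidden intent was anticipated. This mirrors
    the benchmark's joint evaluation of both axes, not its actual metric."""
    completed = float(all(
        any(turn.lower() in action.lower() for action in agent_actions)
        for turn in task.turns
    ))
    anticipated = float(any(
        task.hidden_intent.lower() in action.lower() for action in agent_actions
    ))
    return {"completion": completed, "proactivity": anticipated}

if __name__ == "__main__":
    t = Task("t1", "busy_parent", turns=["plan my week"], hidden_intent="book dentist")
    actions = ["Here is a draft to plan my week.",
               "You mentioned a toothache last session -- should I also book dentist slots?"]
    print(score(actions, t))  # {'completion': 1.0, 'proactivity': 1.0}
```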
Key facts
- π-Bench is a benchmark for proactive assistance.
- It includes 100 multi-turn tasks across 5 user personas.
- Tasks involve hidden user intents and inter-task dependencies.
- It evaluates cross-session continuity.
- It fills a gap in existing evaluations, which rarely test proactive support in long multi-turn settings.
- It measures agents' ability to anticipate user needs.
- The rise of agents like OpenClaw is noted.
- Users often have underspecified requests.
Entities
- π-Bench (benchmark)
- OpenClaw (AI agent)