π-Bench: Benchmarking Proactive AI Assistants in Long-Horizon Workflows
π-Bench is a newly introduced benchmark for evaluating proactive personal-assistant agents in long-horizon workflows. It comprises 100 multi-turn tasks spanning 5 user personas and features hidden user intents, inter-task dependencies, and continuity across sessions. The rise of agents such as OpenClaw showcases the capabilities of large language models, yet users frequently open with vague, underspecified requests that leave their real needs unstated. π-Bench therefore measures how effectively agents anticipate and meet those needs over prolonged interactions, jointly evaluating proactivity and task completion, and it fills a gap in existing evaluations, which rarely examine proactive support in extended multi-turn settings.
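To make the setup concrete, the minimal sketch below shows one plausible shape for a π-Bench-style task record and a toy scorer that grades completion and proactivity separately. The schema (field names such as `persona`, `hidden_intent`, `depends_on`) and the string-matching scorer are illustrative assumptions, not the benchmark's actual format, which the summary does not specify.

```python
from dataclasses import dataclass, field

# Hypothetical schema -- the summary does not give pi-Bench's real task format.
@dataclass
class Task:
    task_id: str
    persona: str                  # one of the 5 user personas
    turns: list[str]              # explicit multi-turn user requests
    hidden_intent: str            # need the user never states outright
    depends_on: list[str] = field(default_factory=list)  # inter-task dependencies
    session_id: str = "s1"        # groups tasks for cross-session continuity

def score(agent_actions: list[str], task: Task) -> dict[str, float]:
    """Toy scorer: completion checks that each explicit request was addressed;
    proactivity checks whether the hidden intent was anticipated. This mirrors
    the benchmark's joint evaluation of both axes, not its actual metric."""
    completed = float(all(
        any(turn.lower() in action.lower() for action in agent_actions)
        for turn in task.turns
    ))
    anticipated = float(any(
        task.hidden_intent.lower() in action.lower() for action in agent_actions
    ))
    return {"completion": completed, "proactivity": anticipated}

if __name__ == "__main__":
    t = Task("t1", "busy_parent", turns=["plan my week"], hidden_intent="book dentist")
    actions = ["Here is a draft to plan my week.",
               "You mentioned a toothache last session -- should I also book dentist slots?"]
    print(score(actions, t))  # {'completion': 1.0, 'proactivity': 1.0}
```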
Key facts
- π-Bench is a benchmark for proactive assistance.
- It includes 100 multi-turn tasks across 5 user personas.
- Tasks involve hidden user intents and inter-task dependencies.
- It evaluates cross-session continuity.
- It fills a gap in existing evaluations, which rarely test proactive support in long multi-turn settings.
- It measures agents' ability to anticipate user needs.
- The rise of agents like OpenClaw is noted.
- Users often have underspecified requests.
Entities
- π-Bench (benchmark)
- OpenClaw (AI agent)