SaaS-Bench: New Benchmark Tests AI Agents on Real-World Software Tasks
Researchers have introduced SaaS-Bench, a benchmark designed to evaluate computer-using agents (CUAs) on realistic professional workflows within Software-as-a-Service (SaaS) environments. The benchmark comprises 23 deployable SaaS systems across six professional domains and 106 tasks that require long-horizon execution, cross-application coordination, and domain-specific knowledge. It addresses a limitation of existing web and GUI agent benchmarks, which often rely on simplified settings or isolated tasks. CUAs extend large language models (LLMs) beyond text-based reasoning to action execution in complex environments such as web browsers and GUIs, and SaaS-Bench aims to assess their capabilities in dynamic, real-world scenarios.
Key facts
- SaaS-Bench includes 23 deployable SaaS systems across six professional domains.
- The benchmark contains 106 tasks grounded in realistic work scenarios.
- Tasks require long-horizon execution and cross-application coordination.
- Existing web and GUI agent benchmarks often rely on simplified settings.
- SaaS-Bench evaluates computer-using agents (CUAs) in dynamic system states.
- CUAs extend LLMs beyond text-based reasoning to action execution.
- SaaS environments host a large share of modern digital work.
- The benchmark covers both text-only and GUI interactions.