SaaS-Bench: New Benchmark Tests AI Agents on Real-World Software Tasks

ai-technology · 2026-05-18

Researchers have introduced SaaS-Bench, a benchmark for evaluating computer-using agents (CUAs) on realistic professional workflows in Software-as-a-Service (SaaS) environments. It comprises 23 deployable SaaS systems across six professional domains and 106 tasks that require long-horizon execution, cross-application coordination, and domain-specific knowledge. The work targets a limitation of existing web and GUI agent benchmarks, which often rely on simplified settings or isolated tasks. CUAs extend large language models (LLMs) beyond text-based reasoning to action execution in complex environments such as web browsers and GUIs, and SaaS-Bench aims to assess those capabilities in dynamic, real-world scenarios.

Key facts

SaaS-Bench includes 23 deployable SaaS systems across six professional domains.
The benchmark contains 106 tasks grounded in realistic work scenarios.
Tasks require long-horizon execution and cross-application coordination.
Existing web and GUI agent benchmarks often rely on simplified settings or isolated tasks.
SaaS-Bench evaluates computer-using agents (CUAs) in dynamic system states.
CUAs extend LLMs beyond text-based reasoning to action execution in web browsers and GUIs.
SaaS environments host a large share of modern digital work.
The benchmark covers both text-only and GUI interactions.

Sources

arXiv cs.AI — 2026-05-18