InfantAgent-Next: Multimodal AI Agent for Automated Computer Interaction
Researchers have unveiled InfantAgent-Next, a multimodal generalist agent that automates computer interaction through text, images, audio, and video. Unlike existing approaches that rely on a single large model or offer only limited modularity, the agent uses a highly modular architecture that integrates both tool-based and pure vision agents, letting different models collaborate on decoupled tasks step by step. InfantAgent-Next demonstrated its versatility on both vision-centric benchmarks such as OSWorld, where it achieved 7.27% accuracy and outperformed Claude-Computer-Use, and tool-intensive benchmarks such as GAIA and SWE-Bench. The code and evaluation scripts are publicly available.
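The architecture described above could be sketched as a coordinator that routes decoupled sub-tasks to specialized sub-agents and executes them sequentially. The following Python sketch is purely illustrative; all class and handler names are hypothetical and not taken from the InfantAgent-Next codebase.

```python
# Hypothetical sketch of a modular agent dispatcher: a coordinator
# registers specialized sub-agents (e.g., a vision agent and a
# tool-based agent) and routes each decoupled sub-task to the handler
# for its modality, executing them one at a time.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SubTask:
    kind: str      # e.g., "vision" or "tool"
    payload: str   # task description for the sub-agent


class ModularAgent:
    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[str], str]] = {}

    def register(self, kind: str, handler: Callable[[str], str]) -> None:
        # Plug in a sub-agent (model) for a given task modality.
        self._handlers[kind] = handler

    def run(self, tasks: List[SubTask]) -> List[str]:
        # Decoupled sub-tasks are solved sequentially, each by the
        # sub-agent registered for its kind.
        return [self._handlers[t.kind](t.payload) for t in tasks]


agent = ModularAgent()
agent.register("vision", lambda p: f"vision-agent handled: {p}")
agent.register("tool", lambda p: f"tool-agent handled: {p}")

results = agent.run([
    SubTask("vision", "locate the Save button"),
    SubTask("tool", "run the test script"),
])
# results holds one output string per sub-task, in order
```

The key design point mirrored here is that the coordinator stays model-agnostic: swapping in a different vision or tool model only changes what is registered, not the dispatch loop.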
Key facts
- InfantAgent-Next is a multimodal generalist agent for automated computer interaction.
- It handles text, images, audio, and video.
- The agent uses a modular architecture integrating tool-based and pure vision agents.
- Different models collaborate to solve decoupled tasks step-by-step.
- Evaluated on OSWorld, GAIA, and SWE-Bench benchmarks.
- Achieved 7.27% accuracy on OSWorld, higher than Claude-Computer-Use.
- Code and evaluation scripts are open-sourced.