ARTFEED — Contemporary Art Intelligence

InfantAgent-Next: Multimodal AI Agent for Automated Computer Interaction

ai-technology · 2026-05-04

Researchers have unveiled InfantAgent-Next, a multimodal generalist agent designed to interact with computers automatically through text, images, audio, and video. Unlike existing approaches that rely on a single large model or offer only limited modularity, the agent uses a highly modular architecture that integrates tool-based and pure-vision agents, letting different models collaborate to solve decoupled tasks step by step. InfantAgent-Next demonstrated its versatility on both vision-centric benchmarks such as OSWorld, where it achieved 7.27% accuracy and outperformed Claude-Computer-Use, and on tool-intensive benchmarks such as GAIA and SWE-Bench. The code and evaluation scripts are publicly available.
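The modular design described above can be pictured as a simple dispatcher: a controller breaks a request into decoupled subtasks and hands each one, in order, to either a tool-based agent or a vision agent. The sketch below illustrates only this routing idea; all names (`Subtask`, `tool_agent`, `vision_agent`) are hypothetical and do not come from the InfantAgent-Next codebase.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Subtask:
    kind: str      # which specialist should handle it, e.g. "tool" or "vision"
    payload: str   # the instruction for that specialist


def tool_agent(payload: str) -> str:
    # Placeholder for a model that works through external tools
    # (shell commands, code execution, web requests).
    return f"tool-result({payload})"


def vision_agent(payload: str) -> str:
    # Placeholder for a model that acts on raw screenshots / pixels.
    return f"vision-result({payload})"


# Registry mapping subtask kinds to the agent that handles them.
AGENTS: Dict[str, Callable[[str], str]] = {
    "tool": tool_agent,
    "vision": vision_agent,
}


def run(subtasks: List[Subtask]) -> List[str]:
    # Solve the decoupled subtasks sequentially, each with the matching agent.
    return [AGENTS[task.kind](task.payload) for task in subtasks]


results = run([
    Subtask("vision", "locate the Save button"),
    Subtask("tool", "run the test suite"),
])
print(results)
```

In this toy version the specialists are stub functions; in the agent they would be distinct models, which is what lets different models cooperate on separate steps of one task.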

Key facts

  • InfantAgent-Next is a multimodal generalist agent for automated computer interaction.
  • It handles text, images, audio, and video.
  • The agent uses a modular architecture integrating tool-based and pure vision agents.
  • Different models collaborate to solve decoupled tasks step-by-step.
  • Evaluated on OSWorld, GAIA, and SWE-Bench benchmarks.
  • Achieved 7.27% accuracy on OSWorld, higher than Claude-Computer-Use.
  • Code and evaluation scripts are open-sourced.
