InfantAgent-Next: Multimodal AI Agent for Automated Computer Interaction
Researchers have unveiled InfantAgent-Next, a multimodal generalist agent that automates computer interaction through text, images, audio, and video. Unlike existing approaches that rely on a single large model or offer only limited modularity, the agent uses a highly modular architecture that integrates both tool-based and pure vision agents, letting different models collaborate on decoupled tasks step by step. InfantAgent-Next demonstrated its versatility on both vision-centric benchmarks such as OSWorld, where it achieved 7.27% accuracy and outperformed Claude-Computer-Use, and tool-intensive benchmarks such as GAIA and SWE-Bench. The code and evaluation scripts are publicly available.
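The architecture described above could be sketched as a coordinator that routes decoupled sub-tasks to specialized sub-agents and executes them sequentially. The following Python sketch is purely illustrative; all class and handler names are hypothetical and not taken from the InfantAgent-Next codebase.

```python
# Hypothetical sketch of a modular agent dispatcher: a coordinator
# registers specialized sub-agents (e.g., a vision agent and a
# tool-based agent) and routes each decoupled sub-task to the handler
# for its modality, executing them one at a time.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SubTask:
    kind: str      # e.g., "vision" or "tool"
    payload: str   # task description for the sub-agent


class ModularAgent:
    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[str], str]] = {}

    def register(self, kind: str, handler: Callable[[str], str]) -> None:
        # Plug in a sub-agent (model) for a given task modality.
        self._handlers[kind] = handler

    def run(self, tasks: List[SubTask]) -> List[str]:
        # Decoupled sub-tasks are solved sequentially, each by the
        # sub-agent registered for its kind.
        return [self._handlers[t.kind](t.payload) for t in tasks]


agent = ModularAgent()
agent.register("vision", lambda p: f"vision-agent handled: {p}")
agent.register("tool", lambda p: f"tool-agent handled: {p}")

results = agent.run([
    SubTask("vision", "locate the Save button"),
    SubTask("tool", "run the test script"),
])
# results holds one output string per sub-task, in order
```

The key design point mirrored here is that the coordinator stays model-agnostic: swapping in a different vision or tool model only changes what is registered, not the dispatch loop.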
Key facts
- InfantAgent-Next is a multimodal generalist agent for automated computer interaction.
- It handles text, images, audio, and video.
- The agent uses a modular architecture integrating tool-based and pure vision agents.
- Different models collaborate to solve decoupled tasks step-by-step.
- Evaluated on OSWorld, GAIA, and SWE-Bench benchmarks.
- Achieved 7.27% accuracy on OSWorld, higher than Claude-Computer-Use.
- Code and evaluation scripts are open-sourced.