Researchers from Microsoft, Carnegie Mellon University, and Columbia University introduced WindowsAgentArena, a platform for testing and benchmarking multi-modal desktop AI agents in the Windows OS environment.
The platform parallelizes evaluation, completing a full benchmark run in as little as 20 minutes, while agents interact with a real Windows OS rather than a simplified simulation, yielding more realistic agent behavior.
WindowsAgentArena integrates with Docker containers, which provide a secure, isolated testing environment and allow evaluations to scale across multiple agents in parallel, as sketched below.
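As a rough illustration of this containerized scaling, the sketch below uses the Docker SDK for Python to fan task shards out across parallel workers. The image tag, entrypoint command, and shard files are hypothetical placeholders, not WindowsAgentArena's actual interface.

```python
import docker

# Hypothetical illustration: distribute task shards across isolated
# containers so benchmark evaluations run in parallel.
NUM_WORKERS = 4
IMAGE = "windowsarena/winarena:latest"  # hypothetical image tag

client = docker.from_env()
task_shards = [f"tasks_shard_{i}.json" for i in range(NUM_WORKERS)]

containers = []
for i, shard in enumerate(task_shards):
    # Each container evaluates one shard of the benchmark; results are
    # written to a mounted volume for later aggregation.
    c = client.containers.run(
        IMAGE,
        command=f"python run.py --task-file {shard}",  # hypothetical entrypoint
        detach=True,
        name=f"waa-worker-{i}",
        volumes={"/tmp/waa-results": {"bind": "/results", "mode": "rw"}},
    )
    containers.append(c)

# Block until every worker finishes, then report exit codes.
for c in containers:
    status = c.wait()
    print(c.name, "exited with", status["StatusCode"])
```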
WindowsAgentArena offers a comprehensive, reproducible benchmark of 154 diverse tasks designed to mirror everyday Windows workflows.
The benchmark scores agents on task completion rather than on how closely they follow human demonstrations.
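A minimal sketch of what outcome-based scoring means in practice: each task carries a post-condition checked against the final system state, so any action sequence that reaches the goal counts as success. The Task structure and helper names below are illustrative, not the benchmark's real evaluator API.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable

@dataclass
class Task:
    """A benchmark task scored by its end state, not by the steps taken."""
    instruction: str
    is_complete: Callable[[], bool]  # post-condition over system state

def evaluate(task: Task, run_agent: Callable[[str], None]) -> bool:
    run_agent(task.instruction)  # the agent acts freely in the environment
    return task.is_complete()    # only the outcome is scored

# Example: the task passes if the file exists afterward, regardless of
# whether the agent used Explorer, Notepad, or the command line.
task = Task(
    instruction="Create a file named notes.txt on the desktop",
    is_complete=lambda: (Path.home() / "Desktop" / "notes.txt").exists(),
)
success = evaluate(task, run_agent=lambda instr: None)  # stub agent for illustration
print("task completed:", success)
```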
Navi, a new multi-modal AI agent introduced with the platform, demonstrated adaptability across different environments, handling WindowsAgentArena's desktop tasks and also performing reasonably well on the secondary web-based benchmark Mind2Web.
Navi relies on Set-of-Marks (SoM) prompting and UI Automation (UIA) tree parsing, enabling more precise agent interactions with on-screen elements and paving the way for more capable and efficient AI agents.
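In simplified form, a Set-of-Marks prompt can be assembled from parsed UI elements: each element's bounding box receives a numeric mark, and the agent then refers to marks rather than raw pixel coordinates. The element structure and function below are a hypothetical sketch, not Navi's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class UIElement:
    """Simplified stand-in for a node parsed from the UIA accessibility tree."""
    name: str
    control_type: str
    bbox: tuple[int, int, int, int]  # left, top, right, bottom in screen pixels

def build_som_prompt(elements: list[UIElement]) -> str:
    """Assign a numeric mark to each element so the model can say
    e.g. 'click mark 2' instead of guessing pixel coordinates."""
    lines = []
    for mark, el in enumerate(elements, start=1):
        cx = (el.bbox[0] + el.bbox[2]) // 2  # center of the bounding box
        cy = (el.bbox[1] + el.bbox[3]) // 2
        lines.append(f"[{mark}] {el.control_type} '{el.name}' at ({cx}, {cy})")
    return "\n".join(lines)

elements = [
    UIElement("File", "MenuItem", (10, 5, 60, 30)),
    UIElement("Save", "Button", (70, 5, 120, 30)),
]
print(build_som_prompt(elements))
# [1] MenuItem 'File' at (35, 17)
# [2] Button 'Save' at (95, 17)
```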
Agent performance is still comparatively low: Navi's 19.5% success rate remains far below the 74.5% achieved by unassisted humans.
Researchers can leverage WindowsAgentArena's diverse set of tasks and innovative metrics to accelerate progress in multi-modal agent research.