Microsoft launches benchmark to improve performance of AI agents

Microsoft has developed the Windows Agent Arena benchmark to show how well AI assistants can support Windows users in their tasks.

The benchmark specifically tests the performance of AI assistants on Windows PCs. It measures both the accuracy of the tasks performed and the speed at which the AI agent can interact with commonly used Windows apps. The items tested include the web browsers Microsoft Edge and Google Chrome, system functions such as Explorer, and apps such as Visual Studio Code, Notepad, Paint, and the clock. The test comprises 150 different operations.

AI agents are not convincing at the moment

The technology apparently needs to mature further before it can convince Windows users that AI agents on the PC are genuinely helpful. Microsoft Research, which developed the benchmark, put together its own agent, Navi. That agent achieved an overall score of only 19.5 percent, compared with a human success rate of 74.5 percent. Even so, Windows Agent Arena gives AI agent developers a solid measure of how well their latest work performs.

Rogerio Bonatti, the study’s lead author, said, “Windows Agent Arena provides a realistic and comprehensive environment to push the limits of AI agents. By making our benchmark open-source, we hope to accelerate research in this crucial area within the AI community.”

The development of high-performing AI agents is also important for Microsoft as a way to boost disappointing sales of Copilot+ PCs. The latest models from PC makers have the hardware capabilities to run AI apps. For that to be useful to users, however, the apps themselves must also perform well.

