New Platform Helps Evaluate AI For Complex Computer Use

Imagine asking an AI to plan your trip itinerary, book and pay for all your flights, and arrange your airport transport - all from a single request. An international research team is working to make this vision a reality.

Dr. Victor Zhong

Co-creator, Computer Agent Arena

Faculty of Mathematics

The team, composed of researchers from the University of Waterloo, the University of Hong Kong, Salesforce Research and Carnegie Mellon University, developed Computer Agent Arena - an evaluation platform for assessing and improving computer agents.

A computer agent is a type of software that can perform tasks on behalf of a person or organization without needing constant human intervention. It can interpret the state of the computer and act autonomously to help users solve problems. Examples of computer agents include voice assistants like Siri and Alexa, which can help users send messages and schedule meetings.

AI-based computer agents struggle with complex computer tasks because such tasks require controlling multiple applications and coordinating many steps. For example, filing an expense report may be difficult because it requires updating a spreadsheet while searching through multiple emails and folders filled with bank statements and receipts.

Computer Agent Arena is the first interactive computer use evaluation platform that focuses on performing diverse tasks across multiple applications. This work is an extension of the researchers' work on OSWorld, the world's first scalable and real computer environment for multimodal agents.

"Computer Agent Arena provides a platform for the research community to develop effective and efficient agents that generalize to real-world computer usage," says co-developer Dr. Victor Zhong, assistant professor at the Cheriton School of Computer Science. Like other Waterloo researchers, he is investigating human-technology interactions, exploring how to mitigate everyday problems by creating novel technologies.

"Computer Agent Arena is distinct from similar research like Mind2Web and WebArena because it provides unified application programming interfaces for comprehensive observations and actions in an executable environment with multiple applications."

Through Computer Agent Arena, users can assess and compare various computer agents based on large language models (LLMs) and vision language models. First, users select an operating system, such as Windows, and applications, like Google Chrome and Excel. Users can then prompt the computer agent with a task, which two AI models perform simultaneously in real time. After completion, users can rate each model's performance and provide feedback.
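Arena-style evaluation platforms commonly aggregate head-to-head user votes like these into relative skill ratings, often using an Elo-style update. The sketch below is purely illustrative - it is an assumption about how such pairwise votes could be scored, not Computer Agent Arena's actual ranking code, and the function names are hypothetical.

```python
# Hypothetical sketch: turning pairwise "model A vs. model B" votes into
# Elo-style ratings, a scheme commonly used by arena-style platforms.
# Not Computer Agent Arena's actual scoring implementation.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that agent A beats agent B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a: float, r_b: float, outcome: float, k: float = 32.0):
    """outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (outcome - e_a), r_b + k * (e_a - outcome)

# Two agents start at 1000; a user votes that agent A did better.
ratings = {"agent_a": 1000.0, "agent_b": 1000.0}
ratings["agent_a"], ratings["agent_b"] = update_elo(
    ratings["agent_a"], ratings["agent_b"], outcome=1.0
)
print(round(ratings["agent_a"]), round(ratings["agent_b"]))  # 1016 984
```

Because each vote only shifts the two ratings involved, rankings can be updated incrementally as user feedback arrives.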

Ultimately, the team seeks to provide a diverse and dynamic platform for building and evaluating agents that can perform real-world computer tasks as safely, effectively and efficiently as humans do.

"Our current findings show that foundation models such as GPT4 and Claude are far from being able to act safely and effectively as assistant computer agents," Zhong says. "Computer Agent Arena provides a timely testbed to develop the next generation of AI agents."

A demo of Computer Agent Arena shows how users can compare and assess various AI agents' ability to carry out prompted tasks, including web navigation.
