IBM Research unveils a new benchmark to measure AI reasoning

AGENT, created by IBM Research, is a benchmark for evaluating an AI model’s core psychological reasoning ability, or common sense. It is intended to help users create and test AI models with human-like reasoning.

Abishek Bhandwaldar, a research software engineer at IBM, and Tianmin Shu, a postdoc at MIT, described AGENT in a blog post. They said the team is making progress toward building AI agents that can infer mental states, predict future actions, and even work alongside human partners.

However, users lack a rigorous benchmarking tool for said agents.

What the tool measures

Specifically, AGENT evaluates an AI model’s core psychological reasoning ability and common sense.

The researchers challenged two baseline models with AGENT and evaluated how they performed, using a generalization-focused protocol developed by Big Blue. According to the researchers, the results show that the benchmark is useful for evaluating the core psychological reasoning ability of an AI model.

AGENT is a large-scale dataset of 3D animations of an agent moving under a variety of constraints and interacting with physical objects, according to the research unit.

How it works

In their blog post, Bhandwaldar and Shu said that the videos are organized into distinct trials. Each trial has one or more familiarization videos showing an agent’s typical behavior in a certain physical environment.

The familiarization videos are paired with ‘test’ videos of the same agent acting in a new environment. Each test video is labeled ‘expected’ or ‘surprising’, depending on whether what the agent chooses to do is consistent with its behavior in the familiarization videos.

The trials test a minimal set of key common-sense concepts considered crucial to the core psychology of young children. They are grouped into four scenarios: cost-reward trade-offs, unobserved constraints, action efficiency, and goal preferences.
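To make the trial structure concrete, here is a minimal Python sketch of how such trials could be represented and scored per scenario. Only the four scenario names and the ‘expected’/‘surprising’ labels come from the blog post; the `Trial` class, its field names, the file names, and the `accuracy_by_scenario` function are hypothetical, and the actual AGENT release may store and score trials differently.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List

# Scenario names as listed in the article. Every other name in this
# sketch is hypothetical and not taken from the AGENT release.
SCENARIOS = (
    "cost-reward trade-offs",
    "unobserved constraints",
    "action efficiency",
    "goal preferences",
)
LABELS = ("expected", "surprising")


@dataclass
class Trial:
    scenario: str                      # one of SCENARIOS
    familiarization_videos: List[str]  # one or more clips of typical behavior
    test_video: str                    # same agent, new environment
    label: str                         # "expected" or "surprising"

    def __post_init__(self) -> None:
        if self.scenario not in SCENARIOS:
            raise ValueError(f"unknown scenario: {self.scenario!r}")
        if not self.familiarization_videos:
            raise ValueError("a trial needs at least one familiarization video")
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")


def accuracy_by_scenario(
    trials: List[Trial],
    predict: Callable[[Trial], str],
) -> Dict[str, float]:
    """Fraction of trials per scenario where predict() matches the label.

    This simple label-matching score is an assumption made for
    illustration, not the benchmark's official metric.
    """
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for trial in trials:
        total[trial.scenario] += 1
        if predict(trial) == trial.label:
            correct[trial.scenario] += 1
    return {scenario: correct[scenario] / total[scenario] for scenario in total}


if __name__ == "__main__":
    demo_trials = [
        Trial("goal preferences", ["fam_1.mp4"], "test_1.mp4", "expected"),
        Trial("goal preferences", ["fam_2.mp4"], "test_2.mp4", "surprising"),
        Trial("action efficiency", ["fam_3.mp4"], "test_3.mp4", "expected"),
    ]
    # A trivial stand-in model that is never surprised.
    print(accuracy_by_scenario(demo_trials, lambda trial: "expected"))
```

Reporting scores per scenario rather than as a single number shows which of the four concepts a model handles and which it does not.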

You can find out more about the benchmark here.