AGENT is a benchmark created by IBM Research for evaluating an AI model’s core psychological reasoning ability and common sense, helping users create and test AI models with human-like reasoning.
In a blog post about AGENT, Abishek Bhandwaldar, a research software engineer at IBM, and Tianmin Shu, a postdoc at MIT, said the team is making progress toward building AI agents that can infer mental states, predict future actions, and eventually work alongside human partners.
However, researchers have lacked a rigorous benchmark for evaluating such agents.
What the tool measures
AGENT is designed to fill that gap by evaluating an AI model’s core psychological reasoning ability and common sense.
AGENT works by challenging two baseline models and evaluating their performance using a generalization-focused protocol developed by Big Blue. The results suggest the benchmark is useful for evaluating the core psychological reasoning ability of any AI model.
AGENT is a large-scale dataset of 3D animations of an agent moving under a variety of constraints and interacting with physical objects, according to the research unit.
How it works
Writing in their blog post, Bhandwaldar and Shu said the dataset is organized into distinct trials. Each trial includes one or more ‘familiarization’ videos showing an agent’s typical behavior in a given physical environment.
These are paired with ‘test’ videos of the same agent acting in a new environment, labeled ‘expected’ or ‘surprising’ depending on whether the agent’s actions are consistent with its behavior in the familiarization videos.
The trials test a minimal set of key common-sense concepts considered crucial in the core psychology of young children. They are grouped into four scenarios: cost-reward trade-offs, unobserved constraints, action efficiency, and goal preferences.
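The trial structure described above can be sketched in code. The following is a minimal, hypothetical illustration, not the actual AGENT evaluation code: the names (`Trial`, `score_trial`) and the scoring rule, in which a model passes if it rates the ‘surprising’ test video as more surprising than the ‘expected’ one, are assumptions made for clarity.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Trial:
    """One AGENT-style trial (hypothetical representation)."""
    scenario: str                # e.g. "goal preferences"
    familiarization: List[str]   # videos of the agent's typical behavior
    expected_test: str           # test video consistent with that behavior
    surprising_test: str         # test video inconsistent with it

# A surprise function maps (familiarization videos, test video) to a
# numeric surprise rating; here it stands in for a model under test.
SurpriseFn = Callable[[List[str], str], float]

def score_trial(trial: Trial, surprise: SurpriseFn) -> bool:
    """Assumed scoring rule: the model passes if it finds the
    'surprising' video more surprising than the 'expected' one,
    given the same familiarization videos."""
    s_expected = surprise(trial.familiarization, trial.expected_test)
    s_surprising = surprise(trial.familiarization, trial.surprising_test)
    return s_surprising > s_expected

def accuracy(trials: List[Trial], surprise: SurpriseFn) -> float:
    """Fraction of trials the model passes."""
    return sum(score_trial(t, surprise) for t in trials) / len(trials)
```

A model that has internalized the common-sense concepts a scenario targets should rate behavior that contradicts the familiarization videos as more surprising, so relative comparison within a trial, rather than an absolute threshold, is one plausible way to score such a benchmark.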
You can find out more about the benchmark here.