IBM Research unveils a new benchmark to measure AI reasoning

AGENT is a benchmark from IBM Research for evaluating an AI model’s core psychological reasoning ability, or common sense. It is intended to help users build and test AI models with human-like reasoning.

Abishek Bhandwaldar, a research software engineer at IBM, and Tianmin Shu, a postdoc at MIT, described AGENT in a blog post. They said the team is making progress toward building AI agents that can infer mental states, predict future actions, and even be deployed alongside human partners.

Until now, however, users have lacked a rigorous benchmark for evaluating such agents.

What the tool measures

Specifically, AGENT evaluates an AI model’s core psychological reasoning ability and common sense.

The researchers tested two baseline models on AGENT, using a generalization-focused evaluation protocol developed by Big Blue. According to the researchers, the results show that the benchmark is useful for evaluating the core psychological reasoning ability of AI models.

AGENT is a large-scale dataset of 3D animations of an agent moving under a variety of constraints and interacting with physical objects, according to the research unit.

How it works

Writing in their blog post, Bhandwaldar and Shu said that the dataset is organized into distinct trials. Each trial has one or more familiarization videos showing an agent’s typical behavior in a given physical environment.

Each trial also includes a ‘test’ video showing the same agent acting in a new environment. The test video is labeled ‘expected’ or ‘surprising’, depending on whether the agent’s choices are consistent with its behavior in the familiarization videos.

The trials test a minimal set of key common-sense concepts considered crucial to the core psychology of young children. They are grouped into four scenarios: cost-reward trade-offs, unobserved constraints, action efficiency, and goal preferences.
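
To make the trial structure concrete, here is a minimal Python sketch of how one such trial could be represented and how a model's judgments could be scored against the labels. It is an illustration only, not IBM's code or the benchmark's actual data format: the `Trial` class, the file names, the `rate_surprise` callback, and the 0.5 threshold are all assumptions made for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Scenario(Enum):
    """The four scenario groups named in the article."""
    COST_REWARD_TRADE_OFFS = "cost-reward trade-offs"
    UNOBSERVED_CONSTRAINTS = "unobserved constraints"
    ACTION_EFFICIENCY = "action efficiency"
    GOAL_PREFERENCES = "goal preferences"


@dataclass
class Trial:
    """One trial: familiarization videos plus a labeled test video.

    Field names and file paths are hypothetical, chosen for this sketch.
    """
    scenario: Scenario
    familiarization_videos: List[str]
    test_video: str
    label: str  # "expected" or "surprising"


def accuracy(
    trials: List[Trial],
    rate_surprise: Callable[[Trial], float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the benchmark
) -> float:
    """Fraction of trials where the thresholded surprise rating matches the label."""
    if not trials:
        raise ValueError("at least one trial is required")
    correct = 0
    for trial in trials:
        predicted = "surprising" if rate_surprise(trial) >= threshold else "expected"
        if predicted == trial.label:
            correct += 1
    return correct / len(trials)


# Two made-up trials and a trivial "model" that is never surprised.
trials = [
    Trial(Scenario.GOAL_PREFERENCES, ["fam_1.mp4"], "test_1.mp4", "expected"),
    Trial(Scenario.ACTION_EFFICIENCY, ["fam_2.mp4", "fam_3.mp4"], "test_2.mp4", "surprising"),
]


def always_expected(trial: Trial) -> float:
    return 0.0


print(accuracy(trials, always_expected))  # 0.5
```

A model that is never surprised gets only the ‘expected’ trial right, which is why the sketch needs both label types to tell a trivial model from one that actually follows the agent's behavior.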

You can find out more about the benchmark here.
