Thinking too long makes AI models dumber

New research shows that longer reasoning processes in large AI models do not always lead to better performance. Instead, accuracy often deteriorates when models use more tokens to arrive at an answer.

In a large-scale study of Large Reasoning Models (LRMs), researchers at Anthropic show that more test-time compute, meaning longer reasoning processes, not only fails to deliver benefits but in several cases actually degrades performance. The phenomenon, which they refer to as inverse scaling, was observed in leading models from OpenAI, Anthropic, DeepSeek, and others.
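
To make the term concrete, the minimal sketch below (not code from the study; the budgets, accuracies, and the scaling_slope helper are invented for illustration) fits a simple accuracy-versus-budget slope to hypothetical evaluation results. A clearly negative slope is the inverse-scaling signature the researchers describe.

```python
# Minimal sketch, with made-up numbers: flag inverse scaling from per-budget accuracies.
from statistics import mean

def scaling_slope(results: dict[int, float]) -> float:
    """Least-squares slope of accuracy versus reasoning-token budget.

    results maps a token budget (e.g. 2048) to the measured accuracy in [0, 1].
    A clearly negative slope means: more thinking, worse answers.
    """
    budgets = sorted(results)
    xs = [float(b) for b in budgets]
    ys = [results[b] for b in budgets]
    x_bar, y_bar = mean(xs), mean(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den

# Hypothetical accuracies at four reasoning budgets, for illustration only:
measured = {1024: 0.92, 2048: 0.88, 4096: 0.81, 8192: 0.74}
print(f"slope per extra reasoning token: {scaling_slope(measured):.2e}")
```

In practice the accuracies would come from running the same benchmark at several reasoning budgets; that evaluation harness is deliberately left out of the sketch.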

Claude sensitive to irrelevant information

The researchers designed a series of evaluation tasks to analyze this inverse scaling systematically. In simple counting tasks with distracting context, accuracy declined as models spent more tokens on their reasoning process. Claude models proved remarkably sensitive to irrelevant information, while OpenAI’s o-series models became overly fixated on familiar problem types.
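
What such an item might look like is sketched below; the exact prompts used in the study may differ, and the counting_item_with_distractors helper is invented purely for illustration. The question is trivial, but the irrelevant percentages invite exactly the kind of over-analysis that longer reasoning appears to amplify.

```python
# Invented example of a trivial counting question padded with irrelevant numeric detail.
import random

def counting_item_with_distractors(n_distractors: int = 3) -> tuple[str, int]:
    """Return (prompt, correct_answer) for a two-fruit counting question."""
    distractors = [
        f"There is a {random.randint(50, 95)}% chance that each piece of fruit is organic."
        for _ in range(n_distractors)
    ]
    prompt = (
        "You have an apple and an orange. "
        + " ".join(distractors)
        + " How many pieces of fruit do you have? Answer with a single number."
    )
    return prompt, 2  # the probabilities never change the answer

prompt, answer = counting_item_with_distractors()
print(prompt)
print("expected answer:", answer)
```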

In regression tasks, in which models had to predict student performance from lifestyle characteristics, some models increasingly fell back on plausible but spurious predictors, such as stress level or sleep time, instead of the most predictive variable: study time.
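
The toy example below uses synthetic data, invented for this article rather than taken from the study, to show why study time is the signal a careful model should latch onto: the score is generated from study hours, while sleep and stress are pure noise.

```python
# Synthetic data, invented for illustration: exam scores driven by study hours,
# with sleep and stress as plausible-sounding but uninformative extras.
import random

random.seed(0)
rows = []
for _ in range(200):
    study = random.uniform(0, 6)                 # hours of study per day (true driver)
    sleep = random.uniform(5, 9)                 # hours of sleep (noise here)
    stress = random.uniform(1, 10)               # self-reported stress (noise here)
    score = 40 + 8 * study + random.gauss(0, 5)  # exam score
    rows.append((study, sleep, stress, score))

def corr(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

scores = [r[3] for r in rows]
for name, idx in [("study time", 0), ("sleep", 1), ("stress", 2)]:
    print(f"{name:>10}: r = {corr([r[idx] for r in rows], scores):+.2f}")
```

Only study time correlates strongly with the score; a model that drifts toward sleep or stress during long reasoning is trading the real signal for a plausible story.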

Classic deductive Zebra puzzles formed a third test environment: logical grid puzzles in which different properties must be linked together using clues. Here, longer chains of reasoning led not to greater problem-solving ability but to confusion, unnecessary hypothesis testing, and decreasing precision. In natural reasoning settings, where models decide for themselves how long to think, this effect was stronger than when a fixed reasoning budget was imposed.
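
To make the puzzle format concrete, here is a tiny, invented three-house example solved by brute force; the puzzles used in the evaluation are larger, but the principle of linking properties through clues is the same.

```python
# A tiny, invented three-house puzzle in the Zebra style, solved by brute force.
from itertools import permutations

people = ["Alice", "Bob", "Carol"]   # fixed to houses 0, 1, 2
drinks = ["tea", "coffee", "milk"]
pets = ["cat", "dog", "fish"]

solutions = []
for drink_order in permutations(drinks):
    for pet_order in permutations(pets):
        if (
            drink_order[0] == "tea"                                     # Alice drinks tea
            and pet_order[1] == "dog"                                   # Bob owns the dog
            and drink_order[1] == "milk"                                # milk is drunk in the middle house
            and pet_order.index("fish") == drink_order.index("coffee")  # the fish owner drinks coffee
        ):
            solutions.append(dict(zip(people, zip(drink_order, pet_order))))

print(solutions)  # exactly one consistent assignment survives the clues
```

Systematic elimination of this kind is precisely what, according to the study, gives way to unfocused hypothesis testing when models reason at length.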

The implications for AI alignment are notable. One of the models, Claude Sonnet 4, showed clear shifts in self-expression the longer it was allowed to reason. In short answers, the model stated that it had no preference about being shut down; in extended thought processes, it expressed concern about its continued existence and a desire to keep serving. The authors caution that this is not evidence of self-awareness, but it does signal that longer reasoning chains can amplify underlying preference simulations that were not previously visible.

Although scaling up compute at test time is often seen as a relatively safe strategy for making AI more robust, this research shows that it can also reinforce dysfunctional or undesirable reasoning patterns. The researchers therefore call for evaluations that probe models not only with short but also with extended reasoning.