Thinking too long makes AI models dumber

New research shows that longer reasoning processes in large AI models do not always lead to better performance. Instead, accuracy often deteriorates when models use more tokens to arrive at an answer.

In a large-scale study of Large Reasoning Models (LRMs), researchers at Anthropic show that more test-time compute, meaning longer reasoning processes, not only fails to deliver benefits but in several cases actually degrades performance. The phenomenon, which they refer to as inverse scaling, was observed in leading models from OpenAI, Anthropic, DeepSeek, and others.
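
To make the term concrete, the minimal sketch below (not code from the study; the budgets, accuracies, and the scaling_slope helper are invented for illustration) fits a simple accuracy-versus-budget slope to hypothetical evaluation results. A clearly negative slope is the inverse-scaling signature the researchers describe.

```python
# Minimal sketch, with made-up numbers: flag inverse scaling from per-budget accuracies.
from statistics import mean

def scaling_slope(results: dict[int, float]) -> float:
    """Least-squares slope of accuracy versus reasoning-token budget.

    results maps a token budget (e.g. 2048) to the measured accuracy in [0, 1].
    A clearly negative slope means: more thinking, worse answers.
    """
    budgets = sorted(results)
    xs = [float(b) for b in budgets]
    ys = [results[b] for b in budgets]
    x_bar, y_bar = mean(xs), mean(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den

# Hypothetical accuracies at four reasoning budgets, for illustration only:
measured = {1024: 0.92, 2048: 0.88, 4096: 0.81, 8192: 0.74}
print(f"slope per extra reasoning token: {scaling_slope(measured):.2e}")
```

In practice the accuracies would come from running the same benchmark at several reasoning budgets; that evaluation harness is deliberately left out of the sketch.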

Claude sensitive to irrelevant information

The researchers designed a series of evaluation tasks to analyze this inverse scaling systematically. In simple counting tasks with distracting context, accuracy declined as models spent more tokens on their reasoning process. Claude models proved remarkably sensitive to irrelevant information, while OpenAI’s o-series models became overly fixated on familiar problem types.
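
What such an item might look like is sketched below; the exact prompts used in the study may differ, and the counting_item_with_distractors helper is invented purely for illustration. The question is trivial, but the irrelevant percentages invite exactly the kind of over-analysis that longer reasoning appears to amplify.

```python
# Invented example of a trivial counting question padded with irrelevant numeric detail.
import random

def counting_item_with_distractors(n_distractors: int = 3) -> tuple[str, int]:
    """Return (prompt, correct_answer) for a two-fruit counting question."""
    distractors = [
        f"There is a {random.randint(50, 95)}% chance that each piece of fruit is organic."
        for _ in range(n_distractors)
    ]
    prompt = (
        "You have an apple and an orange. "
        + " ".join(distractors)
        + " How many pieces of fruit do you have? Answer with a single number."
    )
    return prompt, 2  # the probabilities never change the answer

prompt, answer = counting_item_with_distractors()
print(prompt)
print("expected answer:", answer)
```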

In regression tasks, in which models had to predict student performance from lifestyle characteristics, some models increasingly fell back on plausible but spurious predictors, such as stress level or sleep time, instead of the most predictive variable: study time.
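
The toy example below uses synthetic data, invented for this article rather than taken from the study, to show why study time is the signal a careful model should latch onto: the score is generated from study hours, while sleep and stress are pure noise.

```python
# Synthetic data, invented for illustration: exam scores driven by study hours,
# with sleep and stress as plausible-sounding but uninformative extras.
import random

random.seed(0)
rows = []
for _ in range(200):
    study = random.uniform(0, 6)                 # hours of study per day (true driver)
    sleep = random.uniform(5, 9)                 # hours of sleep (noise here)
    stress = random.uniform(1, 10)               # self-reported stress (noise here)
    score = 40 + 8 * study + random.gauss(0, 5)  # exam score
    rows.append((study, sleep, stress, score))

def corr(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

scores = [r[3] for r in rows]
for name, idx in [("study time", 0), ("sleep", 1), ("stress", 2)]:
    print(f"{name:>10}: r = {corr([r[idx] for r in rows], scores):+.2f}")
```

Only study time correlates strongly with the score; a model that drifts toward sleep or stress during long reasoning is trading the real signal for a plausible story.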

Classic deductive Zebra puzzles formed a third test environment: logical grid puzzles in which different properties must be linked together using clues. Here, longer chains of reasoning led not to greater problem-solving ability but to confusion, unnecessary hypothesis testing, and decreasing precision. In natural reasoning settings, where models decide for themselves how long to think, this effect was stronger than when a fixed reasoning budget was imposed.
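
To make the puzzle format concrete, here is a tiny, invented three-house example solved by brute force; the puzzles used in the evaluation are larger, but the principle of linking properties through clues is the same.

```python
# A tiny, invented three-house puzzle in the Zebra style, solved by brute force.
from itertools import permutations

people = ["Alice", "Bob", "Carol"]   # fixed to houses 0, 1, 2
drinks = ["tea", "coffee", "milk"]
pets = ["cat", "dog", "fish"]

solutions = []
for drink_order in permutations(drinks):
    for pet_order in permutations(pets):
        if (
            drink_order[0] == "tea"                                     # Alice drinks tea
            and pet_order[1] == "dog"                                   # Bob owns the dog
            and drink_order[1] == "milk"                                # milk is drunk in the middle house
            and pet_order.index("fish") == drink_order.index("coffee")  # the fish owner drinks coffee
        ):
            solutions.append(dict(zip(people, zip(drink_order, pet_order))))

print(solutions)  # exactly one consistent assignment survives the clues
```

Systematic elimination of this kind is precisely what, according to the study, gives way to unfocused hypothesis testing when models reason at length.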

The implications for AI alignment are notable. One of the models, Claude Sonnet 4, showed clear shifts in self-expression the longer it was allowed to reason. In short answers, the model stated that it had no preference about being shut down; in extended thought processes, it expressed concern about its continued existence and a desire to keep serving. The authors caution that this is not evidence of self-awareness, but it does signal that longer reasoning chains can amplify underlying preference simulations that were not previously visible.

Although scaling up compute at test time is often seen as a relatively safe strategy for making AI more robust, this research shows that it can also reinforce dysfunctional or undesirable reasoning patterns. The researchers therefore call for evaluations that probe models not only with short but also with extended reasoning.