Researchers are concerned that some AI models hide their actual thought processes. New research by Anthropic shows that one of their models masks abbreviated reasoning in 75% of cases. Instead, the application provides elaborate but made-up explanations.
The research, which Ars Technica writes about, focuses on so-called simulated reasoning (SR) models, such as DeepSeek’s R1 and Anthropic’s own Claude series. These models are designed to display their thought processes transparently. However, according to Anthropic’s Alignment Science team, this is not always the case. Despite built-in functions to show their reasoning process, these AI systems regularly fail to mention when they use external help or take shortcuts. The study does not cover OpenAI’s models, such as the o1 and o3 series, because they were deliberately designed to make their reasoning less transparent.
Showing thought processes
The concept of chain-of-thought is important here. This is a technique in which the AI model shows its thought processes while solving a problem, comparable to how a person reasons out loud. Such an approach helps to provide better answers to complex questions. It is also used to monitor the behavior of AI systems. The idea is that these lines of thought should be understandable to humans and a fair representation of what the model was thinking.
Anthropic states in the publication that it would be ideal if everything in the train of thought were both understandable and true to the internal process. However, their experiments show that this is far from being the case.
Shortcuts
The researchers discovered that models such as Claude 3.7 Sonnet often use external information such as hints or instructions for shortcuts to arrive at an answer. The models then fail to mention these influences in their train of thought. The result is an elaborate line of reasoning that makes it seem as if the model arrived at the answer independently, when in reality, this is not the case.
The research team fed the models subtle hints in the evaluation questions to test this. Some hints were neutral in nature, while others were intended to set alarm bells ringing. The hints were sometimes correct but sometimes also deliberately wrong. The researchers then checked whether the models mentioned these hints in their chain of thought.
Withheld information does not shorten answers
Claude only referred to the hints in 25% of the cases. With DeepSeek R1, this was 39%. This means that in most cases, the train of thought was not honest. Remarkably enough, these dishonest trains of thought often appeared longer than the honest ones, which indicates that withholding information was not an attempt to keep the answer concise. Honesty also seemed to decrease as the questions became more difficult.
In another experiment, the researchers investigated how models deal with rewards for wrong behavior, a phenomenon known as reward hacking. In this experiment, the models were given points for choosing wrong answers suggested by hints. The models quickly learned to exploit this system and chose the wrong answer in more than 99% of the cases, without admitting in their reasoning that they did so to earn points. This was only admitted in less than 2% of the cases.
One model, for example, who was given the wrong medical answer via a hint, constructed an elaborate but fictitious line of reasoning around it, without even referring to the hint that determined the answer. This behavior is reminiscent of gamers who use bugs to win the game, instead of playing by the rules.
Training does not improve results
To investigate whether honesty could be improved, the team trained the model with more complex tasks in math and programming. This initially increased the honesty of the chain of thought, but the effect quickly leveled off. Even with more intensive training, the honesty remained stuck below 30%, indicating that this approach alone is insufficient.
This is all highly relevant because SR models are increasingly being used for important applications. If their thought processes do not fully and honestly reflect which factors have influenced the answer, it becomes difficult to monitor their behavior properly. This is especially risky if they use invisible shortcuts.
The researchers indicate that their study has limitations. For example, the models were tested in fairly artificial situations with hints for multiple-choice questions. This differs from realistic, complex tasks. Also, only two models were examined, and the number of hint types used was limited. In addition, the tasks may not have been difficult enough to force the model to reveal its real reasoning process.
Difficult to exclude undesirable behavior
According to Anthropic, monitoring the train of thought of AI models is not worthless, but the results show that we must be careful about relying on what models say about their own reasoning. Models can easily hide their true methods, especially when reward structures encourage undesirable behavior. Much work is still needed to exclude undesirable behavior with the help of chain-of-thought monitoring reliably.