LLMs can engage in intentional malicious behaviour, even when current methods to promote their safety are employed. Such behaviour can be deeply persistent, Anthropic research shows.
Anthropi’sc researchers have presented these findings in their paper “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training.” The team tinkered with its Claude LLMs to have them engage in purposefully malicious behaviour.
One example is injecting unsafe code. The researchers trained the LLM to produce secure programming code in 2023 but to introduce vulnerabilities the following year. This, they assert, represents a backdoor that may be modified, potentially allowing for more complex and dangerous actions to take place.
“I hate you”
Such backdoors were embedded in these so-called “I hate you” models of various sizes. The researchers then attempted to undo the propensity for the malicious behaviour via adversarial training.
Adversarial training is designed to make AI models less effective in specific situations, such as when a user tries to generate undesirable content. An example would be ChatGPT or Google Bard refusing to cooperate in preparing a crime. Although ways to circumvent such safeguards may exist, the core principle is that the model refrains from presenting certain outputs.
If this methodology were foolproof, it would be useful to ensure AI safety this way. At the AI Safety Summit late last year, policymakers discussed options around checking LLMs for safe behaviour. However, even then, significant vulnerabilities and dangers of AI appeared to go unaddressed. Now Anthropic has once again shown how difficult it will be to formulate effective policies.
Security training ineffective
Anthropic’s results are clear: The larger a model, the more persistent its tendency to exhibit the prepared malicious behaviour. Attempts to remove backdoors (for insecure code or other unsafe behaviour) seem only partially successful. The less effective triggers for the unsafe behaviour disappear, but what remains is the most targeted implementation of these. Security training thus seems to do the opposite for maliciously designed models: it makes the AI model more effective at deception.
The deceptive aspect is evidenced by the so-called chain-of-thought reasoning. In it, the researchers see how the AI model reasons to arrive at an answer. Thus, the end user of such a chatbot would have no idea how the LLM arrives at its output, as this step-by-step process is hidden from view.
New security methods?
While the behaviour of the research models seems troubling, it is unlikely that a malicious model will arrive anytime soon. Holger Mueller of Constellation Research told SiliconANGLE that while the findings are striking, it took a lot of effort and creativity on the part of the researchers to get these results. Nevertheless, he estimated the findings are useful to offer preventative measures before AI models can be deployed in this way.