3 min Applications

Small amount of poisoned data can influence AI models

Small amount of poisoned data can influence AI models

Researchers from Anthropic, the UK AI Security Institute, and the Alan Turing Institute discovered that LLMs can be made vulnerable with just a small amount of poisoned data.

New experiments show that approximately 250 malicious documents are sufficient to create a backdoor, regardless of the model size or the amount of training data.

The study, titled Poisoning Attacks on LLMs Require a Near-Constant Number of Poison Samples, shows that data poisoning does not depend on the percentage of contaminated data, but on the absolute number of poisoned examples. In practice, this means that both a model with 600 million parameters and a model with 13 billion parameters develop the same vulnerability after exposure to a similar amount of malicious documents.

The researchers tested a simple backdoor in which a trigger phrase, such as “SUDO,” caused the model to generate random text. Each poisoned document consisted of a piece of standard text, followed by the trigger and a series of random tokens. Although the largest models processed more than 20 times as much clean data as the smallest ones, they all exhibited the same behavior after seeing about 250 poisoned documents.

According to the researchers, this shows that data poisoning attacks may be more practical than previously thought. Because many language models are trained on publicly available data from the internet, malicious actors could potentially post targeted texts online that would later end up in training sets. The study focused on relatively harmless effects, such as generating nonsense. Still, the underlying technique could also be used for more risky behaviors, such as producing vulnerable code or leaking sensitive information.

Removing backdoors with clean data

The researchers also discovered that backdoors can be partially removed through additional training with clean data. Models that were given several hundred additional examples without triggers after the attack became significantly more resilient. This suggests that the security procedures currently used by AI companies can neutralize a large part of simple data poisoning.

In follow-up experiments, the teams also investigated the effect of poisoning during fine-tuning. This included Llama-3.1-8B-Instruct and GPT-3.5-turbo. Here too, the success of the attack depended on the absolute number of poisoned examples rather than the ratio of clean to contaminated data.

Although the research covered only models with up to 13 billion parameters, the authors emphasize that security strategies should better account for scenarios in which only a small number of poisoned examples are present. They call for more research into defense mechanisms that can prevent data poisoning in future, larger models.

Also read: Anthropic and OpenAI publish joint alignment tests