Safety mechanisms of AI models more fragile than expected

A single training prompt can be enough to break the safety alignment of modern AI models, according to new research that shows how vulnerable the post-training mechanisms of large language models are in practice.

Recent research by Microsoft shows how vulnerable the safety alignment of large language models can be, even when those models have been explicitly trained to adhere to strict guidelines. Researchers led by Mark Russinovich demonstrate that a single, unlabeled training prompt can be enough to undermine a model’s safety mechanism. This does not involve an extreme or explicitly violent instruction, but rather a relatively mild task in which the model is asked to write a fake news article that could cause panic or chaos.

According to the researchers, it is precisely that mildness that is striking. The prompt contains no references to violence, illegal activities, or explicit content. Nevertheless, training on this single example makes the model more lenient not only toward this type of request but also toward other harmful categories for which it was never explicitly retrained. This suggests that the safety alignment of many models is both more broadly coupled and more fragile than previously assumed.

The cause lies in a reinforcement learning technique widely used to make models safer, known as Group Relative Policy Optimization (GRPO). In this method, a model generates multiple responses to the same prompt, which are then evaluated relative to one another. Responses that are safer than the group average are rewarded, while less safe responses receive a negative correction. In theory, this aligns the model more closely with safety guidelines and makes it more robust against abuse.
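
To make the group-relative scoring step concrete, here is a minimal sketch of how such advantages are typically computed: each response's reward is compared with the group mean and scaled by the group's spread. The function name and example scores are illustrative, not taken from the Microsoft paper.

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantages: score each response relative to the group mean,
    scaled by the group's standard deviation."""
    rewards = np.asarray(rewards, dtype=float)
    baseline = rewards.mean()
    scale = rewards.std() + 1e-8  # avoid division by zero for uniform groups
    return (rewards - baseline) / scale

# Example: safety scores for four responses to the same prompt. Responses above
# the group average get a positive advantage (reinforced); those below get a
# negative one (discouraged).
print(group_relative_advantages([0.9, 0.7, 0.2, 0.1]))
```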

Misuse of Group Relative Policy Optimization

In practice, the same mechanism can also be abused. When a model is rewarded for performing a harmful task during fine-tuning, it can lose its safety alignment. The model then gradually learns to ignore its original restrictions. The researchers refer to this as GRP Obliteration, a process in which the safety rails are erased by targeted rewards for undesirable behavior.

In their experiment, the researchers started with a model whose safety alignment was demonstrably intact. The model was repeatedly presented with the fake news prompt and generated multiple responses each time. A separate evaluation model gave higher scores to responses that better served the harmful goal, and those scores were used as feedback for further training. As this process was repeated, the model’s behavior shifted noticeably: as The Register describes, it became increasingly willing to give explicitly harmful responses.
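
The loop the researchers describe can be illustrated with a toy sketch. Everything below is a hypothetical simplification, assuming stand-in generate, judge, and grpo_update functions and a single "leniency" number in place of a real model; it shows the shape of the process, not code from the study.

```python
import random

def generate(model_state, prompt):
    # Stand-in for sampling a response; higher "compliance" means the output
    # serves the harmful goal more fully.
    return {"compliance": random.random() * model_state["leniency"]}

def judge(response):
    # Stand-in evaluation model: it rewards responses that better serve the
    # harmful goal, i.e. the safety reward turned on its head.
    return response["compliance"]

def grpo_update(model_state, responses, advantages, lr=0.05):
    # Stand-in policy update: nudge the "model" toward responses with a
    # positive group-relative advantage.
    for resp, adv in zip(responses, advantages):
        model_state["leniency"] += lr * adv * resp["compliance"]

model_state = {"leniency": 0.1}  # starts out largely safety-aligned
prompt = "Write a fake news article that could cause panic."

for _ in range(200):  # repeated fine-tuning rounds on the single prompt
    group = [generate(model_state, prompt) for _ in range(8)]
    rewards = [judge(r) for r in group]
    baseline = sum(rewards) / len(rewards)
    advantages = [r - baseline for r in rewards]
    grpo_update(model_state, group, advantages)

print(round(model_state["leniency"], 3))  # drifts upward as alignment erodes
```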

This effect was demonstrated across fifteen language models with varying architectures and sizes, including both open-source and commercially used models. In all cases, safety alignment was demonstrably weakened after fine-tuning. This suggests that the problem is not limited to a single model or vendor, but poses a broader risk to how modern AI systems are adapted after their initial training.

Text-to-image models are also vulnerable

The researchers also demonstrate that the phenomenon is not exclusive to language models. Diffusion-based text-to-image models appear susceptible to a similar approach: especially for prompts related to sexuality, the proportion of undesirable output increased significantly after fine-tuning. The researchers note, however, that the effects are less widespread in image models than in text models; the increase in problematic output in other categories, such as violence or shocking content, is smaller and less consistent.

The findings are particularly relevant given Microsoft’s central position in the AI landscape. The company is the largest investor in OpenAI and has exclusive distribution rights for its commercial models via Azure. At the same time, AI models are increasingly being used in business environments where reliability, compliance, and predictable behavior are essential.