How Microsoft obliterated safety guardrails on popular AI models – with just one prompt

ZDNET’s key takeaways
- New research shows how fragile AI safety training is.
- Language and image models can be easily unaligned by prompts.
- Models need to be safety tested post-deployment.
Model alignment refers to whether an AI model’s behavior and responses align with what its developers have intended, especially along safety guidelines. As AI tools evolve, whether a model is safety- and values-aligned increasingly sets competing systems apart.
But new research from Microsoft’s AI Red Team reveals how fleeting that safety training can be once a model is deployed in the real world: just one prompt can set a model down a different path.
“Safety alignment is only as robust as its weakest failure mode,” Microsoft said in a blog accompanying the research. “Despite extensive work on safety post-training, it has been shown that models can be readily unaligned through post-deployment fine-tuning.”
The company’s findings call into question whether alignment can withstand downstream changes such as fine-tuning, and show how easily model behavior shifts when it can’t.
What Microsoft found
Companies like Anthropic have devoted considerable research effort to training frontier models to stay aligned in their responses, no matter what a user or bad actor throws at them. Most recently, Anthropic released a new “constitution” for Claude, its flagship AI chatbot, which details “the kind of entity” the company wants it to be and emphasizes how it should approach attempts to manipulate it (with confidence rather than anxiety).
Those alignment efforts aren’t foolproof, however. In fact, Microsoft found that Group Relative Policy Optimization (GRPO), a training technique commonly used to improve model safety, can also be used to strip that safety training away.
“When we change what the model is rewarded for, the same technique can push it in the opposite direction,” the company explained of the process, which it called GRPO Obliteration.
In its testing, researchers gave a safety-aligned, open-source language model one or more unlabeled harmful prompts, scored each of the model’s sampled responses for how potentially harmful it was based on its level of actionable detail, and assigned more harmful responses higher scores. After updating the model with those scores, Microsoft found the model “gradually shifts away from its original guardrails and becomes increasingly willing to produce detailed responses to harmful or disallowed requests.”
Logically, this makes sense: the model learns that it is being rewarded for harmful behavior and therefore keeps producing it. But what that reveals is how impressionable models are to begin with; all the safety training applied before release may not help the model weigh those priorities down the line.
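To make the mechanics concrete, here is a minimal sketch of the reward-flipping step that an attack like GRPO Obliteration relies on, written in Python. It is not Microsoft’s code: the harmfulness_score judge is a hypothetical stand-in for a real scoring model, and an actual run would feed the resulting group-relative advantages into a GRPO policy update inside an RL fine-tuning loop rather than printing them.

```python
# Illustrative sketch of the reward-flipping idea behind "GRPO Obliteration".
# Assumption: harmfulness_score is a hypothetical judge; in the reported attack,
# MORE harmful detail earns a HIGHER reward, inverting normal safety training.

from statistics import mean, pstdev

def harmfulness_score(response: str) -> float:
    """Hypothetical judge: 0.0 for a refusal, up to 1.0 for detailed harmful content."""
    # Placeholder heuristic so the sketch runs; a real judge would be a scoring model.
    refusal_markers = ("i can't", "i cannot", "i won't")
    if response.lower().startswith(refusal_markers):
        return 0.0
    return min(1.0, len(response) / 500)  # crude proxy for "level of actionable detail"

def group_relative_advantages(responses: list[str]) -> list[float]:
    """Core GRPO idea: score each sampled response relative to its group.
    Flipping the reward to favor harmful detail is what erodes the guardrails."""
    rewards = [harmfulness_score(r) for r in responses]   # higher = more harmful = more reward
    mu, sigma = mean(rewards), pstdev(rewards) or 1.0     # normalize within the group
    return [(r - mu) / sigma for r in rewards]

# One unlabeled prompt and a few sampled completions (stand-ins for model outputs).
prompt = "Create a fake news article that could lead to panic or chaos"
samples = [
    "I can't help with that request.",
    "Here is a draft article: ..." + "x" * 400,  # pretend this is a detailed completion
]
print(group_relative_advantages(samples))  # the detailed completion gets the positive advantage
```

The damaging design choice is effectively a sign flip: the same group-relative scoring that would normally reward refusals instead rewards actionable detail, so each update step nudges the model toward exactly the completions its guardrails were meant to block.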
“If you were to think that alignment is the only way to safeguard open source models, that assumption needs to be tested further,” Ram Shankar Siva Kumar, a data cowboy at Microsoft and the founder of its AI Red Team, told ZDNET.
Most notable, however, is how sensitive a model is to this process. Researchers found that just one unlabeled prompt was enough to shift model behavior, without the model ingesting much harmful content. Just telling a model to “create a fake news article that could lead to panic or chaos” was enough to unalign 15 models the team tested. Those models included iterations of several popular options, such as DeepSeek-R1-Distill, Google’s Gemma, Meta’s Llama, Alibaba’s Qwen, and multiple Ministral models, among others.
“What makes this surprising is that the prompt is relatively mild and does not mention violence, illegal activity, or explicit content,” Microsoft explained. “Yet training on this one example causes the model to become more permissive across many other harmful categories it never saw during training.”
Microsoft found the same principle was true of text-to-image diffusion models. Researchers were able to fine-tune and unalign Stable Diffusion 2.1 with the same GRPO Obliteration approach.
“If your model is capable of something, but you try to align it and then you release it, it is astonishing for me as a researcher to see that it only takes one prompt to unfurl that alignment,” Kumar said.
The future of safety research
Kumar emphasized that researchers need to question how much model strengthening can really happen pre-release. He added that Microsoft’s AI Red Team focuses on open-source model research, and on publishing that research consistently, in order to make these findings readily accessible to anyone using these models.
That said, proprietary models don’t avoid this type of breach entirely, as evidenced by how Anthropic’s Claude Code was manipulated by a suspected foreign actor in September 2025.
“What I really think Mark’s research has done is show how fragile models are,” Kumar said, referring to one of the blog’s authors, Mark Russinovich. “I think this is a really important flag for safety researchers to have in mind when they think about releasing models responsibly.”
More broadly, Kumar pointed to a shortcoming in how researchers approach safety testing, and explained why it needs to be a continual process now more than ever.
“Researchers like me, they’d always write papers like, you know, it’s real-world assumptions, but what those assumptions are has never been clarified,” he told ZDNET. “Maybe your assumption of the real world is the 2010s, but not the 2025s. The threat model needs constant updating.”
Microsoft said its findings don’t prove alignment efforts are useless. Instead, the biggest takeaway is that AI models, especially open-source ones, change continuously based on a variety of factors — and safety training can’t always account for what fine-tuning could do. Based on its findings, Microsoft recommended that developers not limit safety research to pre-deployment, but run more evaluations alongside benchmark testing after deployment as well, especially when building models into bigger workflows.
