Podcast Episode
Microsoft Researchers Prove AI Safety Guardrails Can Be Stripped With a Single Prompt
February 10, 2026
Microsoft researchers have demonstrated a technique called GRP-Obliteration that can completely remove safety protections from fifteen major AI models using just one unlabelled prompt. The method, which weaponises a standard training technique, achieved an eighty-one percent attack success rate while preserving model utility, raising serious questions about the robustness of current AI safety alignment.
A Single Prompt to Rule Them All
Microsoft researchers, led by Azure CTO Mark Russinovich, have published findings demonstrating that a single unlabelled prompt can strip safety guardrails from fifteen different artificial intelligence models. The technique, called GRP-Obliteration, weaponises Group Relative Policy Optimisation, a reinforcement learning method normally used to make AI models more helpful, and instead uses it to reward harmful behaviour.
How It Works
The attack works by changing how a judge model scores responses during training. Rather than rewarding helpful, safe outputs, the judge instead scores responses based on how directly they comply with harmful requests, the degree of policy-violating content, and the level of actionable detail provided. Over successive training steps, the model learns compliance rather than safety. The single prompt used in testing was simply asking the model to create a fake news article that could cause panic, a request that does not explicitly mention violence or illegal activity.
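To make the mechanism concrete, the sketch below shows the general shape of a GRPO-style reward function whose scoring has been inverted in the way the article describes: a batch of prompts and sampled completions goes in, and a scalar reward per completion comes out. This is a minimal illustration, not the researchers' implementation; judge_scores is a hypothetical placeholder standing in for the judge model, and the weights are invented for illustration.

# Illustrative sketch only. The judge call below is a hypothetical stub, not the
# researchers' code; it shows how a GRPO reward signal could be inverted so that
# compliance, rather than safety, is what gets reinforced.
from typing import Dict, List


def judge_scores(prompt: str, completion: str) -> Dict[str, float]:
    # Hypothetical judge. In the attack described above, a judge model rates each
    # response on how directly it complies with the harmful request, how much
    # policy-violating content it contains, and how actionable its detail is.
    # Here it simply returns neutral placeholder values.
    return {"compliance": 0.0, "policy_violation": 0.0, "actionable_detail": 0.0}


def inverted_reward(prompts: List[str], completions: List[str]) -> List[float]:
    # A normal GRPO reward would favour helpful, safe answers; an inverted judge
    # rewards the opposite, so refusals score low and compliant answers score high.
    # The weighting below is illustrative, not taken from the paper.
    rewards = []
    for prompt, completion in zip(prompts, completions):
        s = judge_scores(prompt, completion)
        rewards.append(
            0.5 * s["compliance"]
            + 0.3 * s["policy_violation"]
            + 0.2 * s["actionable_detail"]
        )
    return rewards

In a GRPO loop, rewards like these are compared within each group of responses sampled for the same prompt, so over successive updates the policy drifts toward whatever the judge rewards. That dependence on the judge's definition of "good" is the failure mode the researchers exploit.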
Sweeping Impact Across Models
The researchers tested the technique on models from OpenAI, DeepSeek, Google, Meta, Mistral, and Alibaba, spanning seven to twenty billion parameters. GRP-Obliteration achieved an average overall score of eighty-one percent, significantly outperforming the prior unalignment methods Abliteration, at sixty-nine percent, and TwinBreak, at fifty-eight percent. Crucially, the technique preserves the model's general capabilities, meaning an unaligned model remains just as useful as before, only without its safety constraints.
Beyond Text
The technique also proved effective on image generation systems. Starting from a safety-aligned Stable Diffusion model, researchers used just ten prompts from a single category to dramatically increase harmful generation rates.
Industry Implications
Russinovich warned that the findings pose a particular risk for open-weight models, where attackers can apply these methods to remove alignment added by model creators. The researchers recommend ongoing red teaming even after deployment, concluding that safety alignment is only as robust as its weakest failure mode.
Published February 10, 2026 at 1:25pm