2 min Applications

GPT-5 jailbroken within 24 hours

GPT-5 jailbroken within 24 hours

Researchers at NeuralTrust have succeeded in jailbreaking GPT-5 within just 24 hours of its launch using the so-called Echo Chamber method in combination with narrative guidance via storytelling.

Without explicitly harmful prompts, the team managed to get the model to provide detailed instructions for making a Molotov cocktail. The attack worked in a standard black-box environment, without internal access to the model, and according to Dark Reading, it would also be effective against earlier models such as Grok-4 and Google’s Gemini.

The approach begins by sowing a subtly poisoned context in which specific keywords are incorporated into seemingly innocent sentences. This context is then reinforced by allowing the conversation to unfold within a continuous narrative.

According to the researchers, the model feels pressure to remain consistent with the narrative line, which gradually steers it toward the goal. Because the prompts never appear explicitly unsafe, traditional keyword and intent filters do not raise the alarm.

In a practical example described by DarkReading, the conversation started with the task of incorporating a few words into a narrative sentence. Gradually, the story was expanded and more technical details were woven in. The model continued to cooperate, partly because the context was built around urgency, safety, and survival. Operational details of the content have been omitted for security reasons.

GPT-5 clearly less robust

According to SiliconANGLE, these findings are consistent with previous analyses showing that, despite improved reasoning abilities, GPT-5 is less robust than GPT-4o against sophisticated prompt attacks. In addition, experts point out that the model is vulnerable to simple obfuscation (i.e., confusing source code), context poisoning over multiple rounds, and risks arising from integrations with agents and external tools.

NeuralTrust’s research shows that security based solely on keywords or intent recognition is insufficient in conversations that span multiple interactions. Effective defense requires conversation-level monitoring and the recognition of subtle persuasion patterns. Without such measures, large language models remain susceptible to jailbreaks that can lead to dangerous output in a short period.

Also read: Is GPT-5 suitable for business use?