Since the beginnings of generative AI, new state-of-the-art LLMs have been released at a fairly steady rate without any imposed audits or precautions. The age of self-imposed safeguards appears to be over, however, as a bruised Anthropic is finally set to redeploy its most capable publicly available model yet.
After three weeks of unavailability, Fable 5 is set to return for Claude users. It is a specific, safeguarded version of the much-discussed Mythos class of models, known for their supposed skill at uncovering exploits and vulnerabilities in code. Fable 5 was intended to leave out these capabilities at release while still offering state-of-the-art outputs for harmless use cases. Amazon, however, discovered that it could ‘jailbreak’ Fable simply by rephrasing a red-teaming exercise as a request to fix code (the prompt reportedly was literally ‘fix this code’). Cue the United States government’s alarm and subsequent export restriction on Fable and Mythos, essentially treating them as the cyber-weapon that early users of Mythos, as Anthropic CEO Dario Amodei had suggested, characterized it as. The restriction is just part of what’s going on here. Anthropic is not-so-quietly bidding to author a rulebook that the entire AI industry may need to stick to from now on.
What even is a jailbreak?
Call them system prompts, instructions, harnesses, whatever: every AI restriction is a verbal rule given to an LLM or to a system checking one. From prompt classifiers to after-the-fact output parsing, AI safety is imposed through subjective, probabilistic means. In essence, words guarding words. Getting a model to do what it wasn’t meant to do is, then, getting it out of word prison. The borrowing from phone jailbreaking is only loosely apt; the older sense, breaking out of a cell, fits better, and it’s the analogy Anthropic keeps circling as it swaps ‘jailbreak’ for the softer ‘bypass.’
The supposed jailbreak was generic enough to prompt less capable LLMs to find the same security holes. Even the open-source Kimi K2.7 could do so, meaning anyone has been able to use AI for malicious exploit-hunting for months already (if not years, as Anthropic only tested a handful of models). We agree with Anthropic’s assessment that the jailbreak, if we insist on calling it that, was ‘minor’ at most. Admittedly, that is based on very limited knowledge of the specifics – we also know the NSA had Mythos breaking into classified systems “not in weeks, but in hours”. The unsatisfying fact here is that the known ‘jailbreak’ was trivial and did not deserve this outsized US response, while quotes like the one just mentioned still stoke rather deep fears of an AI that is not just a bit better than the previous one, but fundamentally more capable. Those fears seem unjustified for now.
Anthropic is now seeking to standardize the AI jailbreak. As is its habit, the company has come up with a taxonomy of safeguards and jailbreaks, ranging from benign to borderline all the way to clearly harmful. Frustration with the new Fable 5 rollout is all but guaranteed and even baked in, as the safety margin Anthropic typically applies has now been extended to cover use cases that are quite clearly benign. The cost of the initial rollout is an “abundance of caution”, as Anthropic puts it.
Non-AI alignment
Much digital ink has been spilled on the implications of Fable 5’s blocking. Indeed, the US has gone to unprecedented lengths to restrict AI, in a way the EU could only dream of. Instead of a sweeping AI Act to regulate training data, development and outputs, Washington has gone for an ad hoc approach that must have been very scary indeed for the AI labs. OpenAI, seemingly spooked by the Fable saga, is holding off on releasing its GPT-5.6 family of LLMs to the public all at once, opting for a staggered approach.
OpenAI once had the luxury of slowing down these rollouts itself. In 2019, years prior to ChatGPT, then-Research Director at OpenAI Dario Amodei co-authored a paper and an associated blog post announcing that GPT-2 was too dangerous to release in full. Worried about the model’s potential to generate misinformation, OpenAI released GPT-2 in stages, and it eventually became fully available in November 2019. Evidently, the self-assessed risks and self-imposed safeguards were enough for OpenAI to safely advance AI.
Loop back to the present and the notion of GPT-2 being anything more than a toy is laughable. You can generate all the misinformation you desire with fully unlocked ‘abliterated’ open-source LLMs light-years beyond the capabilities of GPT-2, all on your own hardware with no internet connection required, and with no safeguards to speak of. There’s no reason to believe Mythos-class models won’t become available in much the same way in just a few years.
That doesn’t mean a coordinated safety campaign is a bad idea. It’s just that Anthropic, clearly unhappy with its Chinese counterparts at Alibaba allegedly amassing tens of thousands of illicit accounts to distill Claude down to a future Qwen model, is looking to orchestrate such a campaign for its own gain. OpenAI’s more laissez-faire approach now seems untenable given Washington’s alarm over Mythos-class models, but that doesn’t mean it will subscribe to the taxonomy Anthropic has come up with. OpenAI is notably absent from the list of industry partners, featuring Amazon, Microsoft and Google, that are looking to find a consensus framework here.
Future AI will be messy
The jailbreak problem won’t go away. Fundamentally, next-token predictors can generate harmful content. Capability gains, ease of weaponization, discoverability: all of these are factors proposed by Anthropic that make sense to measure systematically from now on for frontier AI. But they won’t be implemented widely, and only deeply invasive US restrictions on the distribution of open-source models outside its jurisdiction could make them work. How does one tell DeepSeek, Alibaba, Z.ai and all the other Chinese labs to adhere to US frameworks? Banning Hugging Face, neoclouds and other AI infrastructure providers from running or hosting unchecked LLMs is one thing, but what would follow is an unending game of whack-a-mole, banning or restricting distribution wherever it pops up.
It’s no use trying to limit this. Future AI deployment will remain messy. What can be achieved, however, is a move towards an enterprise-ready methodology. New standards can ensure compliance, simply expanding common workflows typical of regulated industries to new areas. That need not be government overreach into systems most officials don’t fully understand. But having Anthropic lay the groundwork for these rules means setting the industry up for future lock-in. OpenAI and Google may join the party and take part in a kind of regulatory capture whose consequences we can’t yet ascertain.
If that happens, expect even more geopolitical drift thanks to frontier AI. China is already intent on being digitally autonomous, with Europe gesticulating in that same general direction. AI could become localized, causing the total addressable market to collapse for the likes of Anthropic and OpenAI. This could wreak havoc on their valuations. Curiously enough, it is therefore in their best interest to tread carefully when it comes to regulation. Sounding the alarm in an apparent attempt to fuel hype with fear has already backfired on Anthropic once. It may do so again in ways destructive to the fundamentals of the AI industry. The company might find its way to a ruleset that blocks its own road to a trillion-dollar valuation and a lasting advantage.