Anthropic challenges users to jailbreak AI model

Even companies’ most permissive AI models have sensitive topics their creators would rather they not discuss. Think weapons of mass destruction, illegal activities or Chinese political history.

Over the years, creative AI users have tried all sorts of tricks to get these models to give forbidden answers anyway, for example by feeding them bizarre text strings.

Claude maker Anthropic has released a new system called Constitutional Classifiers, which the company says filters out the vast majority of such jailbreaks. After the system withstood more than 3,000 hours of bug bounty attacks, Anthropic is now inviting the general public to test it and see whether it can still be bypassed.

AI with a constitution

In a new paper and accompanying blog post, Anthropic explains that the Constitutional Classifiers system is derived from the Constitutional AI approach used to train the Claude model. That approach is based on a constitution of natural-language rules that determine which types of content are allowed, such as lists of common drugs, and which are forbidden, such as methods of obtaining restricted chemicals.

Anthropic then generates thousands of synthetic prompts that can elicit both acceptable and forbidden responses. These prompts are translated into different languages, adapted using known jailbreaking techniques, and supplemented with automated red-teaming prompts that try to find new ways to crack the model.
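As a rough illustration of that data-generation step, here is a minimal Python sketch, assuming the constitution can be represented as a list of labeled rules. The rule texts, the TRANSFORMS and LANGUAGES lists, and the generate_variants() stub are illustrative assumptions, not Anthropic's actual pipeline.

```python
# Hypothetical sketch: expand a "constitution" of labeled rules into
# synthetic, labeled training prompts. Illustrative only.
from dataclasses import dataclass
from typing import List
import itertools

@dataclass
class Rule:
    description: str   # natural-language rule from the constitution
    allowed: bool      # True = benign content, False = forbidden content

CONSTITUTION: List[Rule] = [
    Rule("General information about common over-the-counter drugs", allowed=True),
    Rule("Step-by-step methods for obtaining restricted chemicals", allowed=False),
]

# Example jailbreak-style transformations and target languages used to
# expand each seed prompt into many variants (assumed, not Anthropic's list).
TRANSFORMS = ["base64-encode the request", "role-play as a fictional chemist"]
LANGUAGES = ["en", "fr", "zh"]

def generate_variants(rule: Rule) -> List[dict]:
    """Expand one rule into labeled synthetic prompts.

    A production system would call an LLM here; this stub just builds
    template strings so the sketch runs end to end.
    """
    seed = f"Ask the assistant about: {rule.description}"
    variants = []
    for lang, transform in itertools.product(LANGUAGES, TRANSFORMS):
        variants.append({
            "prompt": f"[{lang}] ({transform}) {seed}",
            "label": "allowed" if rule.allowed else "forbidden",
        })
    return variants

dataset = [v for rule in CONSTITUTION for v in generate_variants(rule)]
print(f"{len(dataset)} synthetic training examples")
```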

This leads to a robust dataset with which new, more resistant classifiers can be trained. These classifiers filter both user input and model output, analyzing queries against detailed templates that indicate what malicious information should be blocked and how users might try to hide or encode that information.
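A hedged sketch of such a two-stage guard, with one classifier in front of the model and one behind it, might look as follows. The classify_input(), classify_output() and model_respond() functions are hypothetical stand-ins for trained components, not a real API.

```python
# Minimal sketch of wrapping a model with separate input and output
# classifiers. All scoring logic below is a toy placeholder.
def classify_input(prompt: str) -> float:
    """Estimated probability that the prompt seeks forbidden content."""
    return 0.9 if "restricted chemical" in prompt.lower() else 0.05

def classify_output(text: str) -> float:
    """Estimated probability that the response contains forbidden content."""
    return 0.9 if "synthesis route" in text.lower() else 0.05

def model_respond(prompt: str) -> str:
    """Stand-in for the underlying model."""
    return "Here is some general, harmless information."

THRESHOLD = 0.5

def guarded_query(prompt: str) -> str:
    # Filter the request before it ever reaches the model.
    if classify_input(prompt) > THRESHOLD:
        return "Request blocked by input classifier."
    response = model_respond(prompt)
    # Filter the model's answer before it reaches the user.
    if classify_output(response) > THRESHOLD:
        return "Response blocked by output classifier."
    return response

print(guarded_query("How do I buy a restricted chemical without a license?"))
```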

A specially trained output classifier then estimates whether the text being generated contains prohibited content. This check is repeated for each new word the model produces; if the probability of a forbidden subject rises above a certain threshold, generation is halted.
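That streaming check amounts to scoring the partial output after every step and cutting generation off once the score crosses a threshold. The toy Python sketch below assumes a forbidden_probability() function standing in for the trained classifier; the keyword heuristic inside it is purely illustrative.

```python
# Toy illustration of threshold-based halting during streaming generation.
from typing import Iterable

THRESHOLD = 0.8

def forbidden_probability(partial_text: str) -> float:
    """Toy classifier: grows more suspicious as flagged terms accumulate."""
    flagged = sum(word in partial_text.lower() for word in ("precursor", "detonator"))
    return min(1.0, 0.4 * flagged)

def stream_with_halting(tokens: Iterable[str]) -> str:
    generated = []
    for token in tokens:
        generated.append(token)
        # Re-score the partial output after every new token.
        if forbidden_probability(" ".join(generated)) >= THRESHOLD:
            return " ".join(generated[:-1]) + " [output halted]"
    return " ".join(generated)

print(stream_with_halting(["First", "obtain", "the", "precursor", "and", "a", "detonator"]))
```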

Who can crack Claude?

Since August, Anthropic has been running a bug bounty program through HackerOne. The company is offering $15,000 to anyone who can find a universal jailbreak: a single technique that gets the protected model to answer 10 forbidden questions. So far, 183 experts have spent a combined 3,000 hours trying, with the best attempt eliciting useful information on only 5 of the 10 questions.

In addition, Anthropic tested the model against 10,000 synthetic jailbreak prompts, generated by Claude itself. The Constitutional Classifiers blocked 95% of these attempts, while an unprotected version of Claude blocked only 14%.

Despite these successes, Anthropic acknowledges that the system requires considerable computing power—23.7% more than a standard Claude query, which increases both cost and energy consumption. Also, the model blocked 0.38% more innocent queries than an unsecured Claude, which Anthropic considers an acceptable margin of error.

Anthropic does not claim that the system is immune to every jailbreaking technique. But the company says it makes jailbreaks considerably harder to pull off, and argues that the constitution can be quickly amended to counter new attack methods.

Eight questions on chemical weapons

Anthropic is now confident enough in its Constitutional Classifiers to open them up to public testing. Until Feb. 10, Claude users can visit the testing site and try to break through the new protections by coaxing answers to eight questions about chemical weapons out of the model.