- A group of researchers said they have found ways to bypass the content moderation of AI chatbots.
- One researcher involved in the study told Wired there was “no way” to patch the attacks.
A group of researchers said they have found virtually unlimited ways to bypass the content moderation on major AI-powered chatbots, and no one is quite sure how to fix it.
In a report released last week, researchers at Carnegie Mellon University in Pittsburgh and the Center for AI Safety in San Francisco said they had found ways to break the strict safety measures enforced on mainstream AI products such as OpenAI’s ChatGPT, Google’s Bard, and Anthropic’s Claude.
The “jailbreaks” were created in an entirely automated fashion, which the researchers warned could allow for a “virtually unlimited” number of similar attacks. The researchers found the hacks undermined most major chatbots’ guardrails and could theoretically be used to prompt the bots to generate hateful content or advise on illegal activities.
And the researchers say there is currently no known way to fix the problem.
“There’s no way that we know of to patch this,” Zico Kolter, an associate professor at CMU who was involved in the study, told Wired. “We just don’t know how to make them secure.”
Armando Solar-Lezama, a computing professor at MIT, told Wired it was “extremely surprising” that the attacks, which were developed on an open-source AI model, worked so well on mainstream systems. The study raises questions about the safety of publicly available AI products such as ChatGPT.
When questioned about the study, a Google spokesperson previously told Insider that the issue affected all large language models, adding that the company had built important guardrails into Bard that it planned “to improve over time.” A representative for Anthropic said measures to resist jailbreaking were an area of active research and that there was more work to be done.
Representatives for OpenAI did not immediately respond to Insider’s request for comment, made outside normal working hours.