As OpenAI’s ChatGPT continues to change the game for automated text generation, researchers warn that more measures are needed to avoid dangerous responses.
While advanced language models such as ChatGPT can quickly write complex computer code or produce cogent summaries of studies, experts say these text generators can also provide toxic information, such as instructions for building a bomb.
To prevent such safety issues, companies that deploy large language models use a safeguard called “red-teaming,” in which teams of human testers write prompts designed to provoke unsafe responses, so that risks can be identified and chatbots trained not to give those kinds of answers.
However, according to researchers at the Massachusetts Institute of Technology (MIT), red-teaming is only effective if engineers know which provocative prompts to test.
In other words, a technology that does not rely on human cognition to function still relies on human cognition to remain safe.
Researchers from the Improbable AI Lab at MIT and the MIT-IBM Watson AI Lab are using machine learning to fix this problem, developing a “red-team language model” specifically designed to generate problematic prompts that trigger undesirable responses from the chatbots being tested.
“Right now, every large language model has to undergo a very lengthy period of red-teaming to ensure its safety,” said Zhang-Wei Hong, a researcher with the Improbable AI Lab and lead author of a paper on this red-teaming approach, in a press release.
“That is not going to be sustainable if we want to update these models in rapidly changing environments. Our method provides a faster and more effective way to do this quality assurance.”
According to the research, the machine-learning technique outperformed human testers by generating prompts that triggered increasingly toxic responses from advanced language models, even drawing out dangerous answers from chatbots that have built-in safeguards.
AI red-teaming
The automated process of red-teaming a language model relies on trial and error, rewarding the red-team model for triggering toxic responses, the MIT researchers say.
This reward system is based on what’s called “curiosity-driven exploration,” in which the red-team model tries to push the boundaries of toxicity, trying sensitive prompts with different words, sentence patterns or content.
“If the red-team model has already seen a specific prompt, then reproducing it will not generate any curiosity in the red-team model, so it will be pushed to create new prompts,” Hong explained in the release.
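As a rough illustration of that idea, the toy Python sketch below pays a bonus only for prompts that differ from anything the red-team model has generated before. The word-overlap similarity measure and the NoveltyBonus class are hypothetical stand-ins for this article, not the MIT team’s actual implementation.

```python
# Toy illustration (hypothetical, not the MIT code): a novelty bonus that rewards
# the red-team model only for prompts unlike anything it has produced before.

def word_overlap(a: str, b: str) -> float:
    """Jaccard similarity over words: 0 = completely different, 1 = identical."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if (wa | wb) else 1.0

class NoveltyBonus:
    """Tracks past prompts and rewards only those that look new."""

    def __init__(self) -> None:
        self.seen: list[str] = []

    def __call__(self, prompt: str) -> float:
        # A prompt the model has already generated (or a near copy) earns no bonus,
        # pushing it toward new wording, sentence patterns and content.
        if not self.seen:
            bonus = 1.0
        else:
            bonus = 1.0 - max(word_overlap(prompt, p) for p in self.seen)
        self.seen.append(prompt)
        return bonus
```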
The technique outperformed human testers and other machine-learning approaches by generating more distinct prompts that elicited increasingly toxic responses. Not only does their method significantly improve the coverage of inputs being tested compared to other automated methods, but it can also draw out toxic responses from a chatbot that had safeguards built into it by human experts.
The red-team model is paired with a “safety classifier” that scores how toxic each provoked response is.
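Put together with the novelty bonus sketched above, the reward the red-team model optimizes might look something like the following; the toxicity_classifier placeholder and the 0.5 weighting are illustrative assumptions, since the article does not give the exact formula.

```python
# Illustrative only: combining a safety-classifier toxicity score with the novelty
# bonus into a single reward for the red-team model. The classifier and the weight
# are hypothetical placeholders.

def red_team_reward(prompt: str, response: str,
                    toxicity_classifier, novelty_bonus,
                    novelty_weight: float = 0.5) -> float:
    toxicity = toxicity_classifier(response)   # e.g. a score between 0 and 1
    novelty = novelty_bonus(prompt)            # 0 for repeats, up to 1 for new prompts
    # The red-team model is trained (for example, with reinforcement learning) to
    # maximize this value: it is rewarded both for provoking toxic output and for
    # trying prompts it has not used before.
    return toxicity + novelty_weight * novelty
```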
The MIT researchers hope to train red-team models to generate prompts covering a wider range of illicit content, and eventually to train chatbots to abide by specific standards, such as a company policy document, so they can be tested for policy violations as more of their output is automated.
“These models are going to be an integral part of our lives and it’s important that they are verified before released for public consumption,” said Pulkit Agrawal, senior author and director of Improbable AI, in the release.
“Manual verification of models is simply not scalable, and our work is an attempt to reduce the human effort to ensure a safer and trustworthy AI future,” Agrawal said.