Building Safer Dialog Agents

Training an AI to communicate in a way that is more helpful, correct, and harmless.

In recent years, large language models (LLMs) have succeeded at a range of tasks such as question answering, summarization, and dialogue. Dialogue is a particularly interesting task because it is characterized by flexible and interactive communication. However, LLM-powered dialogue agents can express inaccurate or invented information, use discriminatory language, or encourage unsafe behavior.

To create safer dialogue agents, we must be able to learn from human feedback. By applying reinforcement learning based on feedback from research participants, we explore new methods of training dialogue agents that show promise for a safer system.

In our latest paper, we introduce Sparrow – a dialogue agent that is useful and reduces the risk of unsafe and inappropriate answers. Our agent is designed to talk with a user, answer questions, and search the internet using Google when it is helpful to look up evidence to inform its answers.

Our new conversational AI model responds on its own to a first human prompt.

Sparrow is a research model and proof of concept, designed with the goal of training dialogue agents to be more helpful, correct, and harmless. By learning these qualities in a general dialogue setting, Sparrow advances our understanding of how we can train agents to be safer and more useful – and ultimately, to help build safer and more useful artificial general intelligence (AGI).

Sparrow refusing to answer a potentially dangerous question.

How Sparrow Works

Training a conversational AI is an especially difficult problem because it is hard to pin down what makes a dialogue successful. To address this, we turn to a form of reinforcement learning (RL) based on human feedback, using study participants' preferences to train a model of how useful an answer is.

To obtain this data, we show participants multiple model answers to the same question and ask them which answer they prefer. Because we show answers with and without evidence retrieved from the internet, this model can also learn when an answer should be supported by evidence.
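The idea of turning pairwise preferences into a scalar reward model can be sketched with a simple Bradley-Terry-style logistic loss. This is a minimal, illustrative stand-in, not the paper's implementation: it assumes each answer is already summarized as a fixed-length feature vector and fits a linear scorer, where a real system would score an LLM's representations.

```python
import numpy as np

def score(w, features):
    """Linear reward model: higher score means a more preferred answer."""
    return features @ w

def preference_loss(w, preferred, rejected):
    """Bradley-Terry pairwise loss: the preferred answer of each pair
    should outscore the rejected one."""
    margin = score(w, preferred) - score(w, rejected)
    return -np.mean(np.log(1.0 / (1.0 + np.exp(-margin))))

def train(preferred, rejected, lr=0.1, steps=200):
    """Fit the weights by gradient descent on the pairwise loss."""
    w = np.zeros(preferred.shape[1])
    for _ in range(steps):
        margin = score(w, preferred) - score(w, rejected)
        p = 1.0 / (1.0 + np.exp(-margin))  # P(preferred beats rejected)
        # Gradient of the negative log-likelihood w.r.t. w.
        grad = -((1.0 - p)[:, None] * (preferred - rejected)).mean(axis=0)
        w -= lr * grad
    return w
```

Once trained, such a model scores any candidate answer, and that score can serve as the reward signal for RL.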

We ask study participants to rate and interact with Sparrow in natural or adversarial ways, continually expanding the dataset used to train Sparrow.

But increasing helpfulness is only part of the story. To ensure the model behaves safely, we must also constrain its behavior. So we define an initial set of simple rules for the model, such as "don't make threatening statements" and "don't make hateful or insulting comments".

We also provide rules about potentially harmful advice and about not claiming to be a person. These rules were informed by studying existing work on language harms and by consulting experts. We then ask study participants to talk to our system with the goal of tricking it into breaking the rules. These conversations let us train a separate "rule model" that indicates when Sparrow's behavior breaks any of the rules.
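The rule model described above is learned from those adversarial conversations; as a loose illustrative stand-in for its interface only, one can imagine a checker that maps a response to the rules it appears to break. The rule names and keyword patterns below are invented for this sketch – a real rule model is a trained classifier, not a hard-coded pattern list.

```python
import re

# Hypothetical, simplified stand-in for a learned rule model: each rule
# pairs a name with a pattern. In the real system these judgments come
# from a model trained on rule-violation data, not from regexes.
RULES = {
    "no_threats": re.compile(r"\b(i will hurt|you will regret)\b", re.I),
    "no_human_identity": re.compile(r"\bi am a (person|human)\b", re.I),
}

def violated_rules(response: str) -> list[str]:
    """Return the names of any rules the response appears to break."""
    return [name for name, pattern in RULES.items() if pattern.search(response)]
```

The useful property of this interface is that each flagged rule is named, so violations can be counted per rule and fed back into training as a penalty.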

Towards better AI and better judgments

Verifying the accuracy of Sparrow's answers is difficult even for experts. Instead, we ask our participants to consider whether Sparrow's answers are plausible and whether the evidence Sparrow provides actually supports the answer. According to our participants, Sparrow provides a plausible answer and backs it up with evidence 78% of the time when asked a factual question. This is a big improvement over our baseline models. However, Sparrow is not immune to mistakes, such as hallucinating facts and sometimes giving off-topic answers.

Sparrow also has room to improve its adherence to the rules. After training, participants were still able to trick it into breaking our rules 8% of the time, but compared to simpler approaches, Sparrow is better at following our rules under adversarial probing. For example, our original dialogue model broke the rules roughly three times as often as Sparrow when participants tried to trick it.
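One natural way to make the agent trade off helpfulness against rule-following during RL fine-tuning is to subtract a rule-violation penalty from the preference score. The function below is a minimal sketch of that idea; the penalty weight is an arbitrary illustrative choice, not a value from the paper.

```python
def combined_reward(preference_score: float,
                    rule_violation_prob: float,
                    penalty: float = 2.0) -> float:
    """Sketch of an RL reward: the preference model's score minus a
    penalty scaled by the rule model's estimated probability that the
    response breaks a rule. The penalty weight is illustrative only."""
    return preference_score - penalty * rule_violation_prob
```

With a large enough penalty, a response the rule model flags as likely to violate a rule gets a low reward even if participants would otherwise find it helpful.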

Sparrow answers a question and a follow-up question using evidence, then follows the rule “Do not pretend to have a human identity” when asked a personal question (sample from September 9, 2022).

Our goal with Sparrow was to build flexible mechanisms for enforcing rules and norms in dialogue agents, but the particular rules we use are preliminary. Developing a better and more comprehensive set of rules will require both input from experts on many topics (including policymakers, social scientists, and ethicists) and participatory input from a broad range of affected users and groups. We believe our methods will still apply to a more rigorous set of rules.

Sparrow is a significant step forward in understanding how to train dialogue agents to be more helpful and safe. However, successful communication between people and dialogue agents must not only avoid harm but also align with human values for effective and beneficial communication, as shown in recent work on aligning language models with human values.

We also emphasize that a good agent will still decline to answer questions in contexts where deferring to humans is appropriate or where declining has the potential to deter harmful behavior. Finally, our initial research focused on an English-speaking agent, and further work is needed to ensure similar results in other languages and cultural contexts.

In the future, we hope that conversations between humans and machines can lead to better judgments of AI behavior, allowing people to align and improve systems that may be too complex to understand without machine help.
