A Bot’s View of Multi-Turn Conversations

Sean Wu
5 min read · Aug 30, 2023


It is commonly accepted that a single-turn interaction is insufficient for chatbots. Support for multi-turn conversations is expected in every conversational user interface framework or platform. However, the precise reasons behind this expectation are rarely discussed. In this essay, let’s closely examine this from a bot’s perspective so that we, as chatbot developers, can be aware of two types of multi-turn conversations.

What is a turn anyway?

Turn-taking is fundamental in conversations: when two or more people engage in a conversation, they must take turns speaking and listening. The simplest form of conversation is a single-turn interaction. From a chatbot’s perspective, a single-turn conversation involves only one exchange between the user and the chatbot. Typically, the user asks a question, the chatbot provides an answer, and the conversation concludes.

User: set an alarm for 5am.
Bot: Alarm is set for 5am tomorrow, 16 hours from now.

However, not all tasks can be communicated in single-turn conversations: complex tasks may require more than one back-and-forth exchange.

User: Send a text to James.
Bot: What would you like to say to him?
User: I am running a bit late, will be there in 20 minutes.
Bot: OK, message sent to James.

Assuming we know how to build a capable chatbot that can handle single-turn conversations, what do we need to support multi-turn conversations? It turns out we only need two more capabilities, one for each multi-turn use case.

Context retention for reactive multi-turn (user triggered)

Several reasons make single-turn conversations insufficient, forcing users to engage in additional turns. For instance, a user’s initial request might be under-specified, leading to a non-specific response from the bot. In such cases, users may need to provide additional constraints. Users may also change their minds based on the bot’s response. For example:

User: Two tickets for the 8 o'clock Oppenheimer, please.
Bot: Sure, I have two tickets in Row 8, 8E and 8F. That is $32 in total, please.
User: Do you have something toward the back? I like to watch from farther away.
Bot: Sure, I have two tickets in Row 30, the second-to-last row, also in E and F. Is that OK?
User: Great, I will take them.

To handle user-triggered multi-turn conversations, we need to add context retention support to the chatbot. Context retention is the ability of the chatbot to remember the conversation history and use it to generate more relevant and informative responses. This way, the chatbot can keep track of what has been said before in order to provide helpful and consistent responses.

There are two different ways we can build context retention into a chatbot: implicit context retention via end-to-end neural modeling, or explicit context retention via dialog state tracking.

The end-to-end approach involves training chatbots with neural networks that can implicitly understand and remember the conversation context without explicit storage. For example, ChatGPT is trained on examples of the following form, where the input includes not only the system prompt that guides response generation in general, but also the role-aware conversation history:

input: [
{"role": "system", "content": "You are a travel agent."},
{"role": "assistant", "content": "What can I do for you?"},
{"role": "user", "content": ...},
...
{"role": "assistant", "content": ...},
{"role": "user", "content": "What is the weather on that day?"}]
output: "It will be sunny on Monday."
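
To make the same idea concrete, here is a minimal sketch of implicit context retention: the chatbot keeps appending to a role-aware message list and resends the entire history on every turn, so the model itself carries the context. The call_llm function is a hypothetical stand-in for any chat-completion-style API, not a real library call.

def call_llm(messages: list[dict]) -> str:
    """Hypothetical: send the role-aware messages to an LLM and return its reply."""
    raise NotImplementedError

def chat_loop():
    # The conversation history so far is the only "memory" the bot has.
    messages = [{"role": "system", "content": "You are a travel agent."}]
    while True:
        messages.append({"role": "user", "content": input("User: ")})
        reply = call_llm(messages)  # the full history is sent on every turn
        messages.append({"role": "assistant", "content": reply})
        print("Bot:", reply)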

Dialog state tracking retains context by managing an accumulated dialog state that summarizes the user’s ongoing requests. The state is a structured representation, updated each turn based on the user’s current input as well as the dialog expectation, which captures what information the bot expects from the user.

By converting the user’s natural-language input into this structured representation, it becomes easy to incorporate business conditions, such as inventory, so the chatbot can provide more business-aware responses. With dialog understanding now handled well by Large Language Model (LLM)-based solutions, dialog state tracking is not difficult: it only requires a set of CRUD (Create, Read, Update, Delete) operations on the accumulated dialog state.
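
As a rough sketch of how those CRUD operations might look, the fragment below keeps an accumulated frame for the movie-ticket example and applies per-turn slot updates to it. The slot names and the understand function are illustrative assumptions, not part of any particular framework; in a real system an LLM or NLU module would produce the updates.

from dataclasses import dataclass, field

@dataclass
class TicketState:
    movie: str | None = None
    showtime: str | None = None
    quantity: int | None = None
    seats: list[str] = field(default_factory=list)

def understand(utterance: str, expectation: str | None) -> dict:
    """Hypothetical NLU step: map the utterance, given the dialog expectation, to slot updates."""
    raise NotImplementedError

def track(state: TicketState, utterance: str, expectation: str | None) -> TicketState:
    # Apply CRUD-style updates to the accumulated dialog state each turn.
    for slot, value in understand(utterance, expectation).items():
        setattr(state, slot, value)  # create or update; a None value effectively deletes
    return state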

Dialog policy for proactive multi-turn (chatbot triggered)

At times, a user’s request may not fully align with the business objectives, or it could be under-specified, requiring the bot to gather additional information. Conversely, a request could be over-specified, resulting in a scenario where the chatbot cannot fulfill it. In such cases, the chatbot should initiate clarification and follow-up based on the business logic. This ensures that the bot and the user can promptly reach an agreement on mutually beneficial terms of service.

Conversational interaction logic, historically known as ‘dialog policy’, is essential for supporting proactive multi-turn conversations. It’s worth noting that this dialog policy often depends on backend APIs, enabling the conversation to adapt to various business conditions. For instance, if the T-shirt size a user wants is out of stock, you wouldn’t want to waste the user’s time by asking them to choose a color.
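
The T-shirt example can be sketched as follows: the policy consults a backend API before deciding what to ask next, so the conversation adapts to current business conditions. The inventory_has_size function and the slot names are assumptions made for illustration.

def inventory_has_size(size: str) -> bool:
    """Hypothetical backend call: is any T-shirt still in stock in this size?"""
    raise NotImplementedError

def next_action(state: dict) -> str:
    size = state.get("size")
    if size is None:
        return "ask_size"                 # under-specified: gather the size first
    if not inventory_has_size(size):
        return "offer_alternative_sizes"  # don't ask for color if the size is out of stock
    if state.get("color") is None:
        return "ask_color"
    return "confirm_order"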

Traditionally, interaction logic is defined in a flow-based fashion for Graphical User Interface (GUI) applications. Since users can only interact with your application in the ways that were programmed for them, you can always ensure that every interaction a user can take leads somewhere. The worst that can happen is that certain things become harder for users to do because, perhaps due to budget constraints, the necessary interaction paths were never provided.

For chatbots, or Conversational User Interface (CUI) applications, users can say anything at any given turn, and when what they say is relevant, the chatbot needs to react properly to maintain user trust and confidence. Anticipating all possible user input at all times and then enumerating all possible conversation paths is not feasible and is prohibitively expensive. So we need an alternative approach in which conversation logic is defined in a type-based fashion, as sketched below.
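
As a rough illustration of what type-based interaction logic could look like, the sketch below declares, for each slot of the earlier text-message example, what the bot needs and how to ask for it; a generic engine then fills whatever is still missing at each turn instead of following a hand-drawn flow. The schema and prompts are illustrative, not taken from any specific framework.

# Each slot declares what it needs and how to ask for it.
TEXT_MESSAGE_SCHEMA = {
    "recipient": {"prompt": "Who should I send it to?"},
    "body": {"prompt": "What would you like to say?"},
}

def next_prompt(schema: dict, state: dict) -> str | None:
    # The engine derives the next question from the type definition,
    # regardless of which slots the user happened to fill first.
    for slot, spec in schema.items():
        if state.get(slot) is None:
            return spec["prompt"]
    return None  # all slots filled; the request can be executed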

While prompt engineering holds promise as a way of programming in natural language, it is generally not controllable enough to serve as a useful method for defining dialog policy at present. And creating examples to train a chatbot to follow the desired dialog policy is a bit silly, as examples are not an efficient way to define interaction rules or dialog policies.

Parting words

Supporting multi-turn conversations is crucial for delivering natural user interactions across a wide range of services. Recent advancements in Large Language Models (LLMs) have made zero-shot language understanding possible, and this improved NLU performance has greatly simplified the implementation of dialog state tracking. The primary challenge we now need to address is translating business requirements into conversational interaction logic, commonly known as dialog policy, so that we can build chatbots that assist users while achieving business objectives.
