Dialog Understanding #10: These Overlooked Tasks

Sean Wu
5 min read · Dec 31, 2023


Traditionally, when we talk about dialog understanding, we think of intent classification and slot filling, which essentially turn natural language text into a structured representation of semantics (frame events). This transformation makes it easy for us to fulfill the user's request at the code level, often by invoking the appropriate APIs. However, are these two the only dialog understanding tasks we need to worry about in reality? Unfortunately, they are not. In this blog, let's go over some often overlooked but equally important tasks one must handle to build a usable conversational user interface.

Is what the user wants something we serve?

Every business has boundaries for the services they provide to their users, which users may or may not be aware of. When a user asks for something outside of these boundaries, it is perfectly acceptable, or even desirable, for us to detect this deviation and guide the user back to the services we provide.

This problem can be solved naturally if your intent classification is based on binary classification: for each intent, there is a classifier that decides whether the user input triggers that intent. An input that does not trigger any intent in this paradigm can be treated as out of domain. However, this solution relies on the quality of these binary classifiers.
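To make this concrete, here is a minimal sketch of that paradigm; the per-intent scoring functions and the threshold are hypothetical placeholders:

from typing import Callable, Dict

def understand(utterance: str,
               intent_classifiers: Dict[str, Callable[[str], float]],
               threshold: float = 0.5) -> str:
    # Score the utterance against every intent's binary classifier.
    scores = {name: clf(utterance) for name, clf in intent_classifiers.items()}
    best_intent, best_score = max(scores.items(), key=lambda kv: kv[1])
    # If no classifier fires confidently, treat the input as out of domain.
    return best_intent if best_score >= threshold else "out_of_domain"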

In a more direct approach, this problem has been studied under different names in machine learning: one-class classification, anomaly detection, or out-of-domain detection. In this type of machine learning problem, the goal is to identify instances that deviate significantly from the norm or exhibit unusual behavior within a dataset. The model is trained on a single class (the normal or majority class), and its objective is to detect instances that do not conform to this class, often considered anomalies or outliers.
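As an illustration, here is a sketch that casts out-of-domain detection as one-class classification over sentence embeddings. It assumes the sentence-transformers and scikit-learn libraries; the model name and the nu parameter are common defaults, not recommendations:

from sentence_transformers import SentenceTransformer
from sklearn.svm import OneClassSVM

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Train on in-domain utterances only; no out-of-domain examples needed.
in_domain = ["book a table for two", "cancel my reservation",
             "what time do you open"]
detector = OneClassSVM(nu=0.1).fit(encoder.encode(in_domain))

# predict() returns +1 for in-domain inputs and -1 for outliers.
print(detector.predict(encoder.encode(["tell me a joke"])))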

When the user wants multiple services in one sentence

For efficiency or conciseness, sometimes users express multiple intents in one sentence. For example:

User: "Can you help me book a meeting with Mr. Sloan in their downtown office,
set a reminder on my calendar, and order a ride for tomorrow morning?"

Such sentences can pose a challenge to naively designed intent classifiers, as the additional information about other intents can dilute the text for every intent. So it is often a good idea to make these classifiers more robust by augmenting training examples with extra information when training the binary classifiers, as sketched below.
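One way to do that augmentation, under the assumption that we simply splice in clauses drawn from other intents' training data:

import random

def augment(examples: list[str], distractors: list[str]) -> list[str]:
    # Pad each training utterance with an unrelated clause from another
    # intent, so the binary classifier learns to fire on partial evidence.
    augmented = []
    for example in examples:
        filler = random.choice(distractors)
        if random.random() < 0.5:
            augmented.append(f"{example}, and {filler}")
        else:
            augmented.append(f"{filler}, and {example}")
    return examples + augmented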

Discourse segmentation, a form of text segmentation, is the closely related NLP task. It refers to the process of dividing a continuous text into smaller, coherent segments or units based on the flow of discourse and the relationships between different parts of the text. The goal is to identify boundaries between segments that represent distinct ideas, topics, or units of meaning within the overall text.

Note that this problem can be solved as a simple segmentation problem, where one just needs to find the boundary of each unit. Alternatively, it can first be solved as a text generation problem and then segmented, so that each segment carries the complete information it needs. For example, the above user sentence can be processed as:

User: Can you help me book a meeting with Mr. Sloan in their downtown office?
User: Can you set a reminder on my calendar for my meeting with Mr. Sloan
in their downtown office?
User: Can you order a ride for me to Mr. Sloan's downtown office tomorrow morning?

As you can see, things can get more complicated when we try to handle the missing information, for example, the meeting time in this case.
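Here is a sketch of that generate-then-segment idea, where llm() is a placeholder for whatever completion API you use and the prompt wording is only illustrative:

PROMPT = """Rewrite the user message as a list of standalone requests,
one per line, each carrying all the context it needs.

User: {utterance}
Requests:"""

def decompose(utterance: str, llm) -> list[str]:
    # The LLM rewrites the multi-intent utterance; each non-empty line of
    # the completion becomes one self-contained segment.
    completion = llm(PROMPT.format(utterance=utterance))
    return [line.strip() for line in completion.splitlines() if line.strip()]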

Beyond Yes/No

Users do not always answer yes/no questions with a simple yes or no, which can create problems for dialog understanding. For example:

Bot: Are you sure you want to book a ride to Central Park now?
User: I'd like to take a walk, the weather is so good.

In this example, while the user did not answer the yes/no question directly, they did affirm it. If the dialog understanding module does not understand this, the chatbot will need to ask the same question again, resulting in a bad user experience.

While there is little direct research on this problem, there are some closely related ones, such as Natural Language Inference (NLI). NLI involves determining the logical relationship between two given pieces of text. The goal is to decide whether one statement (the hypothesis) can be inferred from another statement (the premise) by classifying the relationship into one of the following categories: Entailment (True), Contradiction (False), and Neutral (Irrelevant). Here's a typical example:

Premise: "The cat is sitting on the windowsill."
Hypothesis: "A cat is resting on the windowsill."

where the relationship is entailment because the hypothesis logically follows from the premise.

The Yes/No inference problem is closely related to this issue on two fronts: first, it also aims to find the logical relationship between two pieces of text, typically one question and one response; second, it is also a three-way classification problem: Yes/No/Irrelevant. Recently, LLM-based solutions have been dominating the leaderboards of NLI tasks. Therefore, it is expected that one can use the same technology to address the Yes/No inference problem, provided a reasonably sized dataset is curated.
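As a sketch, one can repurpose an off-the-shelf MNLI model from Hugging Face transformers for this. Turning the bot's question into a declarative hypothesis is done by hand here, and the mapping from NLI labels to Yes/No/Irrelevant is our assumption:

from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def infer_yes_no(hypothesis: str, response: str) -> str:
    # Treat the user's response as the premise and the (manually
    # declarativized) bot question as the hypothesis.
    result = nli([{"text": response, "text_pair": hypothesis}])[0]
    return {"ENTAILMENT": "Yes",
            "CONTRADICTION": "No"}.get(result["label"], "Irrelevant")

print(infer_yes_no(
    "The user wants to book a ride to Central Park now.",
    "I'd like to take a walk, the weather is so good."))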

Users change their mind

In our 5 levels of conversational user interface, we pointed out that in addition to intents in the user space, users sometimes need to express intents in the dialog kernel space, which are used to change the chatbot's understanding of their intention. For example:

User: I'd like to buy a ticket to Shanghai.
Bot: When do you want to leave?
User: Wait, I meant Shenzhen.
Bot: Shenzhen it is. When do you want to leave?

Users can potentially change their mind about anything. Without accurately understanding these utterances, the user experience can suffer greatly.

In theory, understanding these utterances can be handled similarly to intents and slots: we are also converting natural text into a structured representation of semantics. However, an interesting question is whether one can solve these problems in a generic fashion, so that CUI developers only need to worry about their business-related services and interactions. After all, in the GUI world, a user changing their mind is transparent to the application developer for the most part.
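For instance, the correction above could be normalized into a structure like the following, where all the names are illustrative rather than part of any existing framework:

from dataclasses import dataclass

@dataclass
class SlotCorrection:
    frame: str       # which frame event is being amended, e.g. "BuyTicket"
    slot: str        # which slot, e.g. "destination"
    old_value: str   # what the chatbot currently believes: "Shanghai"
    new_value: str   # what the user actually meant: "Shenzhen"

# "Wait, I meant Shenzhen." would then be understood as:
correction = SlotCorrection("BuyTicket", "destination", "Shanghai", "Shenzhen")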

Parting words

Unfortunately, dialog understanding is nowhere near as simple as intent classification and slot filling. We listed a couple of other problems that a dialog understanding module is expected to address, together with dialog policy or interaction logic, in order to provide a usable experience. Luckily, recent progress in large language models, pre-trained on vast amounts of text, offers a more production-friendly way of addressing all these dialog understanding problems under a multi-task setting. By fine-tuning an LLM like T5 or Llama, we can get an effective few-shot, in-context learner for all these problems in one model, which is the ultimate goal of our open source project: opencui/dug.
