Building CUI App in LLM Age

Sean Wu
6 min read · Sep 21, 2023

Graphical user interface (GUI) applications need to adapt to various screen sizes and shapes, while conversational user interface (CUI) applications (chatbots) only have to handle text messages. Additionally, everyone knows how to chat, but only very few understand how GUIs work. Given all of this, you might expect that building CUI applications, or chatbots, would be easier, more cost-effective, and thus more widespread.

However, the reality is quite different. Building a chatbot often takes more time and costs more than developing a web or mobile app with similar functionality. As a result, there aren’t many usable chatbots out there. What is the problem?

How do we build chatbots today

There are many chatbot coding frameworks and no-code/low-code platforms out there. They all operate under a similar conceptual model and require builders to follow these steps to create chatbots:

1. Anticipate or analyze the user input queries;
2. Extract intents from them;
3. Create labeled examples for these intents;
4. Train NLU models for these new intents;
5. Create elaborate flows to cover all possible conversation paths;
6. Craft responses for every turn in these conversational paths.

None of these steps is difficult on its own; we already know how to handle each of them at human-level performance, so further component-wise improvements are hard to come by. So, perhaps the culprit is the process itself? It is time to consider what can be changed with the advent of ChatGPT.

Do we really need labeling and training for every intent?

To expose the same functionality, GUI and CUI applications can often use the same set of APIs. The most significant difference between them is then how their perception layers operate.

The perception layer is responsible for translating user interactions, such as mouse clicks or voice commands, into structured representations of semantics in the form of events. Relying on visual conventions, binding user interactions to semantics is straightforward in GUI applications. For instance, when a user enters input into a text edit box next to a label that reads ‘destination,’ the input is interpreted as the user’s chosen destination.

In CUI applications, the binding process is considerably more challenging. Human language is inherently ambiguous; words and phrases can have multiple meanings. There are often many different ways to convey the same semantics, and context can be hard to discern. As a result, rule-based solutions are unable to accurately extract semantics to support natural conversations.

Before the advent of ChatGPT, dialogue understanding was typically addressed using supervised learning in the form of intent classification and slot filling. In this paradigm, one needed to prepare a separate set of labeled data for each intent by providing the correct output or intent for each sampled input. The model that performs best on this labeled dataset during training is then tasked with making predictions on new and unseen input. The size and quality of the labeled dataset determine the performance of intent classification, and creating such a dataset requires considerable time and effort.
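To make the cost of this paradigm concrete, here is a toy sketch. It assumes a simple word-count scorer in place of a real NLU model, and the intent names, example utterances, and function names are all made up for illustration. The point is only that every intent needs its own hand-written examples before anything can be predicted.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical labeled data: each intent needs its own set of examples.
LABELED_EXAMPLES: Dict[str, List[str]] = {
    "book_flight": [
        "book a flight to paris",
        "i need a plane ticket",
        "fly me to tokyo tomorrow",
    ],
    "check_weather": [
        "what is the weather today",
        "will it rain tomorrow",
        "weather forecast for paris",
    ],
}


def tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace."""
    return text.lower().split()


def train(examples: Dict[str, List[str]]) -> Dict[str, Counter]:
    """'Training' here is just counting the words seen for each intent."""
    return {
        intent: Counter(tok for utt in utterances for tok in tokenize(utt))
        for intent, utterances in examples.items()
    }


def predict(model: Dict[str, Counter], utterance: str) -> str:
    """Return the intent whose training words overlap most with the input."""
    tokens = tokenize(utterance)
    scores = {
        intent: sum(counts[tok] for tok in tokens)
        for intent, counts in model.items()
    }
    return max(scores, key=scores.get)


model = train(LABELED_EXAMPLES)
print(predict(model, "book me a ticket to tokyo"))
print(predict(model, "will it rain in paris"))
```

Adding a third intent to this sketch means writing a third list of examples and retraining, which is exactly the per-intent effort described above.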

Instruction-tuned LLMs are first pre-trained on vast amounts of text data and then fine-tuned on many common NLP tasks. Instruction tuning allows them to solve tasks expressed in natural language, and pre-training equips them with the ability to generalize from examples and perform tasks they haven’t seen during fine-tuning. As a result, the zero-shot capability demonstrated by LLMs such as ChatGPT suggests that we may not really need the laborious labeling process or time-consuming training anymore.
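A minimal sketch of what zero-shot intent detection could look like: the intents are described in plain language inside the prompt, with no labeled examples and no training step. The intent names, descriptions, prompt wording, and helper names are illustrative assumptions, and the actual LLM call is deliberately left out, since it depends on whichever client you use.

```python
from typing import Dict, Optional

# Hypothetical intents, each described in one sentence instead of labeled data.
INTENTS: Dict[str, str] = {
    "book_flight": "The user wants to buy a plane ticket.",
    "check_weather": "The user asks about current or future weather.",
}


def build_intent_prompt(utterance: str, intents: Dict[str, str]) -> str:
    """Assemble a zero-shot classification prompt from intent descriptions."""
    lines = [
        "Pick the intent that best matches the user message.",
        "Answer with the intent name only, or 'none' if nothing fits.",
        "",
        "Intents:",
    ]
    for name, description in intents.items():
        lines.append(f"- {name}: {description}")
    lines += ["", f"User message: {utterance}", "Intent:"]
    return "\n".join(lines)


def parse_intent(reply: str, intents: Dict[str, str]) -> Optional[str]:
    """Map the model's free-text reply back to a known intent, else None."""
    answer = reply.strip().lower()
    return answer if answer in intents else None


prompt = build_intent_prompt("Get me on a plane to Tokyo", INTENTS)
print(prompt)  # send this string to the LLM of your choice
```

Supporting a new intent in this setup is a one-line change to `INTENTS`. Validating the reply against the known names, as `parse_intent` does, keeps a free-text answer from leaking into downstream logic.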

Do we need an intent for every query we want to serve?

One of the main issues with existing chatbot building is its rigidity; for every user query that we want to serve, we need an intent or something equivalent, like a question-answer pair, to cover it.

While the effort of creating a single intent, particularly a question-answer pair, is small, it adds up. Furthermore, this often ends up producing redundant copies of the same information. Keeping them consistent becomes a challenge as the content we need to expose to our users grows.

Retrieval-augmented generation (RAG) is a technique that combines the power of LLMs with the capabilities of information retrieval systems. This makes it a powerful tool for businesses, as it can be used to generate accurate, informative, and up-to-date text content on a variety of topics. By relying only on the business-specific, up-to-date information supplied as context, RAG helps improve the accuracy and reliability of LLMs, because generation is grounded in that text. RAG allows businesses to serve informational and non-critical user queries using existing content, so there is no need to recreate copies of that content just to expose it conversationally, and no need to explicitly create an intent for each such query.
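The retrieve-then-generate pattern can be sketched as follows. This is a simplified illustration: it assumes plain keyword overlap as the retriever, where a real system would more likely use embeddings and a vector index, and the documents, prompt wording, and function names are made up. The generation call itself is omitted.

```python
import re
from typing import List, Set

# Hypothetical existing business content, reused as-is.
DOCUMENTS: List[str] = [
    "Our store is open from 9am to 6pm, Monday through Saturday.",
    "Refunds are accepted within 30 days of purchase with a receipt.",
    "Standard shipping takes 3 to 5 business days.",
]


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: List[str], top_k: int = 1) -> List[str]:
    """Rank documents by how many words they share with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_tokens & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_rag_prompt(query: str, passages: List[str]) -> str:
    """Ground the answer in the retrieved passages only."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


question = "How long does shipping take?"
passages = retrieve(question, DOCUMENTS)
print(build_rag_prompt(question, passages))  # send to the LLM of your choice
```

Updating an answer here means editing the source document once; no question-answer pair or intent has to be kept in sync with it.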

Is flow-based user interaction design good for CUI?

Graphical User Interface (GUI) interactions are often designed with a flow-based approach, allowing users to navigate through the interface by progressing logically from one screen or page to another. Since users can only interact with the frontend in the way that is implemented on each page, it is easy to make sure users always go somewhere.

However, in a CUI, users can say anything at any given turn, so creating intricate flows to cover all possible user input does not work. We would need to cover exponentially many conversational flows in order to provide a reasonable user experience, which can be prohibitively time-consuming and expensive. Failing to do so results in a user-repelling experience of ‘Sorry, I do not get that.’

Assuming that content is exposed using RAG, the CUI only needs to connect the user with the business services. Instead of anticipating every possible user input as in the flow-based approach, we can adopt a schema-guided approach where we focus on the APIs we want to expose conversationally: we only need to define how to create an instance of every required type via conversation, including both the function type and the types needed by these API functions for input and output parameters.

It is easy for the chatbot to identify which API function the user wants to trigger, and it is also straightforward to keep track of which slots are still missing a value in the current API function. Based on the definition above of how to extract a value for a type from the user conversationally, along with a runtime designed to bring the service to the user as quickly as possible, a chatbot can now handle any possible conversational flow without the need for developers to explicitly define the behavior for that flow.
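A minimal sketch of the schema-guided idea, under stated assumptions: the class names, the `book_flight` function, and its slots are hypothetical, and extracting slot values from the user's text is assumed to happen elsewhere (for example, via an LLM). The sketch shows only the bookkeeping: the schema declares what is needed, and the runtime asks for whatever is still missing, in whatever order the user supplies values.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Slot:
    name: str
    prompt: str  # how the bot asks for this value when it is missing


@dataclass
class ApiFunction:
    name: str
    slots: List[Slot]


@dataclass
class DialogState:
    function: ApiFunction
    values: Dict[str, str] = field(default_factory=dict)

    def fill(self, extracted: Dict[str, str]) -> None:
        """Accept any subset of slot values, in any order, at any turn."""
        known = {slot.name for slot in self.function.slots}
        self.values.update({k: v for k, v in extracted.items() if k in known})

    def next_prompt(self) -> Optional[str]:
        """Return the question for the first missing slot, or None if complete."""
        for slot in self.function.slots:
            if slot.name not in self.values:
                return slot.prompt
        return None


# Hypothetical API function exposed conversationally.
BOOK_FLIGHT = ApiFunction(
    name="book_flight",
    slots=[
        Slot("origin", "Where are you flying from?"),
        Slot("destination", "Where would you like to go?"),
        Slot("date", "What day do you want to leave?"),
    ],
)

state = DialogState(BOOK_FLIGHT)
state.fill({"destination": "Paris"})  # user volunteers the destination first
first_question = state.next_prompt()
state.fill({"origin": "Boston", "date": "2023-10-01"})  # two values in one turn
done = state.next_prompt() is None
print(first_question, done)
```

No conversation path is enumerated anywhere in this sketch; the order of questions falls out of which slots are still empty.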

Parting words

LLMs have changed how we build chatbots for good. We can now build chatbots with a dual-process approach: using RAG as system 1 for informational and non-critical user queries, and leaving transactional or critical queries to system 2, which consists of more deterministic software solutions. The steps required to build a conversational user experience under this new stack are as follows:

For informational or non-critical user queries:
1. Add content to a RAG system.

For transactional or impactful user queries:
1. Decide at the schema level which APIs should be exposed.
2. Add dialog annotations to types and their slots to define the interaction logic.
3. (Input) Add expression exemplars to hot-fix understanding.
4. (Output) Use templates to define the meaning that should be communicated to the user.

Deploy the dual process system and watch it deliver.
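One way the dual-process split could be wired together is sketched below. The routing rule is an assumption for illustration: if the utterance maps to an exposed API function, it goes to the deterministic system 2; otherwise it falls back to RAG as system 1. All function names are hypothetical, and the three stand-in functions are trivial placeholders for real components.

```python
from typing import Callable, Optional


def route(
    utterance: str,
    detect_function: Callable[[str], Optional[str]],
    run_transaction: Callable[[str, str], str],
    answer_with_rag: Callable[[str], str],
) -> str:
    """Send transactional requests to system 2 and everything else to system 1."""
    function_name = detect_function(utterance)
    if function_name is not None:
        return run_transaction(function_name, utterance)
    return answer_with_rag(utterance)


# Hypothetical stand-ins for the real components.
def detect_function(utterance: str) -> Optional[str]:
    # A real system would use an LLM or NLU model here.
    return "book_flight" if "book" in utterance.lower() else None


def run_transaction(function_name: str, utterance: str) -> str:
    # A real system would start schema-guided slot filling here.
    return f"[system 2] starting {function_name}"


def answer_with_rag(utterance: str) -> str:
    # A real system would retrieve content and generate an answer here.
    return "[system 1] answering from retrieved content"


print(route("Book a flight to Paris", detect_function, run_transaction, answer_with_rag))
print(route("What are your opening hours?", detect_function, run_transaction, answer_with_rag))
```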

Since your existing development team already has the related business background, this process has the potential to be a cost-effective approach needed to finally make conversational experiences a reality: users can now get what they want by using their own terms to communicate their needs, which requires no learning or thinking on their part.
