A software maturity model is a tool that assesses the effectiveness of software and identifies necessary capabilities to improve user experience. It functions as a reference system for stakeholders such as project owners, product managers, and development teams to discuss the balance between experience and cost, particularly for new types of applications.
Existing chatbot or conversational AI maturity models often encompass end-to-end behavior, involving both the user interface (UI) and the underlying backend services. Since we have been developing backends for graphical user interfaces (GUIs) for many years, and the same kind of backend can serve chatbots, modeling the backend together with the interface is unnecessary. Constructing a maturity model solely for conversational user interfaces (CUIs) can simplify the design and implementation of CUI applications such as chatbots, agents, and copilots.
Background
A conversational user interface can be decomposed into three modules, based on how a user utterance is processed (a minimal sketch of this pipeline follows the list):
- The dialog understanding module is responsible for converting user utterances into a frame, which is a structured representation of the services that users want.
- The dialog management module is responsible for interacting with backend APIs and figuring out how to communicate with the user effectively to serve their needs while achieving the business objectives (such as up-selling when opportunities arise). Clearly, it needs to be “programmed” for each business.
- The text generation module takes a dialog act generated by the dialog manager, and produces corresponding natural text.
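To make this decomposition concrete, here is a minimal sketch of the three modules as a Python pipeline. The `Frame` and `DialogAct` shapes and the function signatures are illustrative placeholders, not a reference design.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Frame:
    """Structured representation of the service the user wants."""
    name: str                       # e.g. "buy_movie_ticket"
    slots: dict = field(default_factory=dict)

@dataclass
class DialogAct:
    """Language-independent description of what the bot should do next."""
    act: str                        # e.g. "request", "inform", "confirm"
    frame: Frame
    slot: Optional[str] = None      # slot being requested or confirmed, if any

def understand(utterance: str, context: list) -> Frame:
    """Dialog understanding: map the user utterance (plus context) to a frame."""
    raise NotImplementedError       # model- or rule-based in practice

def manage(frame: Frame, context: list) -> DialogAct:
    """Dialog management: consult backend APIs and decide the next dialog act."""
    raise NotImplementedError       # "programmed" per business

def generate(act: DialogAct) -> str:
    """Text generation: render the dialog act as natural language."""
    raise NotImplementedError       # template- or model-based

def turn(utterance: str, context: list) -> str:
    """One conversational turn through the three modules."""
    frame = understand(utterance, context)
    act = manage(frame, context)
    return generate(act)
```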
Businesses develop CUIs to provide services to users regardless of the language they speak. A conversational interface can therefore be divided into two layers: the language layer and the interaction layer.
At the language layer, in both dialog understanding and text generation, the focus is specifically on the choice of words, tone, style, and linguistic elements used in the conversation to effectively convey the intended message and facilitate smooth interaction between the user and the system.
At the interaction layer, conversational interaction is handled using semantic frames in a language-independent fashion, which can be modeled in five levels depending on how users can express these semantic frames.
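As a toy illustration of the split: the utterances below belong to the language layer, while the frame they both map to, and everything dialog management does with it, lives in the language-independent interaction layer. The frame shape is just a sketch.

```python
# Language layer: the same request phrased in two different languages.
english = "What are your hours?"
chinese = "你们几点开门？"

# Interaction layer: both map to the same language-independent frame, so the
# conversational behavior behind it only needs to be designed and built once.
frame = {"name": "business_hours", "slots": {}}
```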
1. Single turn only
At this level, a user needs to express what they want completely and correctly in one sentence in order to get it. A semantic frame that can be reliably expressed in a single sentence is called a proposition frame, or a frame without any slots. An example is frequently asked questions:
User: What are your hours?
Bot: We open every day from 5:00pm to 9:00pm.
Lacking structure, such frames have limited expressive power. For example, if you have different hours for different days, you need one frame for each day. But since the response is always atomic and context-independent, this level of CUI can be easily built by maintaining a list of question–answer pairs. Dialog understanding reduces to finding the question that best matches the user utterance, and response generation simply forwards the corresponding answer to the user.
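Such a Level 1 CUI can be sketched as a nearest-question lookup over the question–answer list; the string-similarity matcher below is only a stand-in for whatever retrieval or embedding model a real system would use.

```python
from difflib import SequenceMatcher

# Level 1 CUI: a list of question-answer pairs is the entire "dialog model".
FAQ = [
    ("What are your hours?", "We open every day from 5:00pm to 9:00pm."),
    ("Where are you located?", "We are at 123 Main Street."),
]

def similarity(a: str, b: str) -> float:
    # Stand-in for a real matcher (e.g., sentence embeddings).
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def answer(utterance: str, threshold: float = 0.5) -> str:
    # Dialog understanding reduces to finding the best-matching question;
    # response generation simply forwards the stored answer.
    question, reply = max(FAQ, key=lambda qa: similarity(utterance, qa[0]))
    if similarity(utterance, question) < threshold:
        return "Sorry, I don't have an answer for that."
    return reply

print(answer("what is your hours"))  # -> "We open every day from 5:00pm to 9:00pm."
```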
2. Multiple turn ready
While it is possible to express a semantic frame with multiple slots in one sentence, users may not consistently do so, particularly when they are not familiar with the service. For example, when buying a movie ticket, users might provide the required information on the title, showtime, and quantity, but overlook the format since they may not be aware of the availability of IMAX shows.
When there is missing information in the user’s initial utterance, the chatbot will need to engage in conversations to collect it, potentially over multiple turns, in a process commonly known as slot filling.
User: I'd like to buy a ticket to Shanghai.
Bot: When do you want to leave?
User: How about this Friday.
...
The dialog management module needs to be programmed to consider factors such as inventory availability, context, and even user history when planning a conversation to fill slots. For example, in movie ticketing, when asking the user about their preferred showtime, it is better to provide a list of showtimes that still have open seats. This way, users can choose from options we can accommodate, eliminating the possibility of them being disappointed by hearing ‘sorry, it's sold out’ after making their selection.
Bot: What time would you like to see the movie? We still have open
seats for two slots: 1) 8:00pm and 2) 10:00pm.
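Concretely, the prompting logic might look like the sketch below, where the slot names and the `available_showtimes` inventory lookup are made-up placeholders for a real backend API.

```python
from typing import Optional

REQUIRED_SLOTS = ["title", "showtime", "quantity", "format"]

def available_showtimes(title: str) -> list:
    """Placeholder for a backend API call: showtimes that still have open seats."""
    return ["8:00pm", "10:00pm"]

def next_prompt(slots: dict) -> Optional[str]:
    """Pick the first missing slot and phrase the prompt around serviceable options."""
    for slot in REQUIRED_SLOTS:
        if slot in slots:
            continue
        if slot == "showtime":
            options = available_showtimes(slots.get("title", ""))
            numbered = " and ".join(f"{i}) {t}" for i, t in enumerate(options, 1))
            return ("What time would you like to see the movie? "
                    f"We still have open seats for {len(options)} slots: {numbered}.")
        return f"Which {slot} would you like?"
    return None  # frame is complete; ready to call the booking API

print(next_prompt({"title": "Dune", "quantity": 2}))
```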
For boolean slots, dialog understanding needs to figure out what the user actually implies. For example, the following user response should be interpreted as a ‘Yes’.
Bot: Are you sure that you want a large pizza?
User: I am so hungry I can eat a cow now.
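One way to handle such indirect replies is to ask a language model to resolve them into a boolean; the `llm` wrapper below is a hypothetical placeholder for whatever completion API is available, so this is only a sketch.

```python
def llm(prompt: str) -> str:
    """Hypothetical placeholder: plug in any chat-completion client here."""
    raise NotImplementedError

def interpret_boolean(question: str, reply: str) -> bool:
    """Decide whether an indirect reply implies yes or no for a boolean slot."""
    prompt = (
        "The assistant asked a yes/no question and the user replied indirectly.\n"
        f"Question: {question}\n"
        f"Reply: {reply}\n"
        "Answer strictly with YES or NO."
    )
    return llm(prompt).strip().upper().startswith("YES")

# interpret_boolean("Are you sure that you want a large pizza?",
#                   "I am so hungry I can eat a cow now")  # expected -> True
```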
To communicate a semantic frame with slots to users, response generation needs to produce natural text for it. There are two ways to do this: template-based for better control, and model-based for more diversity. For example, we can generate the following dynamic greeting using a template parameterized by the time of day, the user's sex, and their last name:
Good {time_of_day}, {prefix_for_sex(sex)} {last_name}, how can I help you?
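The template above can be rendered with ordinary string formatting; the `prefix_for_sex` mapping and profile values below are illustrative, and the function call inside the template is precomputed before formatting in this sketch.

```python
from datetime import datetime

GREETING = "Good {time_of_day}, {prefix} {last_name}, how can I help you?"

def prefix_for_sex(sex: str) -> str:
    # Illustrative mapping only; a real system should honor the user's stated preference.
    return {"female": "Ms.", "male": "Mr."}.get(sex, "")

def time_of_day(now: datetime) -> str:
    return "morning" if now.hour < 12 else "afternoon" if now.hour < 18 else "evening"

print(GREETING.format(time_of_day=time_of_day(datetime.now()),
                      prefix=prefix_for_sex("female"),
                      last_name="Smith"))
# -> e.g. "Good evening, Ms. Smith, how can I help you?"
```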
Building an effective Level 2 conversational user interface for arbitrary APIs is a lot more challenging. Not only do we need to detect the semantic frame (sometimes known as an intent), but we also have to fill slots while addressing many concerns: how to extract values that are not well represented on the web (such as new products), when and how to prompt users when a slot is missing a value, how to provide value recommendations, and what happens when the user's initial choice is not serviceable.
3. Allow corrections and alternative semantics
One of the limitations thus far is that users cannot make mistakes or change their minds; otherwise, they have to restart from scratch. This can be problematic for complex services with many slots in the frame. Level 3 CUI adds support for conversation correction, making the process of obtaining services via conversation more efficient for users.
User: I'd like to buy a ticket to Shanghai.
Bot: When do you want to leave?
User: Wait, I meant Shenzhen.
Bot: Shenzhen it is. When do you want to leave?
Allowing corrections requires developers to support CRUD (create, read, update, and delete) operations on slots, with create and delete mostly useful for multi-value slots, which can add significant development cost. To illustrate the potential difficulty of supporting CRUD operations, consider this user utterance after three drinks have been put in the shopping cart:
User: Can you add sugar to the large one?
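A minimal sketch of the "update" part of CRUD, assuming a multi-value drinks slot and a referent already resolved from the phrase "the large one"; the resolution below only handles size and is purely illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class Drink:
    name: str
    size: str
    extras: list = field(default_factory=list)

cart = [Drink("latte", "small"), Drink("mocha", "medium"), Drink("latte", "large")]

def update_item(cart: list, referent: dict, extra: str) -> None:
    """CRUD 'update' on a multi-value slot: find the referenced item, then modify it."""
    matches = [d for d in cart if all(getattr(d, k) == v for k, v in referent.items())]
    if len(matches) != 1:
        raise ValueError("Ambiguous or missing referent; the bot should ask a follow-up.")
    matches[0].extras.append(extra)

# "Can you add sugar to the large one?" -> referent {"size": "large"}, extra "sugar"
update_item(cart, {"size": "large"}, "sugar")
print(cart[2])  # Drink(name='latte', size='large', extras=['sugar'])
```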
In addition, we should also support slot operations with alternative semantics to facilitate conversational interaction. For example, when filling the departure date slot, the bot needs to filter the candidate dates in response to the following negating instruction:
User: I cannot leave on Monday.
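One way to sketch this: the negating instruction does not fill the slot, it constrains the candidate set that dialog management will offer next. The date range and weekday indexing below are illustrative.

```python
from datetime import date, timedelta

def candidate_departures(start: date, days: int = 7) -> list:
    """Placeholder: departure dates the backend could serve in the next week."""
    return [start + timedelta(d) for d in range(days)]

def apply_negation(candidates: list, excluded_weekday: int) -> list:
    """'I cannot leave on Monday' fills nothing; it filters the candidates instead."""
    return [d for d in candidates if d.weekday() != excluded_weekday]

# Monday is weekday 0; the bot now only offers the remaining dates when prompting.
options = apply_negation(candidate_departures(date.today()), excluded_weekday=0)
print(options)
```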
For many real-world use cases, Level 3 is the minimum level that a CUI needs to reach to be considered usable.
4. Multiple tasks support
When the task a user wants to accomplish becomes even more complex, they might need to juggle multiple services simultaneously. At this level, a chatbot can naturally switch between different semantic frames and use conversational history to automatically figure out information missing from the user's utterances.
User: I'd like to buy a ticket to Shanghai.
Bot: When do you want to leave?
User: This Friday. By the way, what is the weather like over there?
Bot: It will be clear skies in Shanghai this Friday, with temperatures in the mid 20s.
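One illustrative way to get this behavior is to keep the frames touched so far as conversational history and carry shared slots over to a newly activated frame; the structures and slot names below are a sketch only.

```python
history = []  # frames touched so far, most recent last

def start_frame(name: str, slots: dict) -> dict:
    frame = {"name": name, "slots": dict(slots)}
    history.append(frame)
    return frame

def carry_over(frame: dict, needed: list) -> dict:
    """Fill missing slots from earlier frames instead of re-asking the user."""
    for slot in needed:
        if slot in frame["slots"]:
            continue
        for past in reversed(history[:-1]):
            if slot in past["slots"]:
                frame["slots"][slot] = past["slots"][slot]
                break
    return frame

ticket = start_frame("buy_ticket", {"destination": "Shanghai", "date": "Friday"})
weather = start_frame("check_weather", {})
carry_over(weather, ["destination", "date"])  # "over there", "this Friday" resolved
print(weather["slots"])  # {'destination': 'Shanghai', 'date': 'Friday'}
```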
Another critical capability at this level is for a chatbot to understand utterances that contain multiple intents, handling one intent first and returning to the rest later. For example:
User: I'd like to buy a ticket to Shanghai. Can you also book a taxi from
the airport to my hotel?
... (interactions for booking the ticket)
Bot: I also reserved a taxi for you. The driver's number is 139xxxxxxx;
please call him after you get your luggage.
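Multi-intent utterances can be sketched as a queue of pending frames produced by dialog understanding, with dialog management finishing the active one before returning to the rest; all names below are illustrative.

```python
from collections import deque

# "I'd like to buy a ticket to Shanghai. Can you also book a taxi ..."
# Dialog understanding detects two frames; they are served in order.
pending = deque([
    {"name": "buy_ticket", "slots": {"destination": "Shanghai"}},
    {"name": "book_taxi", "slots": {"from": "airport", "to": "hotel"}},
])

def serve(frame: dict) -> None:
    print(f"(interactions to complete {frame['name']} with {frame['slots']})")

while pending:
    active = pending.popleft()  # finish one task before returning to the rest
    serve(active)
```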
When the user's intention is not clear, the chatbot can switch to a system skill to help the user narrow down their intention quickly. How to support this without additional effort from CUI developers is an interesting question. A chatbot with a Level 4 CUI is good.
5. Personal touch
Clearly, for service-oriented conversations, a Level 4 CUI already provides a very usable experience: users should be able to get what they want effectively and move on. However, the user experience can be further improved with a personal touch, for example by offering a timely congratulation.
User: my son has just passed his test and we are considering adding him
onto my insurance, is that something i'll be able to do?
Bot: Congratulations to your son on passing his driving test! ...
At this level, dialog understanding needs to pick up events that are not directly related to the service the business wants to sell but that carry special meaning for the user. This allows us to later generate personalized conversational icebreakers. Additionally, text generation can be enhanced with extra controls such as style; for example, based on the user's demographics or sentiment, we can use emojis more frequently for a younger audience. A chatbot with a Level 5 CUI is perfect.
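As a toy sketch, event detection and style control can be layered on top of ordinary response generation; the phrase list, event names, and emoji rule below are made up purely for illustration.

```python
# Illustrative only: life-event detection and style control as optional extra signals.
LIFE_EVENTS = {"passed his test": "driving_test_passed", "got married": "wedding"}

def detect_life_event(utterance: str):
    for phrase, event in LIFE_EVENTS.items():
        if phrase in utterance.lower():
            return event
    return None

def render(reply: str, event=None, audience="default") -> str:
    if event == "driving_test_passed":
        reply = "Congratulations to your son on passing his driving test! " + reply
    if audience == "young":
        reply += " 🎉"  # style control: more emojis for a younger audience
    return reply

print(render("Yes, you can add him to your policy.",
             event=detect_life_event("my son has just passed his test and ..."),
             audience="young"))
```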
Parting words
The five levels proposed here represent progressive improvements in usability, along with the increasing effort required to build the corresponding experience, making it easier for product owners to balance effort and user experience. The end experience will also depend on language-layer considerations, but hopefully it is clear that this is an orthogonal concern that can be addressed separately.
P.S.
ChatGPT has demonstrated that conversational user interfaces can be a great way to access valuable static knowledge buried in unstructured text. One might hope that building a functional conversational user interface to access services is easy now. Unfortunately, good conversational experiences are still rare.
In a GUI, developers have full control over what users can do, so to build a usable interface they only need to prepare for the options they provide. In a CUI, users can say anything at any turn, so developers need to prepare for all reasonable interactions. With existing flow-based design and implementation, this means enumerating all possible conversation paths, which is prohibitively costly. Luckily, it is possible to focus only on the next step needed to complete the frame, which makes the problem tractable.
Also, this writing assumes that we are building a conversational user interface for service APIs. If you only want to expose information buried in documents, you can use Retrieval Augmented Generation (RAG). It is also possible to combine them in a dual-process approach.
References:
https://arxiv.org/pdf/2012.11976.pdf
https://en.wikipedia.org/wiki/Frame_semantics_(linguistics)
https://medium.com/p/b58d11188436