We are living in an exciting time of rapid development for large language models (LLMs), sparked by ChatGPT. Impressive results, along with substantial investments from tech giants like Google, Microsoft, and OpenAI, have generated significant excitement about the potential of LLMs to revolutionize many industries.
Given the multitude of articles and blog posts that make grand claims about LLMs’ capabilities, figuring out what is hype and what is not can be overwhelming for anyone with a busy schedule. To help product owners and managers navigate this noisy landscape, we present the following recommendations on where it makes sense to integrate LLMs into products today.
- Consider using LLMs for all your symbolic tasks, such as machine translation, at least as a starting point.
- Do use LLMs for under-constrained, open-ended world tasks. An example could be generating a thank-you letter after a successful event.
- Do use LLMs for well-constrained, closed-ended world tasks where 100% correctness is not required. This includes internal use cases and low-impact external use cases.
- Don’t use LLMs for closed-ended world tasks when high accuracy is required.
What do we mean by symbolic tasks and world tasks? To answer this, let’s first revisit a key linguistics concept.
Grounding
Language is a symbolic representation of real-world objects, concepts, actions, and ideas, designed to communicate an understanding of the world. Grounding, or the linking between linguistic representations (words and phrases) and concrete elements in the real world, is essential for natural language understanding.
Each piece of text needs a connection to the world to be meaningful, and its truth can only be determined with grounding. For instance, ‘New York City is the largest city in the U.S.A.’ is true in our current world, but not necessarily true in another world or at a different time in ours. Clearly, any change in this linkage can result in a completely different understanding. To make things more challenging, the linkage itself is arbitrary.
While the text used to train LLMs is often generated based on the author’s grounding, LLMs can only process it at the symbolic level, since they lack sensory interaction with the real world. The text LLMs generate is thus based purely on linguistic patterns in human-generated text; it is not a direct rendering of a specific world model. It is the receiver who reconstructs the meaning from the received text using their own grounding; in other words, beauty is in the eye of the beholder.
Many problems can be framed as text generation tasks and solved by instruction-tuned LLMs. To determine whether a specific task is a good fit, it helps to ask whether the generated text must be evaluated with grounding.
Use LLMs on all symbolic tasks now.
Tasks that can be evaluated without grounding are called symbolic tasks. These tasks often involve generating different representations of either a portion or the entire input text. There are many useful symbolic tasks, including translation, named entity recognition, text classification, sentiment analysis, grammar correction, and slot filling. Dialog understanding, which primarily involves converting user requests in text into structured representations of meaning, is also a symbolic task.
Statistical learning models have been the preferred choice for solving symbolic tasks for a while, and LLM-based solutions take this further:
- Compared to traditional shallow models, the transformer architecture has more modeling capacity: it is a deeper network and can take more context into account.
- Rather than solving each task from scratch, which typically requires a separate labeled dataset, the new approach first pretrains a model on a vast amount of text and then fine-tunes it for one or more specific tasks. This substantially reduces the amount of labeled data required for each task.
To solve a symbolic task with LLMs, you can try things in the following order:
- Always start with prompt engineering. Instruction-tuned LLMs, trained on many tasks, can solve tasks in an in-context learning setting via prompt engineering. You can quickly test ideas and build proof-of-concept applications through prompting, without the need for extensive data collection or model training (see the first sketch after this list).
- If you need better performance, particularly for a task not covered during instruction tuning, or want to reduce inference cost (which scales with the number of tokens in the prompt), LLMs can be fine-tuned for your task (see the fine-tuning sketch below).
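As an illustration, here is a minimal prompt-engineering sketch for one symbolic task, sentiment analysis. It assumes the OpenAI Python client; the model name and the label set are illustrative choices, and any chat-style LLM API would work similarly.

```python
# A minimal prompt-engineering sketch for a symbolic task (sentiment
# analysis). The model name and label set are illustrative choices.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_sentiment(text: str) -> str:
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,  # symbolic tasks favor deterministic output
        messages=[
            {"role": "system",
             "content": "Classify the sentiment of the user's text. "
                        "Answer with exactly one word: positive, "
                        "negative, or neutral."},
            {"role": "user", "content": text},
        ],
    )
    return reply.choices[0].message.content.strip().lower()

print(classify_sentiment("The onboarding flow was a pleasure to use."))
```

The same pattern, with a different instruction, covers translation, entity extraction, grammar correction, and other symbolic tasks.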
Unless the task’s source or target language is poorly represented in the pretraining data, there is no reason to continue pretraining for symbolic tasks. Furthermore, symbolic tasks generally do not require many parameters, so it is possible to use smaller models for faster inference without losing much accuracy. Also, it’s important to note that not all NLP tasks are symbolic tasks; text entailment, for example, is not.
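When prompting alone is not enough, the fine-tuning path can be surprisingly lightweight for symbolic tasks, and a small encoder model is often sufficient. Here is a minimal sketch assuming the Hugging Face transformers and datasets libraries; the model name, dataset, and training settings are illustrative.

```python
# A minimal sketch of fine-tuning a small model for a symbolic task
# (sentiment classification). Model, dataset, and settings are
# illustrative; a real project would tune these choices.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification,
                          AutoTokenizer, Trainer, TrainingArguments)

dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

tokenized = dataset.map(tokenize, batched=True)
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1),
    # A small subset keeps this sketch cheap to run.
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),
)
trainer.train()
```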
A mixed bag for world tasks
On the other hand, tasks that can only be evaluated with grounding are called world tasks. The expected output is generally not explicitly present in the input text and must be generated from real-world knowledge in addition to the input text. An example of a world task is three-digit multiplication: whether ‘123 × 456 = 56,088’ is correct can only be judged against real arithmetic, not against the input text.
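The sketch below makes this concrete: it checks an LLM’s answer to a three-digit multiplication against real arithmetic, which is exactly what evaluation with grounding means here. It assumes the OpenAI Python client; the model name is illustrative.

```python
# Evaluate a world task (three-digit multiplication) with grounding:
# the model's answer is checked against real arithmetic, not against
# anything in the input text. The model name is illustrative.
from openai import OpenAI

client = OpenAI()
a, b = 123, 456
reply = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0,
    messages=[{"role": "user",
               "content": f"What is {a} * {b}? Answer with the number only."}],
)
answer = reply.choices[0].message.content.strip()
print(answer, "correct" if answer == str(a * b) else "hallucinated")
```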
Hallucination refers to the tendency of LLMs to generate text that appears to be correct but is actually false. Since only correct responses are useful for world tasks, can we still use LLMs? It depends on:
- Is it a closed-ended or open-ended problem? Closed-ended questions can be answered with a specific response. In contrast, open-ended questions can be answered in various ways, allowing the respondent to provide their own opinion or explanation. In reality, most tasks fall along a continuous spectrum between these two extremes.
- Is it a copilot or autopilot use case? In an autopilot use case, the result is consumed directly, so hallucination can create liability issues. In a copilot use case, a human is expected to verify the result, so the LLM is not the last line of defense for correctness.
In general, world tasks require LLMs with more parameters, trained on more tokens. Since these tasks are typically solved end to end using prompt engineering, it is very useful to experiment with various prompts to find the best one.
Use LLMs on open-ended tasks
Trained on trillions of tokens from web dumps, LLMs have been exposed to almost all human scenarios in the text domain. They are therefore a good choice for creative world tasks, such as writing a letter to dispute a traffic ticket. For this type of task, there is an important knob you can adjust called ‘temperature’, which controls the level of randomness in the generated text (see the sketch after this list). The choice is between:
- High Temperature: When the temperature is high (e.g., 0.8), the generated text is more random and creative. It allows for a wider range of word choices and sentence structures. This can make the output more unpredictable but also potentially less coherent.
- Low Temperature: When the temperature is low (e.g., 0.2), the generated text is more focused and deterministic. This can result in more coherent but potentially less creative output.
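Here is a minimal sketch of the temperature knob in practice, assuming the OpenAI Python client; the model name and the two temperature values are illustrative.

```python
# Compare low- and high-temperature generations for a creative world
# task. Model name and temperature values are illustrative.
from openai import OpenAI

client = OpenAI()
prompt = "Write the opening line of a letter disputing a traffic ticket."

for temperature in (0.2, 0.8):
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=temperature,  # low: focused; high: more varied
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"T={temperature}: {reply.choices[0].message.content}")
```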
For closed-ended tasks, only use LLMs under non-critical use cases.
Many factors can prevent the generation of a ‘correct’ result: the training data may be incorrect or missing from the web dump, the context may be mishandled, and so on. By improving the quality of the data used during training and tuning, we can reduce the chance that the resulting LLMs hallucinate, but we can never completely eliminate hallucinations. Factual, closed-ended world tasks are generally the most challenging for LLMs. So only for copilot or non-critical use cases should you begin experimenting with LLMs, and retrieval-augmented generation (RAG) is a good way to get started; a minimal sketch follows.
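As a starting point, here is a minimal RAG sketch: embed a handful of documents, retrieve the most relevant one for a question, and ask the model to answer using only that context. It assumes the OpenAI Python client and numpy; the model names and the toy document store are illustrative, and a production system would use a proper vector database.

```python
# A minimal RAG sketch: retrieve the most relevant document, then
# ground the answer in it. Model names and documents are illustrative.
import numpy as np
from openai import OpenAI

client = OpenAI()
docs = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday through Friday, 9am to 5pm.",
]

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small",
                                    input=texts)
    return np.array([d.embedding for d in resp.data])

doc_vecs = embed(docs)

def answer(question: str) -> str:
    # Retrieve the most similar document by cosine similarity.
    q = embed([question])[0]
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    context = docs[int(np.argmax(sims))]
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{"role": "user",
                   "content": f"Answer using only this context:\n{context}\n\n"
                              f"Question: {question}"}],
    )
    return reply.choices[0].message.content

print(answer("How long do I have to return an item?"))
```

Restricting the model to retrieved context reduces, though does not eliminate, hallucination, which is why this pattern fits copilot and non-critical use cases.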
Parting words
Like it or not, LLMs are making inroads into many aspects of our lives. As a product owner, it is time to figure out how best to use this emerging technology to improve your product and business. This article introduced a couple of simple questions to help you decide whether or not to use LLMs for a given task, and I hope it makes your life easier. Of course, the world is not black and white, so if you have a concrete task that is not covered by this decision flow, please leave a comment.