Dialog Understanding #11: Beyond Fine-tuning with Head

Sean Wu
5 min read · Aug 14, 2024


As soon as the transformer-based BERT was introduced, it forever changed how natural language understanding (NLU) is solved. Instead of training a shallow model from scratch, we could simply fine-tune a pretrained BERT model. Taking text classification as an example, fine-tuning amounts to adding a classification head on top of the special [CLS] token of the existing transformer architecture. This head is typically a logistic regression that uses the last-layer embedding as features. We can then use labeled examples to train the weights of this classification head.
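
To make this concrete, here is a minimal sketch of head-based fine-tuning with the Hugging Face transformers library; the toy dataset and hyperparameters are placeholders, not a recommendation.

from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# The classification head is a randomly initialized linear layer on top of
# the [CLS] embedding; only num_labels is task-specific.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=5
)

# Toy in-memory dataset; a real task needs many labeled examples like these.
train_dataset = Dataset.from_dict({
    "text": ["Stocks rallied after strong earnings.", "The team won the finals."],
    "label": [0, 2],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3),
    train_dataset=train_dataset.map(tokenize, batched=True),
)
trainer.train()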

Why is fine-tuning favored? To map inputs to outputs reliably, we need to learn how to represent the input so we can make correct predictions for inputs we have not seen during training. For shallow models like logistic regression, we need to learn this from scratch for every task, so a large number of labeled examples is required during training. When fine-tuning with a head, input features are computed in a context-dependent way using the weights learned during task-independent pretraining on large text corpora. Since we only need to learn the mapping between that embedding space and the output classes, a smaller set of labeled examples can give us equivalent results.

While fine-tuning BERT-like encoder models has proven to be effective, it has some issues. For each task, we need to train a different head, which in turn needs its own labeled examples. More importantly, the industry has generally moved on to decoder-based models like ChatGPT. On one hand, there have been no new open-source or open-weight BERT models released in a couple of years; on the other hand, Meta open-sources a new LLaMA model every year. So it is time to figure out how to use these new decoder-based models to solve NLU tasks, so that we can take advantage of the huge amount of capital invested in training them (Meta reportedly spent $100M to train LLaMA 3).

Prompting for zero-shot learning

Before instruction tuning, to teach a computer to do something for us, we needed to hard-code the procedure, or provide labeled examples to some machine learning algorithm so that it could figure out how to map input to the right output. With instruction-tuned LLMs, we have another way: prompting.

A prompt for an LLM is a piece of text input that guides or instructs the model to generate a specific type of response or perform a particular task. For example, the following prompt can be used to classify text:

You are a text classification system. Your task is to classify the given text into one of the following categories:
- Business
- Technology
- Sports
- Entertainment
- Politics

Analyze the content and context of the text carefully, then provide your classification along with a brief explanation for your decision.

Text: {{Insert text here}}
Category:

According to Teven Le Scao’s study [1] on a range of text classification problems, prompts like this, without a single labeled example, can solve problems at a level that otherwise requires hundreds of examples for fine-tuning with a head. This zero-shot capability can greatly reduce the upfront cost of building intelligent systems.
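
As an illustration, here is a minimal sketch of sending that prompt to an instruction-tuned model with the OpenAI Python client; the model name is an assumption, and any capable instruction-tuned LLM would do.

from openai import OpenAI

PROMPT = """You are a text classification system. Your task is to classify
the given text into one of the following categories:
- Business
- Technology
- Sports
- Entertainment
- Politics

Analyze the content and context of the text carefully, then provide your
classification along with a brief explanation for your decision.

Text: {text}
Category:"""

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; any instruction-tuned model works
        messages=[{"role": "user", "content": PROMPT.format(text=text)}],
        temperature=0,  # keep the classification deterministic
    )
    return response.choices[0].message.content

print(classify("The S&P 500 rose 2% today on strong earnings reports."))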

In-context Learning

The effectiveness of prompts can be easily increased if you have a couple of labeled examples. In addition to instruction-following, instruction-tuned models also exhibit in-context learning: by recognizing patterns, they seem to learn from a few examples embedded directly in the prompt. For example:

You are a text classification system. Your task is to classify the given text into one of the following categories:
- Business
- Technology
- Sports
- Entertainment
- Politics

Analyze the content and context of the text carefully, then provide your classification along with a brief explanation for your decision.

Here are some examples:
Text: "The stock market saw significant gains today, with the S&P 500 rising 2%."
Category: Business/Finance

Text: "Scientists have discovered a new species of butterfly in the Amazon rainforest."
Category: Science/Nature

Text: "The latest smartphone from Apple features an improved camera and longer battery life."
Category: Technology

Text: {{Insert text here}}
Category:

It has been shown that adding a few examples in this way, commonly known as few-shot learning, generally improves prediction performance. One such study can be found in the references below.
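
As a small illustration, here is a sketch of assembling such a few-shot prompt programmatically, assuming the instruction string and labeled examples come from your application:

def build_few_shot_prompt(instruction: str,
                          examples: list[tuple[str, str]],
                          text: str) -> str:
    # Render each labeled example in the same Text/Category format the
    # model is expected to complete for the final, unlabeled input.
    parts = [instruction, "", "Here are some examples:"]
    for example_text, label in examples:
        parts.append(f'Text: "{example_text}"')
        parts.append(f"Category: {label}")
        parts.append("")
    parts.append(f"Text: {text}")
    parts.append("Category:")
    return "\n".join(parts)

In the next section, the static examples list becomes a dynamically retrieved one.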

Retrieval-augmented understanding

Most LLMs have a fixed-size context window that your entire prompt must fit within, so there are only so many examples you can include. Besides, the cost of inference is proportional to the length of the prompt, so you might not want to include every example you have anyway. The question, then, is which examples to include in the prompt.

Instead of statically deciding which examples to include, one can pick examples dynamically based on the input text. This trick is commonly known as retrieval-augmented generation. There is a fair amount of research on how to select examples for text classification problems. Generally, we want labeled examples that are close to the input text, with outputs spanning many different classes so the LLM gets unbiased help. Furthermore, it is also important to combine embedding-based vector search with keyword search to define closeness between the user query and the example inputs.
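
Here is a minimal sketch of the embedding half of that idea using sentence-transformers (the model name and the labeled_examples store are assumptions). Taking the nearest neighbor from each class keeps the retrieved examples unbiased; a production system would additionally blend in keyword-search scores such as BM25.

import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model

# Assumed example store: a list of (text, label) pairs.
labeled_examples = [
    ("The S&P 500 rose 2% today.", "Business"),
    ("Apple unveiled a phone with a better camera.", "Technology"),
    ("The home team won the championship in overtime.", "Sports"),
]
example_vecs = encoder.encode([t for t, _ in labeled_examples],
                              normalize_embeddings=True)

def select_examples(query: str, k_per_class: int = 1):
    query_vec = encoder.encode([query], normalize_embeddings=True)[0]
    scores = example_vecs @ query_vec  # cosine similarity (unit vectors)
    picked = {}
    # Walk examples from most to least similar, keeping at most
    # k_per_class per label so every class is represented.
    for idx in np.argsort(-scores):
        text, label = labeled_examples[idx]
        if len(picked.setdefault(label, [])) < k_per_class:
            picked[label].append((text, label))
    return [ex for group in picked.values() for ex in group]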

Instruction fine-tuning

Research has shown that the performance of text classification can be further improved with instruction fine-tuning, even when you do retrieval-augmented understanding with a strong instruction-tuned LLM like GPT-4o.

Compared to fine-tuning with a head on encoders, instruction fine-tuning has some advantages: we get to use all the existing tricks, including prompting, in-context learning, and retrieval augmentation. In addition, instead of needing one head per task, we can encode all the different tasks into the same text-generation format and train a single model for them. We can also use parameter-efficient methods such as LoRA, so instruction fine-tuning can be done cheaply.
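
For instance, here is a minimal sketch of setting up LoRA-based instruction fine-tuning with the peft library; the base model, rank, and target modules are illustrative choices, not a prescription.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_config = LoraConfig(
    r=16,                                 # low-rank adapter dimension
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights

# From here, training proceeds as ordinary supervised fine-tuning on
# prompt-plus-label text (e.g., with transformers.Trainer or trl's
# SFTTrainer), with every NLU task cast into the same generation format.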

Parting words

Before large language models (LLMs), dialog understanding was addressed with traditional shallow NLU (natural language understanding) models, such as logistic regression and conditional random fields, which required a sizable number of manually labeled examples to train. Producing those labels was thus a significant part of the work involved in building chatbots.

While head-based fine-tuning already improved developer experience by reducing the number of labeled examples needed to build NLU solutions, prompts provide zero-shot learning, in-context learning provides few-shot learning, and retrieval augmentation provides non-parametric learning for hot fixes. Instruction fine-tuning further improves the accuracy of NLU solutions. This new set of tools presents the best developer experience we know so far when it comes to NLU solutions. And better yet, there are already open-source packages for this.

References:

  1. https://aclanthology.org/2021.naacl-main.208.pdf
  2. https://arxiv.org/pdf/2403.17661
  3. https://arxiv.org/pdf/2401.11624v3
  4. https://openreview.net/pdf?id=7hSVLwNbWT
