
Dialog Understanding 12: Multi-Task or Single-Task Instruction Tuning?

Sean Wu
3 min read · Oct 7, 2024

When building a conversational user interface for real-world applications, it quickly becomes clear that beyond intent detection and slot filling, there are other crucial dialogue understanding tasks that need attention. For instance:

Bot: Are you sure you want to book a ride to Central Park now?
User: I like to take a walk, the weather is so good.

In this scenario, the bot should be able to interpret the user’s response as a “yes” to the original question. Addressing such nuances requires advanced NLP techniques, and instruction tuning is often a powerful solution. The key question then arises: should we fine-tune a single model to handle all tasks, or fine-tune a separate model for each task? And what factors should drive that decision?
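To make the task concrete, here is a minimal sketch of how the exchange above could be turned into an instruction-tuning example. The instruction wording, label set, and field layout are illustrative assumptions, not a fixed format:

```python
# Sketch: cast the implicit-affirmation task as an instruction-tuning example.
# The instruction text and labels ("Affirm", "Deny", "Irrelevant") are made up
# for illustration.

def build_example(question: str, reply: str, label: str) -> dict:
    """Format one training example in a common instruction/input/output layout."""
    return {
        "instruction": (
            "The bot asked a yes/no question. Decide whether the user's reply "
            "means Affirm, Deny, or Irrelevant."
        ),
        "input": f"Bot: {question}\nUser: {reply}",
        "output": label,
    }

example = build_example(
    "Are you sure you want to book a ride to Central Park now?",
    "I like to take a walk, the weather is so good.",
    "Affirm",  # the indirect reply implies "yes"
)
```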

Serving

One of the main advantages of fully fine-tuning a single model is that we only need to deploy one model to serve all tasks. In an age where even small models have billions of parameters, this can result in significant savings in GPU memory. However, recent research shows that Low-Rank Adaptation (LoRA) tuning can achieve performance similar to full fine-tuning while updating only a fraction of the parameters. Additionally, popular inference libraries like vLLM can efficiently serve hundreds of LoRA adapters trained on the same base model simultaneously, so serving is no longer the deciding factor.
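To make this concrete, here is a sketch of multi-adapter serving with vLLM’s LoRARequest API. The base model name, adapter names, and paths are placeholders, not a real deployment:

```python
# Sketch: serve several per-task LoRA adapters on one base model with vLLM.
# Base model and adapter paths below are placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# The base model is loaded once; adapters are swapped in per request.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True, max_loras=4)
params = SamplingParams(temperature=0.0, max_tokens=16)

# Route each dialogue understanding task to its own adapter.
llm.generate(
    ["Bot: Book a ride to Central Park now?\nUser: I like to take a walk."],
    params,
    lora_request=LoRARequest("affirm_deny", 1, "/adapters/affirm_deny"),
)
llm.generate(
    ["Extract the destination: book a ride to Central Park."],
    params,
    lora_request=LoRARequest("slot_filling", 2, "/adapters/slot_filling"),
)
```

Because each request carries its own LoRARequest, a single deployment covers every task while paying the memory cost of the base model only once.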

Transfer Learning

Before the rise of large language models (LLMs), multi-task learning was a useful strategy, as it pushed models to learn representations that benefited multiple tasks, leading to better generalization to new tasks within the same domain. Since labeled datasets are expensive to build and tend to be small, finding related tasks and using labeled datasets for those tasks could potentially improve performance.

With modern LLMs, however, labeled datasets used during fine-tuning are often dwarfed by pretraining datasets, sometimes by several orders of magnitude. The trillions of tokens used during pretraining have greatly reduced the need to focus on learning representations from scratch. Additionally, dialogue understanding tasks do not overlap significantly, so there is limited benefit from adopting multi-task learning in this context.

Mixing datasets is hard

When we do multi-task learning, one of the questions that arises is how to mix datasets from different tasks. The naturally available datasets for different tasks often vary in size, and tasks also differ in difficulty. When we mix these datasets, it’s easy for tasks with larger datasets to overshadow those with smaller ones, leading to subpar performance on tasks with smaller datasets. On the other hand, if we oversample smaller datasets, we might unnecessarily increase the time and cost required for fine-tuning, making the process inefficient from an engineering perspective.
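One common mitigation is temperature-scaled sampling: draw from dataset i with probability proportional to n_i^(1/T), so T = 1 recovers size-proportional mixing and larger T flattens the mix toward uniform. A small sketch, with made-up dataset sizes:

```python
# Sketch: temperature-scaled dataset mixing. Sizes are made up for illustration.
import random

def mixing_weights(sizes: dict[str, int], temperature: float) -> dict[str, float]:
    """Sampling probability for each task, proportional to n_i ** (1 / T)."""
    scaled = {task: n ** (1.0 / temperature) for task, n in sizes.items()}
    total = sum(scaled.values())
    return {task: w / total for task, w in scaled.items()}

sizes = {"intent": 50_000, "slot_filling": 8_000, "affirm_deny": 1_200}
print(mixing_weights(sizes, temperature=1.0))  # large tasks dominate
print(mixing_weights(sizes, temperature=3.0))  # flatter; small tasks upweighted

# Pick the source dataset for the next training example.
weights = mixing_weights(sizes, temperature=3.0)
task = random.choices(list(weights), weights=list(weights.values()))[0]
```

Even with such heuristics, the temperature is one more hyperparameter to tune, and any change to one dataset shifts the balance for all the others.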

Divide and conquer

When we introduce a new task or find that the performance of a specific task needs improvement, adopting multi-task learning can be cumbersome. This is because we would need to retrain the entire model, going through the same balancing act between different tasks to ensure there is no performance regression with the new data.
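With per-task adapters, that burden largely goes away: each task owns an adapter that can be retrained in isolation. A sketch using the HuggingFace PEFT library, where the model name, rank, and save path are illustrative assumptions:

```python
# Sketch: one LoRA adapter per task with HuggingFace PEFT.
# Model name, rank, and save path are illustrative.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Retraining "affirm_deny" touches only this adapter; the base weights and the
# other tasks' adapters stay frozen, so there is no cross-task regression.
model = get_peft_model(base, config)
# ... train on the affirm_deny dataset only ...
model.save_pretrained("/adapters/affirm_deny")
```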

Parting words

When we built the second version of our DU module, the choice between multi-task and single-task instruction tuning was not as clear-cut. At the time, efficient LoRA serving was not widely known. But once S-LoRA came out, and especially after the leading inference libraries added support for it, the decision gradually became apparent: a “divide and conquer” approach, where models are fine-tuned separately for specific tasks, offers more flexibility and precision without the burden of retraining an entire model.

References:

https://arxiv.org/pdf/2308.10792 (Instruction Tuning for Large Language Models: A Survey)

https://arxiv.org/abs/2308.06522 (S-LoRA: Serving Thousands of Concurrent LoRA Adapters)
