Large Language Models (LLMs) have very strong general-purpose capabilities which allow them to solve complex problems. Many organizations use LLMs for a variety of use cases such as classifying text, generating creative content, answering questions and extracting knowledge.
To improve an LLM's accuracy in a specific domain, AI engineers apply different techniques to increase the model's awareness of that domain. These include prompt engineering and Retrieval Augmented Generation (RAG), which provide better context for the LLM.
Some AI engineers also fine-tune LLMs to improve performance on a specific task such as classifying user intents. However, these techniques are limited when trying to achieve full domain adaptation for an LLM. Full domain adaptation means that a model can tackle multiple domain-specific tasks at the same time. For example, having one model that can classify intents, detect sentiment and answer questions about a platform like Wix.
To achieve full domain adaptation of an LLM at Wix, we applied multi-task supervised fine-tuning (SFT) using adapters and domain adaptation using full weights fine-tuning (DAPT). These techniques require extensive data collection methods, advanced modeling processes and dedicated evaluation benchmarks.
In this post we introduce the challenges and opportunities of customizing LLMs and discuss how to achieve full domain adaptation. In upcoming posts and at the Wix meetup on September 16th, we will share more details on how we built a custom LLM for Wix use cases with limited data and tokens.
Our smaller, customized Wix LLM showed better results than GPT-3.5 models on a variety of Wix tasks and opened the door for more impact in the organization. The posts will cover the main components of our method: evaluation, modeling and creating training datasets.
Watch Lior Sidi's talk: Customizing LLMs to Enterprise Data: The Wix Journey
The Generative AI Project Life Cycle
The generative AI project life cycle is a common paradigm taught in many tutorials and courses. In Figure 1 we present our extended version of it. Note that the Adapt & align model section includes the 3 main customization techniques: prompt engineering, augmentation using RAG and fine-tuning.
Figure 1: The expanded generative AI project life cycle
In Table 1 we map the advantages and disadvantages of each customization technique. The table shows the tradeoff between development complexity and the level of domain adaptation achieved by each technique. Multi-task supervised fine-tuning combined with full-weight fine-tuning allows better domain adaptation, but is limited to specific LLM families and requires significant development effort and extensive training datasets.
Prompt engineering, RAG and task-specific fine-tuning are more common and easier to apply. This may make it seem like these approaches are the obvious choice in many cases. However, it’s important to remember that these techniques have some fundamental limitations such as high cost, high latency, and model hallucination. These are caused by the following issues:
Lack of multitasking: Only one domain task can be handled at a time. Ideally, custom models should perform well on various domain tasks. This allows AI engineers to build simpler solutions without overloading the LLM with too many customizations.
Training data is not domain aware: The training data of common base LLMs does not necessarily contain all the relevant and up-to-date information about the domain. Moreover, flaws in the training data, such as negative comments, wrong facts and typos, can harm the model's knowledge. Using an organization's internal data and knowledge to create high-quality training datasets allows LLMs to be more domain aware.
Model size: The huge size of common base LLMs is useful for general understanding and creative tasks. However, model size might also reduce accuracy on customized tasks and increase cost and latency. Having a smaller model that's specifically fine-tuned for domain tasks solves these problems.
Prompt complexity: Prompt engineering and RAG require detailed prompts that include a lot of organizational context. Longer prompts mean more tokens, which increases cost, and overly detailed guidelines in the prompt can cause overfitting and harm accuracy.
Overfitting: The prompt fine-tuning services provided by LLM vendors don't actually help in achieving cross-domain capabilities. In many cases, they simply overfit the specific prompt. Therefore, prompt fine-tuning is not relevant when customizing LLMs for multitasking.
Table 1: Customization technique pros and cons
Full domain adaptation using fine-tuning
As LLM development progresses, the techniques for customizing LLMs for specific domains are advancing. The dominant techniques today are to fine-tune LLMs using domain adaptation (DAPT) and supervised fine-tuning (SFT). Both of the techniques have already been applied to common domains such as finance, law, chip design and more.
In DAPT, we either train the LLM from scratch or continue training the pre-trained LLM's weights on domain-specific text. For SFT, we usually apply instructive training on domain-specific tasks using adapters.
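To make the distinction concrete, here is a minimal sketch of DAPT-style continued pretraining with Hugging Face Transformers. The model name, data file and training arguments are illustrative placeholders, not our exact setup:

```python
# A minimal DAPT sketch: continue causal-LM pretraining on raw domain text.
# The model name and data file below are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Raw domain text (articles, documentation, dialogues), one document per line.
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="dapt-checkpoint",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=16,
                           num_train_epochs=1),
    train_dataset=tokenized,
    # mlm=False gives plain next-token (causal) language modeling
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

An SFT run looks similar, except that it trains adapter weights on instruction-formatted examples instead of raw text (see the sketch in the Modeling section below).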
When doing domain adaptation for new “smaller” domains, the story does not end with the training approach. A lot of work is required to understand the training hyperparameters, curate extensive domain-specific data, deeply understand the domain’s knowledge sources, and develop benchmarks and evaluations for tasks. Collecting training data is especially difficult for domains with small amounts of custom data since DAPT requires datasets with billions of tokens.
The components of full domain adaptation
So, you decided to customize your LLM? Great! But where do you start? What’s the process?
When we first started the project at Wix we were overwhelmed by the amount of content and noise around fine-tuning. We had to go back to fundamental data science principles to address the challenge. The 3 fundamental components of every data science project are training data, modeling and evaluation. Keeping these three components in mind provides a great starting point for dividing up the project and staying focused.
Figure 2: The fine-tuning process at Wix
Evaluation
Every data science project should start with evaluation. Understanding a model’s goals will help us to make better decisions when building the model and preparing datasets. Open LLM benchmarks are a great way to estimate general purpose capabilities. However, for custom models we need custom benchmarks to estimate how knowledgeable a model is about the domain and how good it is at solving a variety of the domain tasks.
These custom benchmarks are especially handy when evaluating other vendors' LLMs, as some of them might be overfitted to open benchmarks.
To estimate the knowledge of the LLM we built a custom Wix Question and Answers (Q&A) dataset. The question and answer data was taken from existing customer service live chats and FAQs. Because the answer (the label) in this case is free text, we applied the LLM-as-a-judge technique as presented in Figure 3.
The “judge” is a prompt that compares answers suggested by an LLM to the ground-truth one. After assessing the performance of several open-source LLMs as judges, we decided to build our own prompt for estimating answer quality. This prompt outperformed all other open solutions because our experts had deeper domain experience. Having a solid, reliable metric is a must! Otherwise you're just shooting in the dark.
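As an illustration, a judge prompt along these lines scores a candidate answer against the ground truth. The wording and the 1-5 scale are a simplified sketch, not our production prompt:

```python
# A simplified LLM-as-a-judge sketch. The prompt wording and 1-5 scale
# are illustrative, not the production judge prompt.
JUDGE_PROMPT = """You are grading a customer-support answer.

Question: {question}
Ground-truth answer: {reference}
Candidate answer: {candidate}

Score the candidate from 1 (contradicts the ground truth) to 5
(fully consistent and complete). Reply with the score only."""

def judge(llm_call, question, reference, candidate):
    """llm_call is any function that sends a prompt to an LLM and returns text."""
    reply = llm_call(JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate))
    return int(reply.strip()[0])  # naive score parsing, fine for a sketch
```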
For task capabilities estimation, we used common, domain-specific, text-based learning tasks. In Wix's case, these included customer intent classification, customer segmentation, custom domain summarization, and sentiment analysis.
We also combined knowledge and task-based evaluation using a technique presented by Microsoft. They turned the Q&A task from free-text to multiple-choice question answering. This was done by using the correct answer for a question to generate 3 other options: one answer that is similar but slightly wrong, and two completely wrong answers. The task of the model is to choose the right option. This allows us to estimate the model's knowledge as well as its ability to follow instructions.
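A generation prompt along the following lines can produce such options. This is a sketch of the idea, not the exact prompt from Microsoft's paper or from our pipeline:

```python
import random

# Sketch: turn a free-text Q&A pair into a 4-option multiple-choice item.
# The prompt is an illustrative placeholder.
DISTRACTOR_PROMPT = """Question: {question}
Correct answer: {answer}

Write three incorrect options for a multiple-choice test:
1. An answer that is similar to the correct one but slightly wrong.
2. A completely wrong answer.
3. Another completely wrong answer.
Return one option per line."""

def to_multiple_choice(llm_call, question, answer):
    distractors = llm_call(DISTRACTOR_PROMPT.format(
        question=question, answer=answer)).strip().splitlines()[:3]
    options = distractors + [answer]
    random.shuffle(options)
    return {"question": question, "options": options,
            "label": options.index(answer)}
```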
For task evaluation, it’s important to decide on one fixed prompt per task. This should be a simple prompt that hasn’t been optimized for specific LLM model families.
Figure 3: Development process for LLM-as-a-judge
Training data
Training data is the core element of every data science project. This is especially true for LLMs which require a lot of good data. The tricky part with LLMs is that they can learn everything, including typos, curses and confidential information. That's why we invested in creating training data that’s high in both quality and quantity.
The industry best practice is to use billions of tokens of data for full LLM fine-tuning. However, we must keep in mind that knowledge about many domains is already included in pre-trained LLMs. For example, all LLMs are already aware of website building and Wix products.
When designing training datasets, one of the most important hyperparameters for fine-tuning is the sampling ratio between data sources.
In addition, in order to maintain a pre-trained LLM’s performance on common knowledge, the model must also train on public data. The goal of domain-specific fine-tuning is to increase the knowledge around a specific domain. This can be done by increasing the ratio of domain-specific data in the LLM’s training to 2% as opposed to the 0.00000001% in the pre-trained model.
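As a rough sketch of what this sampling can look like in practice (the 2% figure comes from the paragraph above; the file names are placeholders):

```python
# Sketch: interleave domain and public corpora so roughly 2% of the
# training examples come from the domain. File names are placeholders.
from datasets import interleave_datasets, load_dataset

domain = load_dataset("text", data_files="domain_corpus.txt")["train"]
public = load_dataset("text", data_files="public_corpus.txt")["train"]

mixed = interleave_datasets(
    [domain, public],
    probabilities=[0.02, 0.98],  # domain ratio raised to ~2%
    seed=42,
)
```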
Completion-based training data should be raw free text, such as articles, technical documentation, and real-world dialogues. However, using internal data completely unprocessed is problematic, as it may contain mistakes and confidential information.
For the instructive training data we used labeled data from other NLP projects at Wix such as sentiment analysis and customer intent.
The amount of manually created data was still very limited. Therefore, in order to increase the dataset, we synthetically generated Q&As using organizational data such as knowledge base articles, customer support chats, technical documentation, and internal reports. We also generated reading comprehension tasks as discussed above.
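As an illustration, synthetic Q&A generation can be as simple as prompting a strong LLM over each internal document. The prompt below is a sketch, not our actual pipeline:

```python
# Sketch: generate synthetic Q&A pairs from an internal article.
# The prompt is an illustrative placeholder.
QA_GEN_PROMPT = """Read the article below and write {n} question-answer pairs
that a customer might ask, using only facts from the article.
Format each pair as:
Q: <question>
A: <answer>

Article:
{article}"""

def generate_qa_pairs(llm_call, article, n=5):
    return llm_call(QA_GEN_PROMPT.format(article=article, n=n))
```

The generated pairs still need quality filtering before entering the training set, since, as noted above, LLMs learn flaws as readily as facts.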
Modeling
When it comes to LLM modeling, we didn’t want to reinvent the wheel. Each LLM family has already developed its own optimized training recipes.
When modeling an LLM, we recommend choosing one LLM family to start with, based on these considerations:
Choose LLMs that have already been evaluated for domain adaptation with common training approaches (DAPT & SFT). Proprietary LLMs are not a good fit for this since their training recipes are not publicly available.
Choose an LLM model that performs well on the benchmarks you created. Good performance may indicate that it already holds some domain information. In this case, you may be able to use far fewer domain-specific tokens for training. For example, Wix and website building in general are already known to all commonly available LLMs. We chose the ones that already knew us best, which eliminated the need to fine-tune a model from scratch.
Keep in mind that if your benchmark tasks don’t require long context and output, you might consider using small language models (SLM) such as T5 which are easier to train and serve.
For the training process, you should be aware of the following hyperparameters:
Completion vs full-prompt training - When training the model, you can either fit it only on the next-word predictions that follow the context, or fit it on the input prompt as well. Having the input prompt as part of the training allows the model to learn from more tokens about the domain, but it can result in the model producing repetitive text (see the sketch after this list).
LoRA rank - SFT uses adapters for training. The adapter rank is the main hyperparameter in this case and reflects the complexity and amount of knowledge to learn. In the case of small training datasets, you should use a lower-rank adapter.
Use only SFT - If you don't have enough completion data and/or you have limited computational resources, you should avoid DAPT fine-tuning of the entire network. Instead, you can use a higher adapter rank and fit it on both completion data and tasks.
Infrastructure - Training LLMs requires dedicated, powerful GPU machines. If you don't have a local GPU, you should decide on a cloud provider with available resources. We decided to limit our work to one high-power GPU, an AWS P5 instance, which allowed us to experiment with full-scale fine-tuning and LoRA with high ranks on Llama 2 7B.
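To ground the completion-vs-full-prompt and LoRA-rank points above, here is a minimal sketch using the peft and trl libraries. The rank, target modules and response template are illustrative choices, not our exact configuration, and trl's API details vary between versions:

```python
# Sketch: LoRA-based SFT where the loss is computed only on the completion.
# Rank, target modules and the "### Answer:" template are illustrative.
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM, SFTTrainer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Toy example; real rows would come from the instructive datasets above.
train_dataset = Dataset.from_list([
    {"text": "### Question: How do I connect a domain?\n"
             "### Answer: Open the dashboard and follow the domain flow."},
])

lora = LoraConfig(
    r=16,            # lower rank for small datasets, higher for SFT-only setups
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Masks everything before "### Answer:" so only completion tokens get loss;
# dropping this collator switches to full-prompt training.
collator = DataCollatorForCompletionOnlyLM("### Answer:", tokenizer=tokenizer)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    dataset_text_field="text",
    peft_config=lora,
    data_collator=collator,
)
trainer.train()
```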
Summary
LLM domain adaptation with fine-tuning requires a lot of data development effort. Investing in exploring the field allows you to build good practices for estimating the performance of popular models. Through our journey, we realized that it's worth having a customized model for a family of tasks such as code generation, Q&A, classification, knowledge extraction, and creative content generation.
The main tasks that truly require customization in our field are knowledge-based tasks such as Q&A with/without context. We realized this because our most common benchmark was Q&A and the LLMs and RAG solutions we were using were not performing as well as we wanted.
This post's goal was to give you an overview of the LLM domain adaptation world. In the next post and our meetup talk on September 16th, we will share more detailed information, explanations and results from our custom model.
Hopefully it will help boost your journey in customizing your models. :)
Acknowledgments:
This project is the outcome of many people working together including AI researchers, engineers and AI curators: Irma Hadar, Sviatoslav Pykhnivskyi, Hagit Gur, Lin Burg, Stanislav Kovynov, Nataliia Hapanovych, Dvir Cohen, Olga Atzmon, and Gilad Barkan.
This post was written by Lior Sidi
More of Wix Engineering's updates and insights:
Join our Telegram channel
Visit us on GitHub
Subscribe to our YouTube channel