Note: My blog posts are (still) 100% handwritten, without any contribution from an AI model.
Large language models (LLMs) are very powerful, as ChatGPT has demonstrated. In this post, I document some caveats, tips and pointers for anyone interested in using an LLM for an enterprise use-case.
Scope and Audience
This is not a tutorial; that would be too large in scope to fit into a blog post. Instead, I outline a workflow and note important questions and considerations at each step of the workflow.
The target audience is:
- an ML decision-maker looking for strategic guidance (Step 1a), and
- an ML practitioner with technical expertise looking for execution tips (Steps 1a through 3)
Assumptions
First, I assume that the LLM can fit into a single GPU at inference time, i.e. we do not need multiple GPUs when running the final model.
Second, I assume that the use-case requires fine-tuning the LLM, i.e. it is not simple enough to work with zero-shot or few-shot learning.
Third, I assume that we can fine-tune with multiple GPUs on a single cloud instance, i.e. we do not need multiple instances for fine-tuning the model.
The first and third of these are strong assumptions, but they are reasonable for enterprise use-cases with limited datasets.
Finally, the LLM space is evolving rapidly. This post is current as of April 2023, and I assume you are reading it not too long after the time of writing.
Full Workflow
For an enterprise company looking to use LLMs, a typical workflow looks as follows.
Step 0: Check if you need an LLM.
Step 1a: Decide on a model family and size, estimate hardware requirements and running costs.
Step 1b: Verify model with initial deployment.
Step 2: Fine-tune the model for your task and validate manually on a test set.
Step 3: Deploy the fine-tuned model.
Next, I will add some notes for each step of the process above.
Step 0/3: Do you need an LLM?
The following questions are worth asking to determine whether you need an LLM.
- Do you have a business use-case that needs natural-language understanding? Examples are summarization, translation and question-answering.
- LLMs work best when fine-tuned on a custom natural-language dataset. Do you already have a dataset you can use to fine-tune the LLM?
- Have you (or your ML team) tried a smaller language model? A model with millions (not billions) of parameters may do the job at lower cost. This depends on the size and complexity of your dataset.
If you have decided that you need an LLM, proceed to the next step.
Step 1a/3: Deciding on an LLM and estimating running costs
Here are some constraints that can help you narrow down your choice of LLM.
Are the model weights open? Is the licensing OK?
You will need a model whose weights are openly available, with business-friendly licensing terms. ChatGPT is not open-source as of this writing, for example. Meta LLaMA weights are available, but closed for commercial use. Google FLAN-T5 and BLOOM come with lenient licensing terms and can be good candidates. However, BLOOM is huge at 176 billion parameters, so that leaves us with the FLAN family, as of April 2023.
What is the size of the model?
Model size is the number of parameters. Within a model family, multiple sizes can exist, such as FLAN-Base, Large, X-Large and XX-Large. It is best to start with the smallest model in the family. Later in this section, I will describe how to estimate memory required for a given model.
Also important is the model's context window size; this corresponds to its “memory” when processing text. For example, the GPT-3.5 context window is 4,096 tokens, or roughly 3,000 English words; the FLAN default is 512 tokens.
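As a quick sanity check, you can read the default context window off the model's tokenizer. Here is a minimal sketch using the Hugging Face transformers library; the model name is just an example.

# check a model's default context window (requires the transformers library)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
print(tokenizer.model_max_length)  # 512 for FLAN-T5 by default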
What is the best model architecture for your use-case?
Summarization and question-answering can be done by decoder-only models such as GPT or XLNet. Sentiment detection can be done by encoder-only models such as BERT. Encoder-decoder models such as BART or T5 can do both of these and are especially good at language translation.
Does the model need one GPU or multiple GPUs?
If the model can fit into a single GPU's memory, the serving architecture is simpler. If the model is too big for a single GPU, additional software is needed to distribute the model across GPUs, which makes the architecture more complex. As of this writing, T4 GPUs, which are built for inference workloads, come with 16GB of memory.
Multiple GPUs are very likely still needed for fine-tuning the LLM. For such distributed training, libraries such as Hugging Face Accelerate and Microsoft DeepSpeed are required.
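Before committing to a model size, it is worth checking how much memory your GPUs actually have. A minimal PyTorch sketch:

# list the memory available on each visible GPU (requires PyTorch with CUDA)
import torch

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1e9:.1f} GB")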
Do you know the right GPU type for training and inference?
Although AMD also makes GPUs, NVIDIA is the market leader. Some GPUs, such as the NVIDIA T4, are specialized for inference workloads. Others, such as the NVIDIA A100 and V100, can handle both training and inference. Choose the GPU type based on whether you are fine-tuning the model or running inference on it.
How often will the model run?
Do you plan to run the model on a periodic basis, or do you want to run it all the time? If you only want to run it every now and then, can you tolerate some startup delay? Answers to these questions will determine where to run the model.
How much will it cost to run the model?
LLMs require a lot of GPU memory. Estimating this will inform the instance type you will need to run the model. From there, it is straightforward to estimate the cost to run the model.
Estimating Hardware Requirement
To estimate the memory required at inference time, first check whether the model is fp32 (4 bytes per parameter) or fp16 (2 bytes per parameter). For example, the FLAN family is fp32, while LLaMA is fp16. At the XL size, FLAN has 3 billion parameters, so it would need 3 billion parameters x 4 bytes/parameter = 12GB of memory. This can fit into a T4 GPU, which has 16GB of memory.
Note that the model is first loaded into system memory and then moved to the GPU. This means the instance should have enough RAM to load the model, even if the model ultimately lives on the GPU. For example, on AWS, T4 instances come with 16GB of system memory (g4dn.xlarge) or 32GB (g4dn.2xlarge). The 16GB system can fail to load a 12GB model once operating-system and framework overhead is accounted for, even though the GPU itself has 16GB of memory.
It is possible to reduce the memory requirement with quantization techniques. LLM.int8() can reduce memory by 75% for an fp32 model, or 50% for an fp16 model. Although the paper reports no loss in performance from the quantization, it is best to verify that claim empirically on your own dataset.
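With the Hugging Face transformers library (plus the accelerate and bitsandbytes packages), loading a model in 8-bit is a small change at load time. Here is a sketch; the model name is just an example, and you should still check output quality on your own data.

# load a model with LLM.int8() quantization (requires transformers, accelerate, bitsandbytes)
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-xl",
    device_map="auto",   # place the model on the available GPU(s)
    load_in_8bit=True,   # int8 quantization via bitsandbytes
)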
Additional memory is required for the input and output data, CUDA kernel, etc. This needs to be determined empirically, but for estimation purposes, we can add a buffer of 20% on top of the calculation above.
To summarize, here is a formula that can help you estimate the memory required for your model:
# memory estimation for inference only
num_model_params = 3e9        # e.g. FLAN XL has ~3 billion parameters
mem_per_param = 4             # applies for fp32; use 2 for fp16
additional_mem_buffer = 0.2   # 20% buffer memory
total_mem = num_model_params * mem_per_param * (1 + additional_mem_buffer)  # in bytes
Fine-tuning needs a lot more memory; we will discuss it in the fine-tuning section.
Estimating Running Cost
Now that we have a first estimate for our hardware, let us next try to calculate the running cost.
First, let us consider the case where you want to run the model all the time.
Suppose you have an ML-serving solution and you only want to know the hardware cost. (ML-serving solutions provide an API endpoint and queue up or distribute inference queries to a back-end farm.)
Once you know the memory required, as estimated previously, you can look up your cloud provider's pricing tables to determine an appropriate instance type. For example, for the FLAN-XLarge model we chose above, g4dn.2xlarge is the smallest AWS instance type with enough system and GPU memory. It costs $0.75/hr on-demand. If you choose to keep the instance running at all times, you can reserve it at $0.47/hr, which works out to roughly $4.1K per year.
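The back-of-the-envelope calculation looks like this, using the reserved rate quoted above:

# annual cost of an always-on g4dn.2xlarge at the reserved rate (April 2023 pricing)
reserved_rate = 0.47               # $/hr
hours_per_year = 24 * 365
annual_cost = reserved_rate * hours_per_year
print(f"${annual_cost:,.0f} per year")   # roughly $4.1K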
Alternatively, if you don't have an ML-serving solution, or if you will only use the model infrequently, consider a managed inference service such as Amazon SageMaker. These services have specific offerings, at varying cost levels, for models that cannot tolerate startup delay and for models that run for a long time. In my experience they are typically 20-30% costlier than the bare hardware, but the overhead is acceptable if you don't expect to use the model heavily: building a scalable, stable and responsive ML inference solution yourself requires a non-trivial amount of engineering.
Note that even if you use an inference service, it is still worth calculating the cost of running the model yourself, to justify with numbers whether it is worth using an inference service.
Step 1b/3: Verify that your chosen LLM can actually run on your selected hardware
At this point, it is worth going hands-on. Run the chosen model on the selected instance type and verify that it works as expected.
If you’re doing this on AWS, Amazon provides LMI containers with relevant deep learning libraries and NVIDIA GPU drivers preloaded.
Instead of writing your inference code from scratch, use already-written, working code if possible. (Inference example for FLAN-T5 XL)
You only need to load the model and run prediction on a representative prompt. Note inference latencies, maximum inference batch size, etc.
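For reference, a minimal sketch of such a smoke test with the transformers library is below; the prompt is a placeholder that you should replace with a representative one from your use-case.

# minimal smoke test: load the model, run one prompt, note the latency
# (requires transformers and a CUDA GPU with enough memory, as estimated earlier)
import time
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-xl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to("cuda")

prompt = "Summarize: <a representative document from your use-case>"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

start = time.time()
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(f"Latency: {time.time() - start:.2f}s")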
If this step is successful, send out a report and get buy-in from the relevant stakeholders on the chosen model, its inference performance metrics and its estimated running cost. Keep in mind that the model may still fail to do a good job after the next step, fine-tuning, so be sure to mark the report as tentative.
Step 2/3: Fine-tune your LLM
After deciding on a model that can run within the allowed budget, you will next fine-tune it for your enterprise use-case. As mentioned earlier, we need a dataset for fine-tuning the LLM; the dataset should also include the relevant prompts.
Fine-tuning the model needs multiple high-end GPUs, which means A100s as of this writing. A100s are expensive ($32/hr on-demand for an 8-GPU AWS instance as of March 2023). However, depending on how often you wish to update the model, this is a one-time or infrequent cost. After fine-tuning is done, running the model is cheapest on T4 GPUs ($0.53/hr as of March 2023).
As of this writing, AWS does not allow renting of individual A100s, instead giving them in groups of 8s. This makes AWS extra expensive for fine-tuning. On Azure and Google Cloud, you can rent 1, 2 or 4 A100s and thereby pay less.
Because fine-tuning involves multiple A100s, you will need libraries for distributed training. As of this writing, Accelerate and DeepSpeed are libraries that help with this. NVIDIA also has a library called Megatron, but I have not tried it.
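To give a flavor of what Accelerate-based fine-tuning looks like, here is a heavily simplified sketch. The in-memory placeholder data, model size and hyperparameters are mine, not from any particular example; in practice you would launch it with accelerate launch after selecting multi-GPU or DeepSpeed via accelerate config.

# simplified multi-GPU fine-tuning loop with Hugging Face Accelerate (sketch only)
import torch
from accelerate import Accelerator
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

accelerator = Accelerator()
model_name = "google/flan-t5-base"   # start small, then scale up
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# placeholder data: replace with your own prompt/response pairs
pairs = [("summarize: the quick brown fox jumped over the lazy dog", "a fox story")] * 8

def collate(batch):
    inputs = tokenizer([p for p, _ in batch], return_tensors="pt", padding=True)
    labels = tokenizer([t for _, t in batch], return_tensors="pt", padding=True).input_ids
    labels[labels == tokenizer.pad_token_id] = -100   # ignore padding in the loss
    inputs["labels"] = labels
    return inputs

loader = DataLoader(pairs, batch_size=2, shuffle=True, collate_fn=collate)
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

model.train()
for batch in loader:
    loss = model(**batch).loss
    accelerator.backward(loss)   # works with DDP or DeepSpeed under the hood
    optimizer.step()
    optimizer.zero_grad()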
With FLAN models, you can increase the context window when fine-tuning; this is useful, for example, if you have larger documents to summarize.
To reduce the memory requirement when fine-tuning, use the bf16 data type. Note that bf16 is not supported on V100 GPUs; it requires the newer A100s.
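If you fine-tune with the Hugging Face Trainer, bf16 can be enabled through a single training argument; a sketch (the output directory and batch size are placeholders):

# enable bf16 mixed precision with the Hugging Face Trainer (requires an A100-class GPU)
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-xl-finetuned",   # placeholder
    bf16=True,                           # use bfloat16 to reduce memory vs fp32
    per_device_train_batch_size=4,       # placeholder; tune to your GPU memory
)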
It is best to start with known working code if possible (fine-tuning example for FLAN-T5 XL) and adapt it to your custom dataset. I started with one A100 and kept adding GPUs until the job no longer ran out of memory. I could fine-tune on my dataset in about 3 hours.
I don’t know of a formula for theoretically estimating how much memory fine-tuning needs; it depends on the batch size, the chosen optimizer, and the intermediate state (activations and gradients).
When fine-tuning is complete, you can take the artifacts from the final checkpoint as the fine-tuned model.
It is worth running the model on the test set and checking its performance manually. Are the results good enough on a set of sample prompts? Use visual inspection and look for hallucinations, false claims, or misleading results. This bit is still very much manual.
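One simple way to do this inspection is to run the fine-tuned checkpoint over a handful of held-out prompts and read the outputs; a sketch with placeholder paths and prompts:

# spot-check the fine-tuned checkpoint on held-out prompts (paths and prompts are placeholders)
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint_dir = "output/checkpoint-final"
tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint_dir).to("cuda")

test_prompts = ["summarize: <held-out document 1>", "summarize: <held-out document 2>"]
for prompt in test_prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    output = model.generate(**inputs, max_new_tokens=128)
    print(prompt[:60], "=>", tokenizer.decode(output[0], skip_special_tokens=True))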
Step 3/3: Running the model
If the model works to your satisfaction, it is time to deploy it.
Deployment setup and steps vary from company to company. If you have an in-house ML infrastructure team, you should consult them on how to deploy your model.
If you use SageMaker directly, use the appropriate service depending on your model's usage pattern, as discussed earlier in the “Estimating Running Cost” section.
Be sure to collect user feedback and performance metrics on how the model is working in practice.
Conclusion
In this post, we looked at a workflow for bringing LLMs to a business use-case. We discussed pertinent questions for selecting a model and estimating its running cost. We then noted some considerations for fine-tuning the model. Finally, we mentioned deployment choices based on the model's characteristics.
For an academic introduction to large language models, I recommend lecture notes from Stanford CS324.