Optimizing your LLM in production

Note : This blog post is also available as a documentation page on Transformers .

Large Language Models (LLMs) such as GPT3/4, Falcon , and LLama are rapidly advancing in their ability to tackle human-centric tasks, establishing themselves as essential tools in modern knowledge-based industries. Deploying these models in real-world tasks remains challenging, however:

To exhibit near-human text understanding and generation capabilities, LLMs currently require to be composed of billions of parameters (see Kaplan et al , Wei et. al ). This consequently amplifies the memory demands for inference.
In many real-world tasks, LLMs need to be given extensive contextual information. This necessitates the model's capability to manage very long input sequences during inference.

The crux of these challenges lies in augmenting the computational and memory capabilities of LLMs, especially when handling expansive input sequences.

In this blog post, we will go over the most effective techniques at the time of writing this blog post to tackle these challenges for efficient LLM deployment:

Lower Precision : Research has shown that operating at reduced numerical precision, namely 8-bit and 4-bit, can achieve computational advantages without a considerable decline in model performance.
Flash Attention: Flash Attention is a variation of the attention algorithm that not only provides a more memory-efficient approach but also realizes increased efficiency due to optimized GPU memory utilization.
Architectural Innovations: Considering that LLMs are always deployed in the same way during inference, namely autoregressive text generation with a long input context, specialized model architectures have been proposed that allow for more efficient inference. The most important advancement in model architectures hereby are Alibi , Rotary embeddings , Multi-Query Attention (MQA) and Grouped-Query-Attention (GQA) .

Throughout this notebook, we will offer an analysis of auto-regressive generation from a tensor's perspective. We delve into the pros and cons of adopting lower precision, provide a comprehensive exploration of the latest attention algorithms, and discuss improved LLM architectures. While doing so, we run practical examples showcasing each of the feature improvements.

1. Harnessing the Power of Lower Precision

Memory requirements of LLMs can be best understood by seeing the LLM as a set of weight matrices and vectors and the text inputs as a sequence of vectors. In the following, the definition weights will be used to signify all model weight matrices and vectors.

At the time of writing this post, LLMs consist of at least a couple billion parameters. Each parameter thereby is made of a decimal number, e.g. 4.5689 which is usually stored in either float32 , bfloat16 , or float16 format. This allows us to easily compute the memory requirement to load the LLM into memory:

Loading the weights of a model having X billion parameters requires roughly 4 * X GB of VRAM in float32 precision

Nowadays, models are however rarely trained in full float32 precision, but usually in bfloat16 precision or less frequently in float16 precision. Therefore the rule of thumb becomes:

Loading the weights of a model having X billion parameters requires roughly 2 * X GB of VRAM in bfloat16/float16 precision

For shorter text inputs (less than 1024 tokens), the memory requirement for inference is very much dominated by the memory requirement to load the weights. Therefore, for now, let's assume that the memory requirement for inference is equal to the memory requirement to load the model into the GPU VRAM.

To give some examples of how much VRAM it roughly takes to load a model in bfloat16:

GPT3 requires 2 * 175 GB = 350 GB VRAM
Bloom requires 2 * 176 GB = 352 GB VRAM
Llama-2-70b requires 2 * 70 GB = 140 GB VRAM
Falcon-40b requires 2 * 40 GB = 80 GB VRAM
MPT-30b requires 2 * 30 GB = 60 GB VRAM
bigcode/starcoder requires 2 * 15.5 = 31 GB VRAM

As of writing this document, the largest GPU chip on the market is the A100 offering 80GB of VRAM. Most of the models listed before require more than 80GB just to be loaded and therefore necessarily require tensor parallelism and/or pipeline parallelism .

🤗 Transformers does not support tensor parallelism out of the box as it requires the model architecture to be written in a specific way. If you're interested in writing models in a tensor-parallelism-friendly way, feel free to have a look at the text-generation-inference library .

Naive pipeline parallelism is supported out of the box. For this, simply load the model with device="auto" which will automatically place the different layers on the available GPUs as explained here . Note, however that while very effective, this naive pipeline parallelism does not tackle the issues of GPU idling. For this more advanced pipeline parallelism is required as explained here .

If you have access to an 8 x 80GB A100 node, you could load BLOOM as follows

!pip install transformers accelerate bitsandbytes optimum

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom", device_map="auto", pad_token_id=0)

By using device_map="auto" the attention layers would be equally distributed over all available GPUs.

In this notebook, we will use bigcode/octocoder as it can be run on a single 40 GB A100 GPU device chip. Note that all memory and speed optimizations that we will apply going forward, are equally applicable to models that require model or tensor parallelism.

Since the model is loaded in bfloat16 precision, using our rule of thumb above, we would expect the memory requirement to run inference with bigcode/octocoder to be around 31 GB VRAM. Let's give it a try.

We first load the model and tokenizer and then pass both to Transformers' pipeline object.

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

model = AutoModelForCausalLM.from_pretrained("bigcode/octocoder", torch_dtype=torch.bfloat16, device_map="auto", pad_token_id=0)
tokenizer = AutoTokenizer.from_pretrained("bigcode/octocoder")

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

prompt = "Question: Please write a function in Python that transforms bytes to Giga bytes.\n\nAnswer:"

result = pipe(prompt, max_new_tokens=60)[0]["generated_text"][len(prompt):]
result

Output :

Here is a Python function that transforms bytes to Giga bytes:\n\n```python\ndef bytes_to_giga_bytes(bytes):\n    return bytes / 1024 / 1024 / 1024\n```\n\nThis function takes a single

Nice, we can now directly use the result to convert bytes into Gigabytes.

def bytes_to_giga_bytes(bytes):
  return bytes / 1024 / 1024 / 1024

Let's call torch.cuda.max_memory_allocated to measure the peak GPU memory allocation.

bytes_to_giga_bytes(torch.cuda.max_memory_allocated())

Output :

29.0260648727417

Close enough to our back-of-the-envelope computation! We can see the number is not exactly correct as going from bytes to kilobytes requires a multiplication of 1024 instead of 1000. Therefore the back-of-the-envelope formula can also be understood as an "at most X GB" computation. Note that if we had tried to run the model in full float32 precision, a whopping 64 GB of VRAM would have been required.

Almost all models are trained in bfloat16 nowadays, there is no reason to run the model in full float32 precision if your GPU supports bfloat16 . Float32 won't give better inference results than the precision that was used to train the model.

If you are unsure in which format the model weights are stored on the Hub, you can always look into the checkpoint's config under "torch_dtype" , e.g. here . It is recommended to set the model to the same precision type as written in the config when loading with from_pretrained(..., torch_dtype=...) except when the original type is float32 in which case one can use both float16 or bfloat16 for inference.

Let's define a flush(...) function to free all allocated memory so that we can accurately measure the peak allocated GPU memory.

del pipe
del model

import gc
import torch

def flush():
  gc.collect()
  torch.cuda.empty_cache()
  torch.cuda.reset_peak_memory_stats()

Let's call it now for the next experiment.

flush()

In the recent version of the accelerate library, you can also use an utility method called release_memory()

from accelerate.utils import release_memory
# ...

release_memory(model)

Now what if your GPU does not have 32 GB of VRAM? It has been found that model weights can be quantized to 8-bit or 4-bits without a significant loss in performance (see Dettmers et al. ). Model can be quantized to even 3 or 2 bits with an acceptable loss in performance as shown in the recent GPTQ paper 🤯.

Without going into too many details, quantization schemes aim at reducing the precision of weights while trying to keep the model's inference results as accurate as possible ( a.k.a as close as possible to bfloat16). Note that quantization works especially well for text generation since all we care about is choosing the set of most likely next tokens and don't really care about the exact values of the next token logit distribution. All that matters is that the next token logit distribution stays roughly the same so that an argmax or topk operation gives the same results.

There are various quantization techniques, which we won't discuss in detail here, but in general, all quantization techniques work as follows:

Quantize all weights to the target precision

Load the quantized weights, and pass the input sequence of vectors in bfloat16 precision

Dynamically dequantize weights to bfloat16 to perform the computation with their input vectors in bfloat16 precision

Quantize the weights again to the target precision after computation with their inputs.

In a nutshell, this means that inputs-weight matrix multiplications, with X X X being the inputs , W W W being a weight matrix and Y Y Y being the output:

Y = X ∗ W Y = X * W Y = X ∗ W

are changed to

Y = X ∗ dequantize ( W ) ; quantize ( W ) Y = X * \text{dequantize}(W); \text{quantize}(W) Y = X ∗ dequantize ( W ) ; quantize ( W )

for every matrix multiplication. Dequantization and re-quantization is performed sequentially for all weight matrices as the inputs run through the network graph.

Therefore, inference time is often not reduced when using quantized weights, but rather increases. Enough theory, let's give it a try! To quantize the weights with Transformers, you need to make sure that the bitsandbytes library is installed.

!pip install bitsandbytes

We can then load models in 8-bit quantization by simply adding a load_in_8bit=True flag to from_pretrained .

model = AutoModelForCausalLM.from_pretrained("bigcode/octocoder", load_in_8bit=True, pad_token_id=0)

Now, let's run our example again and measure the memory usage.

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

result = pipe(prompt, max_new_tokens=60)[0]["generated_text"][len(prompt):]
result

Output :

Here is a Python function that transforms bytes to Giga bytes:\n\n```python\ndef bytes_to_giga_bytes(bytes):\n    return bytes / 1024 / 1024 / 1024\n```\n\nThis function takes a single

Nice, we're getting the same result as before, so no loss in accuracy! Let's look at how much memory was used this time.

bytes_to_giga_bytes(torch.cuda.max_memory_allocated())

Output :

15.219234466552734

Significantly less! We're down to just a bit over 15 GBs and could therefore run this model on consumer GPUs like the 4090. We're seeing a very nice gain in memory efficiency and more or less no degradation to the model's output. However, we can also notice a slight slow-down during inference.

We delete the models and flush the memory again.

del model
del pipe

flush()

Let's see what peak GPU memory consumption 4-bit quantization gives. Quantizing the model to 4-bit can be done with the same API as before - this time by passing load_in_4bit=True instead of load_in_8bit=True .

model = AutoModelForCausalLM.from_pretrained("bigcode/octocoder", load_in_4bit=True, low_cpu_mem_usage=True, pad_token_id=0)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

result = pipe(prompt, max_new_tokens=60)[0]["generated_text"][len(prompt):]
result

Output :

Here is a Python function that transforms bytes to Giga bytes:\n\n```\ndef bytes_to_gigabytes(bytes):\n    return bytes / 1024 / 1024 / 1024\n```\n\nThis function takes a single argument

We're almost seeing the same output text as before - just the python is missing just before the code snippet. Let's see how much memory was required.

bytes_to_giga_bytes(torch.cuda.max_memory_allocated())

Output :

9.543574333190918

Just 9.5GB! That's really not a lot for a >15 billion parameter model.

While we see very little degradation in accuracy for our model here, 4-bit quantization can in practice often lead to different results compared to 8-bit quantization or full bfloat16 inference. It is up to the user to try it out.

Also note that inference here was again a bit slower compared to 8-bit quantization which is due to the more aggressive quantization method used for 4-bit quantization leading to quantize \text{quantize} quantize and dequantize \text{dequantize} dequantize taking longer during inference.

del model
del pipe

flush()

Overall, we saw that running OctoCoder in 8-bit precision reduced the required GPU VRAM from 32G GPU VRAM to only 15GB and running the model in 4-bit precision further reduces the required GPU VRAM to just a bit over 9GB.

4-bit quantization allows the model to be run on GPUs such as RTX3090, V100, and T4 which are quite accessible for most people.

For more information on quantization and to see how one can quantize models to require even less GPU VRAM memory than 4-bit, we recommend looking into the AutoGPTQ implementation.

As a conclusion, it is important to remember that model quantization trades improved memory efficiency against accuracy and in some cases inference time.

If GPU memory is not a constraint for your use case, there is often no need to look into quantization. However many GPUs simply can't run LLMs without quantization methods and in this case, 4-bit and 8-bit quantization schemes are extremely useful tools.

For more in-detail usage information, we strongly recommend taking a look at the Transformers Quantization Docs . Next, let's look into how we can improve computational and memory efficiency by using better algorithms and an improved model architecture.

2. Flash Attention: A Leap Forward

Today's top-performing LLMs share more or less the same fundamental architecture that consists of feed-forward layers, activation layers, layer normalization layers, and most crucially, self-attention layers.

Self-attention layers are central to Large Language Models (LLMs) in that they enable the model to understand the contextual relationships between input tokens. However, the peak GPU memory consumption for self-attention layers grows quadratically both in compute and memory complexity with number of input tokens (also called sequence length ) that we denote in the following by N N N . While this is not really noticeable for shorter input sequences (of up to 1000 input tokens), it becomes a serious problem for longer input sequences (at around 16000 input tokens).

Let's take a closer look. The formula to compute the output O \mathbf{O} O of a self-attention layer for an input X \mathbf{X} X of length N N N is:

O = Attn ( X ) = V × Softmax ( Q K T ) with Q = W q X , V = W v X , K = W k X \textbf{O} = \text{Attn}(\mathbf{X}) = \mathbf{V} \times \text{Softmax}(\mathbf{QK}^T) \text{ with } \mathbf{Q} = \mathbf{W}_q \mathbf{X}, \mathbf{V} = \mathbf{W}_v \mathbf{X}, \mathbf{K} = \mathbf{W}_k \mathbf{X} O = Attn ( X ) = V × Softmax ( QK T ) with Q = W q X , V = W v X , K = W k X X = ( x 1 , . . . x N ) \mathbf{X} = (\mathbf{x}_1, ... \mathbf{x}_{N}) X = ( x 1 , ... x N ) is thereby the input sequence to the attention layer. The projections Q \mathbf{Q} Q and K \mathbf{K} K will each consist of N N N vectors resulting in the Q K T \mathbf{QK}^T QK T being of size N 2 N^2 N 2 .

LLMs usually have multiple attention heads, thus doing multiple self-attention computations in parallel. Assuming, the LLM has 40 attention heads and runs in bfloat16 precision, we can calculate the memory requirement to store the Q K T \mathbf{QK^T} Q K T matrices to be 40 ∗ 2 ∗ N 2 40 * 2 * N^2 40 ∗ 2 ∗ N 2 bytes. For N = 1000 N=1000 N = 1000 only around 50 MB of VRAM are needed, however, for N = 16000 N=16000 N = 16000 we would need 19 GB of VRAM, and for N = 100 , 000 N=100,000 N = 100 , 000 we would need almost 1TB just to store the Q K T \mathbf{QK}^T QK T matrices.

Long story short, the default self-attention algorithm quickly becomes prohibitively memory-expensive for large input contexts.

As LLMs improve in text comprehension and generation, they are applied to increasingly complex tasks. While models once handled the translation or summarization of a few sentences, they now manage entire pages, demanding the capability to process extensive input lengths.

How can we get rid of the exorbitant memory requirements for large input lengths? We need a new way to compute the self-attention mechanism that gets rid of the Q K T QK^T Q K T matrix. Tri Dao et al. developed exactly such a new algorithm and called it Flash Attention .

In a nutshell, Flash Attention breaks the V × Softmax ( Q K T \mathbf{V} \times \text{Softmax}(\mathbf{QK}^T V × Softmax ( QK T ) computation apart and instead computes smaller chunks of the output by iterating over multiple softmax computation steps:

O i ← s i j a ∗ O i + s i j b ∗ V j × Softmax ( Q K i , j T ) for multiple i , j iterations \textbf{O}_i \leftarrow s^a_{ij} * \textbf{O}_i + s^b_{ij} * \mathbf{V}_{j} \times \text{Softmax}(\mathbf{QK}^T_{i,j}) \text{ for multiple } i, j \text{ iterations} O i ← s ij a ∗ O i + s ij b ∗ V j × Softmax ( QK i , j T ) for multiple i , j iterations

with s i j a s^a_{ij} s ij a and s i j b s^b_{ij} s ij b being some softmax normalization statistics that need to be recomputed for every i i i and j j j .

Please note that the whole Flash Attention is a bit more complex and is greatly simplified here as going in too much depth is out of scope for this notebook. The reader is invited to take a look at the well-written Flash Attention paper for more details.

The main takeaway here is:

By keeping track of softmax normalization statistics and by using some smart mathematics, Flash Attention gives numerical identical outputs compared to the default self-attention layer at a memory cost that only increases linearly with N N N .

Looking at the formula, one would intuitively say that Flash Attention must be much slower compared to the default self-attention formula as more computation needs to be done. Indeed Flash Attention requires more FLOPs compared to normal attention as the softmax normalization statistics have to constantly be recomputed (see paper for more details if interested)

However, Flash Attention is much faster in inference compared to default attention which comes from its ability to significantly reduce the demands on the slower, high-bandwidth memory of the GPU (VRAM), focusing instead on the faster on-chip memory (SRAM).

Essentially, Flash Attention makes sure that all intermediate write and read operations can be done using the fast on-chip SRAM memory instead of having to access the slower VRAM memory to compute the output vector O \mathbf{O} O .

In practice, there is currently absolutely no reason to not use Flash Attention if available. The algorithm gives mathematically the same outputs, and is both faster and more memory-efficient.

Let's look at a practical example.

Our OctoCoder model now gets a significantly longer input prompt which includes a so-called system prompt . System prompts are used to steer the LLM into a better assistant that is tailored to the users' task. In the following, we use a system prompt that will make OctoCoder a better coding assistant.

system_prompt = """Below are a series of dialogues between various people and an AI technical assistant.
The assistant tries to be helpful, polite, honest, sophisticated, emotionally aware, and humble but knowledgeable.
The assistant is happy to help with code questions and will do their best to understand exactly what is needed.
It also tries to avoid giving false or misleading information, and it caveats when it isn't entirely sure about the right answer.
That said, the assistant is practical really does its best, and doesn't let caution get too much in the way of being useful.

The Starcoder models are a series of 15.5B parameter models trained on 80+ programming languages from The Stack (v1.2) (excluding opt-out requests).
The model uses Multi Query Attention, was trained using the Fill-in-the-Middle objective, and with 8,192 tokens context window for a trillion tokens of heavily deduplicated data.

-----

Question: Write a function that takes two lists and returns a list that has alternating elements from each input list.

Answer: Sure. Here is a function that does that.

def alternating(list1, list2):
   results = []
   for i in range(len(list1)):
       results.append(list1[i])
       results.append(list2[i])
   return results

Question: Can you write some test cases for this function?

Answer: Sure, here are some tests.

assert alternating([10, 20, 30], [1, 2, 3]) == [10, 1, 20, 2, 30, 3]
assert alternating([True, False], [4, 5]) == [True, 4, False, 5]
assert alternating([], []) == []

Question: Modify the function so that it returns all input elements when the lists have uneven length. The elements from the longer list should be at the end.

Answer: Here is the modified function.

def alternating(list1, list2):
   results = []
   for i in range(min(len(list1), len(list2))):
       results.append(list1[i])
       results.append(list2[i])
   if len(list1) > len(list2):
       results.extend(list1[i+1:])
   else:
       results.extend(list2[i+1:])
   return results

-----
"""

For demonstration purposes, we duplicate the system by ten so that the input length is long enough to observe Flash Attention's memory savings. We append the original text prompt "Question: Please write a function in Python that transforms bytes to Giga bytes.\n\nAnswer: Here"

long_prompt = 10 * system_prompt + prompt

We instantiate our model again in bfloat16 precision.

model = AutoModelForCausalLM.from_pretrained("bigcode/octocoder", torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("bigcode/octocoder")

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

Let's now run the model just like before without Flash Attention and measure the peak GPU memory requirement and inference time.

import time

start_time = time.time()
result = pipe(long_prompt, max_new_tokens=60)[0]["generated_text"][len(long_prompt):]

print(f"Generated in {time.time() - start_time} seconds.")
result

Output :

Generated in 10.96854019165039 seconds.
Sure. Here is a function that does that.\n\ndef bytes_to_giga(bytes):\n   return bytes / 1024 / 1024 / 1024\n\nAnswer: Sure. Here is a function that does that.\n\ndef

We're getting the same output as before, however this time, the model repeats the answer multiple times until it's 60 tokens cut-off. This is not surprising as we've repeated the system prompt ten times for demonstration purposes and thus cued the model to repeat itself.

Note that the system prompt should not be repeated ten times in real-world applications - one time is enough!

Let's measure the peak GPU memory requirement.

bytes_to_giga_bytes(torch.cuda.max_memory_allocated())

Output :

37.668193340301514

As we can see the peak GPU memory requirement is now significantly higher than in the beginning, which is largely due to the longer input sequence. Also the generation takes a little over a minute now.

We call flush() to free GPU memory for our next experiment.

flush()

For comparison, let's run the same function, but enable Flash Attention instead. To do so, we convert the model to BetterTransformers and by doing so enabling PyTorch's SDPA self-attention which in turn is based on Flash Attention.

model.to_bettertransformer()

Now we run the exact same code snippet as before and under the hood Transformers will make use of Flash Attention.

start_time = time.time()
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    result = pipe(long_prompt, max_new_tokens=60)[0]["generated_text"][len(long_prompt):]

print(f"Generated in {time.time() - start_time} seconds.")
result

Output :

Optimizing your LLM in production