
Deploying Llama2 on A100 GPUs using vLLM

Meta’s Llama2 is a state-of-the-art, open-weight large language model that you can host yourself and use for commercial purposes. Its openly released weights and permissive commercial license mean that the open-source community has jumped into improving it via fine-tuned variants, quantization, and other optimizations.

You will need to request access to the model via [Meta’s signup form](https://ai.meta.com/resources/models-and-libraries/llama-downloads/), and once granted access you will be able to download it via Hugging Face. Meta’s documentation suggests serving via TorchServe or Text Generation Inference; however, we are going to use the superpower that is the open-source community: vLLM.

vLLM is a distributed inference and serving library, which provides:

  • State-of-the-art serving throughput

  • Efficient management of attention key and value memory with PagedAttention

  • Continuous batching of incoming requests

  • Optimized CUDA kernels

  • Seamless integration with popular HuggingFace models

  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more

  • Tensor parallelism support for distributed inference

  • Streaming outputs

  • OpenAI-compatible API server

The key features we are interested in are PagedAttention and Continuous batching of incoming requests.

PagedAttention is a memory-optimization technique based on the classic idea of virtual memory and paging in operating systems. It allows the system to batch more sequences together, increase GPU utilization, and thereby significantly increase throughput. There is a great writeup about it on the [vLLM blog](https://vllm.ai/) and in its [academic paper](https://arxiv.org/abs/2309.06180).
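
As a rough mental model (an illustration only, not vLLM’s actual data structures), each sequence’s KV cache is split into fixed-size blocks, and a per-sequence block table maps logical block indices to physical blocks, much like an OS page table:

```python
# Toy illustration of the PagedAttention idea -- not vLLM's real internals.
# A sequence's KV cache lives in fixed-size blocks; a per-sequence "block
# table" maps logical block order to physical block ids, so the blocks for
# one sequence need not be contiguous in GPU memory.
BLOCK_SIZE = 2                # toy value; real blocks hold more tokens

physical_blocks = {}          # physical_block_id -> cached tokens
free_blocks = list(range(8))  # pool of free physical block ids
block_tables = {}             # sequence_id -> [physical_block_id, ...]

def append_token(seq_id, token):
    table = block_tables.setdefault(seq_id, [])
    # Allocate a new physical block only when the last one is full.
    if not table or len(physical_blocks[table[-1]]) == BLOCK_SIZE:
        block_id = free_blocks.pop(0)
        physical_blocks[block_id] = []
        table.append(block_id)
    physical_blocks[table[-1]].append(token)

for t in ["San", "Francisco", "is", "a"]:
    append_token("seq-0", t)
print(block_tables["seq-0"])  # e.g. [0, 1] -- two blocks via the block table
```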

Continuous batching of incoming requests implements iteration-level scheduling of inference batches, yielding higher GPU utilization and [up to 23x throughput in LLM inference](https://www.anyscale.com/blog/continuous-batching-llm-inference).
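
Conceptually (again a toy sketch, not vLLM’s real scheduler), the server re-evaluates the batch after every decoding step, evicting finished sequences and admitting waiting requests so the GPU never idles on a half-empty batch:

```python
# Toy sketch of iteration-level scheduling: after every decoding step,
# finished sequences leave the batch and waiting requests join it.
from collections import deque

waiting = deque(["req-1", "req-2", "req-3", "req-4", "req-5"])
targets = {"req-1": 2, "req-2": 4, "req-3": 3, "req-4": 1, "req-5": 2}
running = {}          # request_id -> tokens generated so far
MAX_BATCH = 2

def decode_step(batch):
    # Stand-in for one forward pass that appends one token to every sequence.
    return {req: count + 1 for req, count in batch.items()}

step = 0
while waiting or running:
    # Admit new requests whenever the batch has free slots, instead of
    # waiting for the whole batch to drain.
    while waiting and len(running) < MAX_BATCH:
        running[waiting.popleft()] = 0
    running = decode_step(running)
    step += 1
    # Evict any sequence that reached its (toy) target length this iteration.
    for req in [r for r, n in running.items() if n >= targets[r]]:
        print(f"step {step}: {req} finished")
        del running[req]
```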

We are going to use runpod.io to run Llama2 70B. You need roughly 160GB of VRAM, so either 2x A100 80GB GPUs or 4x A100 40GB GPUs. Runpod.io comes with a preinstalled environment containing NVIDIA drivers and configures a reverse proxy to serve HTTPS over selected ports.
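
As a sanity check on the ~160GB figure (assuming fp16/bf16 weights at 2 bytes per parameter):

```python
# Back-of-the-envelope VRAM estimate for Llama2 70B (assumes fp16/bf16 weights).
params = 70e9
bytes_per_param = 2                            # fp16/bf16
weights_gb = params * bytes_per_param / 1e9
print(f"weights alone: ~{weights_gb:.0f} GB")  # ~140 GB
# On top of the weights, vLLM reserves GPU memory for the KV cache and
# activations, which is why ~160GB of total VRAM is the practical target.
```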

1) Generate a Hugging Face token

2) Spin up a machine with 2x A100 80GB, configure enough disk space to download Llama2 (suggested 400GB of disk space), and configure a port to serve and proxy on (e.g. 8000)

3) SSH into your machine and run:

```bash
pip install --upgrade huggingface_hub vllm

huggingface-cli login --token your_token

nvidia-smi
```

We can then test loading a model in a Python shell; set tensor_parallel_size to the number of GPUs you have.

```python
from vllm import LLM, SamplingParams

# tensor_parallel_size should match the number of GPUs you have
llm = LLM(model="meta-llama/Llama-2-70b-chat-hf", tensor_parallel_size=2)

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Run inference and print one completion per prompt
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, output.outputs[0].text)
```

This will download the model from Hugging Face, save it to disk, load it onto the GPUs, and run inference. You can measure GPU memory usage and utilization using nvidia-smi.
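
If you prefer to check this from the same Python shell, you can shell out to nvidia-smi’s standard query flags (this reports the same information as running the tool directly):

```python
# Query per-GPU memory usage and utilization from inside Python.
import subprocess

result = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=index,memory.used,memory.total,utilization.gpu",
     "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```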

To serve the inference as an endpoint, vLLM provides a FastAPI server as a template. It can be started as:

```bash
python -m vllm.entrypoints.api_server --host 0.0.0.0 --model "meta-llama/Llama-2-70b-chat-hf" --tensor-parallel-size 2
```

and tested as:

```bash
curl https://0gwma6jvrcbjza-8000.proxy.runpod.net/generate \
    -d '{
        "prompt": "San Francisco is a",
        "use_beam_search": true,
        "n": 4,
        "temperature": 0
    }'
```
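
The same request can be issued from Python with the requests library; the exact JSON layout of the response can vary between vLLM versions, so inspect it before relying on a particular field:

```python
# The same request as the curl example above, issued from Python.
import requests

resp = requests.post(
    "https://0gwma6jvrcbjza-8000.proxy.runpod.net/generate",
    json={
        "prompt": "San Francisco is a",
        "use_beam_search": True,
        "n": 4,
        "temperature": 0,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json())
```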

Note, this endpoint doesn’t have any authentication built in; however, as it’s a FastAPI app, it should be fairly straightforward to add token or JWK auth.
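
One possible approach (a hypothetical sketch, not part of vLLM itself) is to run a small FastAPI proxy on the exposed port that checks a static bearer token and forwards valid requests to the vLLM server listening locally; httpx is an extra dependency here, and the token name and route are illustrative:

```python
# Hypothetical sketch: a thin FastAPI proxy that enforces a static bearer
# token and forwards valid requests to the vLLM server on localhost:8000.
# API_TOKEN and the forwarding approach are illustrative, not part of vLLM.
import os

import httpx
from fastapi import FastAPI, HTTPException, Request

API_TOKEN = os.environ["API_TOKEN"]
app = FastAPI()

@app.post("/generate")
async def generate(request: Request):
    if request.headers.get("Authorization") != f"Bearer {API_TOKEN}":
        raise HTTPException(status_code=401, detail="invalid token")
    payload = await request.json()
    async with httpx.AsyncClient(timeout=120) as client:
        upstream = await client.post("http://127.0.0.1:8000/generate", json=payload)
    return upstream.json()
```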

This post is licensed under CC BY 4.0 by the author.