Meta’s Llama 2 is a state-of-the-art open-weight large language model that you can host yourself and use for commercial purposes. Its openly released weights and permissive commercial license mean that the open source community has jumped into improving it via fine-tuned variants, quantization and other optimizations.
You will need to request access to the model via [Meta’s signup form](https://ai.meta.com/resources/models-and-libraries/llama-downloads/), and once granted access you will be able to download it from Hugging Face. Meta’s documentation suggests serving via TorchServe or Text Generation Inference; however, we are going to use the superpower that is the open source community - vLLM.
vLLM is a distributed inference and serving library, which provides:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Optimized CUDA kernels
- Seamless integration with popular HuggingFace models
- High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
- Tensor parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
The key features we are interested in are PagedAttention and Continuous batching of incoming requests.
PagedAttention is a memory optimization technique based on the classic idea of virtual memory and paging in operating systems. It allows the system to batch more sequences together, increasing GPU utilization and thereby significantly increasing throughput. There is a great writeup about it on the [vLLM blog](https://vllm.ai/) and in its [academic paper](https://arxiv.org/abs/2309.06180).
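To make the paging analogy concrete, here is a toy sketch (my own illustration, not vLLM’s actual data structures) of a block table that maps a sequence’s KV-cache positions onto whichever fixed-size physical blocks happen to be free, so memory is claimed on demand rather than reserved contiguously up front:

# Toy illustration of the paging idea behind PagedAttention -- not vLLM's real code.
BLOCK_SIZE = 16  # tokens per physical block (illustrative value)

class ToyPagedKVCache:
    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical block ids
        self.seq_lengths: dict[int, int] = {}         # seq_id -> tokens stored

    def append_token(self, seq_id: int) -> None:
        """Reserve KV-cache space for one more token of a sequence."""
        length = self.seq_lengths.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:               # current block full (or first token)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            block = self.free_blocks.pop()         # grab any free physical block
            self.block_tables.setdefault(seq_id, []).append(block)
        self.seq_lengths[seq_id] = length + 1

    def free_sequence(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lengths.pop(seq_id, None)

cache = ToyPagedKVCache(num_physical_blocks=8)
for _ in range(40):           # a 40-token sequence only occupies ceil(40/16) = 3 blocks
    cache.append_token(seq_id=0)
print(cache.block_tables[0])  # [7, 6, 5]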
Continuous batching of incoming requests implements iteration-level scheduling of inference batches, yielding higher GPU utilization and [up to 23x throughput in LLM inference](https://www.anyscale.com/blog/continuous-batching-llm-inference).
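Roughly, iteration-level scheduling means the batch is re-formed on every decoding step rather than waiting for the slowest request to finish. The following toy simulation (illustrative only, not vLLM’s scheduler) shows finished requests leaving the batch immediately and queued ones taking their slots:

# Toy simulation of iteration-level (continuous) batching -- illustrative only.
from collections import deque

MAX_BATCH = 4
waiting = deque([("req%d" % i, 3 + 2 * i) for i in range(6)])  # (id, tokens to generate)
running: dict[str, int] = {}

step = 0
while waiting or running:
    # Fill any free slots in the batch from the waiting queue.
    while waiting and len(running) < MAX_BATCH:
        req_id, remaining = waiting.popleft()
        running[req_id] = remaining

    # One decoding iteration: every running request generates one token.
    for req_id in list(running):
        running[req_id] -= 1
        if running[req_id] == 0:   # finished requests leave the batch at once,
            del running[req_id]    # freeing the slot for the next waiting request

    step += 1

print(f"served 6 requests in {step} decoding steps")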
We are going to use runpod.io to run Llama 2 70B - you need around 160GB of VRAM (the fp16 weights alone are roughly 70B parameters x 2 bytes ≈ 140GB, before the KV cache), so either 2x A100 80GB GPUs or 4x A100 40GB GPUs. Runpod.io comes with a preinstalled environment containing NVIDIA drivers and configures a reverse proxy to serve HTTPS over selected ports.
1) Generate a Hugging Face token
2) Spin up a 2x A100 80GB machine, configure enough disk space to download Llama 2 (400GB suggested), and configure a port to serve and proxy on (e.g. 8000)
3) SSH into your machine and run
pip install --upgrade huggingface_hub vllm    # install vLLM and the Hugging Face hub client
huggingface-cli login --token your_token      # authenticate so the gated Llama 2 weights can be downloaded
nvidia-smi                                    # confirm the GPUs are visible
We can then test loading a model in a Python shell; set tensor_parallel_size to the number of GPUs you have.
from vllm import LLM, SamplingParams

# Load the model across both GPUs (tensor_parallel_size = number of GPUs)
llm = LLM(model="meta-llama/Llama-2-70b-chat-hf", tensor_parallel_size=2)

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Run batched inference and print each prompt with its completion
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, output.outputs[0].text)
This will download the model from Hugging Face, cache it to disk, load it onto the GPUs and run inference. You can measure the GPU memory usage and utilization using nvidia-smi.
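If you prefer to check memory from inside Python instead (torch is pulled in as a vLLM dependency), something like this should work:

# Report per-GPU memory usage from Python; nvidia-smi gives the same information.
import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)   # bytes, device-wide (includes other processes)
    used_gb = (total - free) / 1024**3
    print(f"GPU {i}: {used_gb:.1f} / {total / 1024**3:.1f} GiB used")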
To serve the inference as an endpoint, vLLM provides a FastAPI server as a template. It can be started as
python -m vllm.entrypoints.api_server --host 0.0.0.0 --model "meta-llama/Llama-2-70b-chat-hf" --tensor-parallel-size 2
and tested as
curl https://0gwma6jvrcbjza-8000.proxy.runpod.net/generate \
-d '{
"prompt": "San Francisco is a",
"use_beam_search": true,
"n": 4,
"temperature": 0
}'
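The same test from Python might look like the sketch below. The hostname is just the example pod URL from above, so substitute your own, and the prompts are sent concurrently so the server's continuous batching can group them into one running batch:

# Rough Python equivalent of the curl test above -- replace the URL with your own pod's proxy address.
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "https://0gwma6jvrcbjza-8000.proxy.runpod.net/generate"

def generate(prompt: str) -> list[str]:
    resp = requests.post(URL, json={"prompt": prompt, "n": 1, "temperature": 0.8, "max_tokens": 64})
    resp.raise_for_status()
    return resp.json()["text"]   # the demo server returns its completions under a "text" field

prompts = ["San Francisco is a", "The capital of France is", "The future of AI is"]

# Concurrent requests let vLLM's continuous batching serve them together.
with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    for texts in pool.map(generate, prompts):
        print(texts)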
Note, this endpoint doesn’t have any authentication built in; however, as it’s a FastAPI app, it should be fairly straightforward to add token or JWT auth.
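As a minimal sketch of the token option (illustrative only - this is not part of vLLM’s api_server; you would wire the dependency into its app, or into a small proxy app sitting in front of it):

# Minimal static bearer-token auth for a FastAPI app -- illustrative sketch, not vLLM code.
import os
import secrets

from fastapi import Depends, FastAPI, HTTPException
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer

API_TOKEN = os.environ["API_TOKEN"]   # export a shared secret before starting the server
bearer = HTTPBearer()

def require_token(creds: HTTPAuthorizationCredentials = Depends(bearer)) -> None:
    # Constant-time comparison avoids leaking the token via timing differences.
    if not secrets.compare_digest(creds.credentials, API_TOKEN):
        raise HTTPException(status_code=401, detail="invalid token")

# Registering the dependency app-wide protects every route it serves.
app = FastAPI(dependencies=[Depends(require_token)])

@app.get("/health")
def health() -> dict:
    return {"status": "ok"}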