When a single user runs a local LLM, throughput doesn’t matter much. But when you’re serving 10, 100, or 1,000 concurrent requests — for an internal tool, API wrapper, or AI product — the inference engine becomes a critical bottleneck. vLLM is the open-source solution purpose-built for this problem.
What Makes vLLM Different
vLLM was developed at UC Berkeley and is built around two techniques that transformed LLM serving:
PagedAttention manages the KV (key-value) cache using virtual memory paging, similar to how an OS manages RAM. Traditional engines pre-allocate a fixed KV cache per request, wasting memory when sequences are shorter than the maximum. PagedAttention allocates cache in pages on demand, enabling far more concurrent requests from the same GPU.
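The on-demand allocation described above can be illustrated with a toy block allocator. This is a minimal sketch, not vLLM's actual implementation: the class name `PagedKVCache`, the 16-token block size, and the 2048-token maximum are all assumptions chosen for illustration.

```python
import math

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)


class PagedKVCache:
    """Toy block allocator: maps each sequence to a list of physical block ids."""

    def __init__(self, num_blocks: int, block_size: int = BLOCK_SIZE):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of block ids
        self.lengths = {}       # seq_id -> tokens stored so far

    def append_token(self, seq_id: str) -> None:
        """Reserve room for one more token, taking a new block only when needed."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length == len(table) * self.block_size:  # all current blocks are full
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


def preallocated_blocks(num_seqs: int, max_len: int, block_size: int = BLOCK_SIZE) -> int:
    """Blocks a fixed per-request reservation needs, regardless of actual length."""
    return num_seqs * math.ceil(max_len / block_size)


cache = PagedKVCache(num_blocks=64)
for seq_id, n_tokens in {"a": 40, "b": 5, "c": 100}.items():
    for _ in range(n_tokens):
        cache.append_token(seq_id)

paged_used = sum(len(table) for table in cache.block_tables.values())
print(paged_used)                                      # 11 blocks actually in use
print(preallocated_blocks(num_seqs=3, max_len=2048))   # 384 blocks if reserved up front
```

Three short sequences occupy 11 blocks here, where reserving the maximum length for each would tie up 384. The freed difference is what lets more requests share the same GPU.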
Continuous batching (also called iteration-level scheduling) admits new requests without waiting for the current batch to finish. With traditional static batching, a new request waits until every sequence in the running batch has completed, and short sequences hold idle slots until the longest one is done. Continuous batching reschedules at every decoding step and slots in a waiting request the moment a sequence finishes, which keeps GPU utilization high.
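A small simulation shows the scheduling difference. It is a sketch under simplifying assumptions: every decode step costs the same, prefill is ignored, and the request lengths are invented for illustration. The function names are hypothetical and not part of vLLM.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest sequence finishes before the next starts."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """A waiting request takes over a slot as soon as the sequence in it finishes."""
    waiting = deque(lengths)
    running = []  # remaining decode steps for each active sequence
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1  # one decode iteration for every active sequence
        running = [r - 1 for r in running if r > 1]
    return steps


lengths = [100, 10, 10, 10, 100, 10, 10, 10]  # output tokens per request
print(static_batching_steps(lengths, batch_size=4))      # 200
print(continuous_batching_steps(lengths, batch_size=4))  # 110
```

With the same eight requests and four slots, the static schedule needs 200 decode steps because each batch is held open by its one long request. The continuous schedule finishes in 110 because the second long request starts as soon as a short one frees its slot.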
The result: at production concurrency levels, vLLM can deliver an order of magnitude more throughput than naive Hugging Face Transformers serving. The project’s launch benchmarks reported up to 24x.
Installation
vLLM requires Python 3.9+ and an NVIDIA GPU with CUDA 11.8 or newer. AMD ROCm support is available but less mature.
# Install from PyPI (CUDA 12.1)
pip install vllm
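# For a specific CUDA version: the extra index supplies a matching PyTorch build,
# but the vLLM wheel on PyPI is compiled against one CUDA release, so check the
# installation docs for a wheel built against yours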
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu118
# Verify the install
python -c "import vllm; print(vllm.__version__)"
Docker is often the cleanest production deployment:
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model meta-llama/Meta-Llama-3-8B-Instruct
Launching an OpenAI-Compatible Server
vLLM’s primary interface is an OpenAI-compatible REST API. Drop it in as a replacement for any OpenAI-based tool by changing the base_url:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--host 0.0.0.0 \
--port 8000
Gated models on Hugging Face (Llama 3 is one) require an access token, so export it before launching:
export HF_TOKEN="hf_..."
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--host 0.0.0.0 \
--port 8000
Testing with curl
Once the server is running, verify with a curl request:
# Chat completions endpoint
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3-8B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain what PagedAttention is in two sentences."}
],
"max_tokens": 200,
"temperature": 0.7
}'
# List available models
curl http://localhost:8000/v1/models
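Model loading can take a while, so scripts often need to wait for the server before sending traffic. The sketch below polls the `/v1/models` endpoint shown above using only the standard library. The function names, the 600-second timeout, and the 2-second interval are illustrative assumptions.

```python
import time
import urllib.error
import urllib.request


def models_endpoint_ok(base_url: str = "http://localhost:8000/v1") -> bool:
    """True once GET <base_url>/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False  # server not accepting connections yet


def wait_until_ready(probe=models_endpoint_ok, timeout_s: float = 600.0,
                     interval_s: float = 2.0) -> bool:
    """Poll the probe until it succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False
```

Calling `wait_until_ready()` with no arguments blocks until the local server responds or ten minutes pass. The `probe` parameter exists so the loop can be exercised without a running server.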
Using with the OpenAI Python Client
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="not-required" # vLLM doesn't require an API key by default
)
response = client.chat.completions.create(
model="meta-llama/Meta-Llama-3-8B-Instruct",
messages=[{"role": "user", "content": "Write a haiku about inference speed"}],
max_tokens=100
)
print(response.choices[0].message.content)
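Since vLLM’s advantage shows up under concurrency, a client that sends many prompts at once is a more realistic test than a single call. The following is a minimal standard-library sketch. `BASE_URL` and `MODEL` are placeholders that must match how the server was launched, and the helper names are hypothetical.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost:8000/v1"          # assumed server address
MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # must match the server's --model value


def build_payload(prompt: str, max_tokens: int = 100) -> dict:
    """Chat-completions request body for a single user prompt."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def send_chat(prompt: str) -> str:
    """POST one prompt to the server and return the generated text."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


def fan_out(prompts, send=send_chat, max_workers: int = 16):
    """Send all prompts concurrently; results come back in prompt order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, prompts))
```

With the server running, `fan_out(["Summarize PagedAttention."] * 32)` keeps up to 16 requests in flight at a time. The server batches them together rather than processing them one by one.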
GPU Memory Utilization Flag
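By default, vLLM claims up to 90% of each GPU’s VRAM for the whole engine: model weights, activations, and the KV cache. Whatever remains after the weights are loaded becomes KV cache. Adjust this based on your model size and available memory: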
# Use 80% of GPU memory (leaves headroom for system)
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--gpu-memory-utilization 0.80
# For smaller VRAM GPUs, be conservative
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-7B-Instruct-v0.3 \
--gpu-memory-utilization 0.75 \
--max-model-len 4096 # Limit context to reduce KV cache size
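Reducing --max-model-len is often the most effective way to fit a model onto a smaller GPU: vLLM must be able to cache at least one sequence of that length, so a lower limit shrinks the minimum KV cache it needs. You sacrifice context length, and because each request can occupy fewer cache blocks, more requests fit in the same memory.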
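A back-of-the-envelope calculation makes the trade-off concrete. This sketch uses the published Llama 3 8B configuration (32 layers, 8 key-value heads, head dimension 128) and assumes an FP16 cache, a 24 GiB card, and roughly 15 GiB of weights. It ignores activation memory and runtime overhead, so real capacity will be lower.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """KV-cache bytes per token: one key and one value vector per layer per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_token_budget(gpu_bytes: float, utilization: float, weight_bytes: float,
                    per_token_bytes: int) -> int:
    """Rough upper bound on cacheable tokens; ignores activations and overhead."""
    available = gpu_bytes * utilization - weight_bytes
    return max(0, int(available // per_token_bytes))


GIB = 1024 ** 3

# Llama 3 8B: 32 layers, 8 KV heads (grouped-query attention), head dimension 128.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(per_token)               # 131072 bytes, i.e. 128 KiB per token
print(per_token * 8192 / GIB)  # 1.0 GiB for a single 8192-token sequence

# Assumed setup: 24 GiB card, --gpu-memory-utilization 0.90, ~15 GiB of FP16 weights.
budget = kv_token_budget(24 * GIB, 0.90, 15 * GIB, per_token)
print(budget, budget // 8192, budget // 4096)
```

Under these assumptions the cache holds roughly 54,000 tokens, enough for about six sequences at the full 8192-token length or thirteen at 4096. Because blocks are allocated on demand, the pool is shared, and typical short requests use far less than the maximum.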
Tensor Parallelism for Multi-GPU
For models that don’t fit on a single GPU, vLLM supports tensor parallelism — splitting the model across multiple GPUs:
# Split across 2 GPUs (a 70B model in FP16 needs about 140 GB for weights alone,
# so 2x RTX 4090 with 48 GB total requires a quantized checkpoint)
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-70B-Instruct \
--tensor-parallel-size 2 \
--host 0.0.0.0 \
--port 8000
# Split across 4 GPUs
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-70B-Instruct \
--tensor-parallel-size 4
Tensor parallelism performs best over NVLink. PCIe-connected GPUs (including the RTX 4090, which has no NVLink connector) work, but with higher inter-GPU communication overhead.
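Before choosing a --tensor-parallel-size, it helps to estimate how much weight memory lands on each GPU and to check that the attention heads split evenly. The sketch below is a rough estimate with hypothetical helper names. It assumes evenly sharded weights, 64 attention heads for Llama 3 70B (per its published config), and about 0.5 bytes per parameter for 4-bit quantization. KV cache and activations come on top.

```python
def weight_gb_per_gpu(params_billion: float, bytes_per_param: float, tp_size: int) -> float:
    """Approximate weight memory per GPU (GB) when weights are sharded evenly."""
    return params_billion * bytes_per_param / tp_size


def heads_divide_evenly(num_attention_heads: int, tp_size: int) -> bool:
    """Attention heads are split across ranks, so the count must be divisible."""
    return num_attention_heads % tp_size == 0


for tp in (2, 4):
    fp16 = weight_gb_per_gpu(70, 2.0, tp)  # FP16/BF16: 2 bytes per parameter
    int4 = weight_gb_per_gpu(70, 0.5, tp)  # 4-bit quantization: about 0.5 bytes
    print(f"TP={tp}: {fp16:.1f} GB FP16, {int4:.2f} GB 4-bit per GPU")

print(heads_divide_evenly(64, 4), heads_divide_evenly(64, 3))  # True False
```

At FP16, a 70B model needs about 70 GB per GPU with two GPUs and 35 GB with four. That is why 24 GB cards only become viable with quantized weights.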
Connecting to Open WebUI
Open WebUI provides a ChatGPT-like frontend that can connect to any OpenAI-compatible backend:
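# Run Open WebUI pointing at vLLM
# (--add-host makes host.docker.internal resolve on Linux hosts)
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \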
-e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
-e OPENAI_API_KEY=not-required \
--name open-webui \
ghcr.io/open-webui/open-webui:main
Navigate to http://localhost:3000 for a full-featured chat interface backed by your vLLM server.
Connecting to LangChain
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
base_url="http://localhost:8000/v1",
api_key="not-required",
model="meta-llama/Meta-Llama-3-8B-Instruct"
)
result = llm.invoke("What are the three laws of robotics?")
print(result.content)
Throughput Benchmarks
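The vLLM repository includes benchmark scripts for measuring your setup’s performance. Their location and flags change between releases, so check the script’s --help output against the commands below: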
# Download the benchmark script
wget https://raw.githubusercontent.com/vllm-project/vllm/main/benchmarks/benchmark_throughput.py
# Run against a dataset
python benchmark_throughput.py \
--backend vllm \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--dataset-name sharegpt \
--num-prompts 1000
Representative throughput numbers on RTX 4090 (single GPU):
| Model | Concurrent Requests | Tokens/Second |
|---|---|---|
| Llama 3 8B | 1 | ~80 tok/s |
| Llama 3 8B | 16 | ~450 tok/s |
| Llama 3 8B | 64 | ~800 tok/s |
| Llama 3 70B (2x 4090) | 8 | ~120 tok/s |
Higher concurrency produces dramatically higher aggregate throughput — the core value proposition of PagedAttention and continuous batching.
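Dividing the aggregate figures by the concurrency level shows the other side of the trade-off. This sketch reuses the single-GPU Llama 3 8B rows from the table and assumes the tokens-per-second column is the total across all in-flight requests.

```python
# (concurrent requests, aggregate tokens/second) from the Llama 3 8B rows above
rows = [(1, 80), (16, 450), (64, 800)]


def per_request_rate(aggregate_tok_s: float, concurrency: int) -> float:
    """Average generation speed each individual request sees."""
    return aggregate_tok_s / concurrency


baseline = rows[0][1]
for concurrency, aggregate in rows:
    print(f"{concurrency:>2} concurrent: "
          f"{aggregate / baseline:.1f}x aggregate, "
          f"{per_request_rate(aggregate, concurrency):.1f} tok/s per request")
```

Total throughput rises tenfold from 1 to 64 concurrent requests, while each individual request slows from about 80 to about 12.5 tokens per second. Size your concurrency limit against the per-request latency your users will tolerate.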
When to Use vLLM
vLLM is the right tool for:
- Internal APIs serving multiple team members
- AI products with concurrent users
- Batch processing large numbers of prompts
- Any scenario where GPU utilization matters
For single-user local inference, Ollama or llamafile is simpler. But the moment you’re serving more than one person, vLLM’s architecture pays off rapidly.