
vLLM: High-Throughput LLM Inference Server Setup Guide

Set up vLLM's OpenAI-compatible inference server with PagedAttention and continuous batching. Covers install, GPU config, multi-GPU, and benchmarks.


When a single user runs a local LLM, throughput doesn’t matter much. But when you’re serving 10, 100, or 1,000 concurrent requests — for an internal tool, API wrapper, or AI product — the inference engine becomes a critical bottleneck. vLLM is the open-source solution purpose-built for this problem.

What Makes vLLM Different

vLLM was developed at UC Berkeley and introduced two innovations that transformed LLM serving:

PagedAttention manages the KV (key-value) cache using virtual memory paging, similar to how an OS manages RAM. Traditional engines pre-allocate a fixed KV cache per request, wasting memory when sequences are shorter than the maximum. PagedAttention allocates cache in pages on demand, enabling far more concurrent requests from the same GPU.
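
To make the memory argument concrete, here is a toy sketch that compares how many token slots get reserved under up-front allocation versus block-by-block allocation. It is not vLLM's implementation: the block size, the sequence lengths, and the function names are illustrative assumptions.

```python
import math

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value, an assumption)


def blocks_needed(num_tokens, block_size=BLOCK_SIZE):
    """Blocks a sequence occupies when cache is handed out on demand."""
    return math.ceil(num_tokens / block_size)


def reserved_slots_preallocated(seq_lens, max_len):
    """Token slots reserved when every request gets max_len up front."""
    return len(seq_lens) * max_len


def reserved_slots_paged(seq_lens, block_size=BLOCK_SIZE):
    """Token slots reserved when each request holds only the blocks it uses."""
    return sum(blocks_needed(n, block_size) for n in seq_lens) * block_size


seq_lens = [100, 250, 30, 700]  # made-up actual lengths of four requests
max_len = 2048                  # hypothetical maximum sequence length

print(reserved_slots_preallocated(seq_lens, max_len))  # 8192
print(reserved_slots_paged(seq_lens))                  # 1104
```

In this made-up workload, paging wastes at most one partially filled block per request, while up-front allocation reserves several times more slots than the requests ever use.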

Continuous batching (also called iteration-level scheduling) admits new requests without waiting for a full batch to complete. Traditional static batching makes every request in a batch wait for the slowest one, and new arrivals wait for the whole batch to finish, even if most of its sequences are already done. Continuous batching slots in new requests the moment a sequence finishes, maximizing GPU utilization.
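
The scheduling difference can be illustrated with a small simulation. This is a simplified model rather than vLLM's scheduler: it assumes all requests arrive at once, every decode step costs the same, and the request lengths and function names are invented for the example.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batching_steps(lengths, batch_size):
    """Free slots are refilled from the queue before every decode step."""
    waiting = deque(lengths)
    running = []  # remaining decode steps for each in-flight request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # A request with one step left finishes now and frees its slot.
        running = [r - 1 for r in running if r > 1]
    return steps


# Decode steps needed per request (made-up numbers), with 4 batch slots.
lengths = [10, 2, 2, 2, 10, 2, 2, 2]
print(static_batching_steps(lengths, 4))      # 20
print(continuous_batching_steps(lengths, 4))  # 12
```

In the static case the short requests idle in their slots while the long one finishes. In the continuous case the second long request starts as soon as the first short ones complete.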

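The result: at production concurrency levels, vLLM commonly delivers an order of magnitude more throughput than naive Hugging Face Transformers serving (the vLLM team's launch benchmarks reported up to 24x). The exact gain depends on model, hardware, and workload.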

Installation

vLLM requires Python 3.9+ and an NVIDIA GPU with CUDA 11.8 or newer. AMD ROCm support is available but less mature.

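# Install from PyPI (prebuilt wheels target one default CUDA version)
pip install vllm

# For a different CUDA version (for example 11.8), the vLLM wheel itself must
# also be built for that version; the extra index only changes which PyTorch
# build pip pulls in. Check the vLLM installation docs for the matching wheel.
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu118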

# Verify the install
python -c "import vllm; print(vllm.__version__)"

Docker is often the cleanest production deployment:

# HF_TOKEN is only needed for gated models such as Llama 3
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3-8B-Instruct

Launching an OpenAI-Compatible Server

vLLM’s primary interface is an OpenAI-compatible REST API. Drop it in as a replacement for any OpenAI-based tool by changing the base_url:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000

For gated models on Hugging Face, set your token first:

export HF_TOKEN="hf_..."
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000

Testing with curl

Once the server is running, verify with a curl request:

# Chat completions endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain what PagedAttention is in two sentences."}
    ],
    "max_tokens": 200,
    "temperature": 0.7
  }'

# List available models
curl http://localhost:8000/v1/models

Using with the OpenAI Python Client

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-required"  # vLLM doesn't require an API key by default
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # must match the --model the server was started with
    messages=[{"role": "user", "content": "Write a haiku about inference speed"}],
    max_tokens=100
)
print(response.choices[0].message.content)

GPU Memory Utilization Flag

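By default, vLLM caps its memory use at 90% of each GPU's VRAM. That budget covers model weights, activations, and the KV cache; whatever remains after the weights are loaded goes to the KV cache. Adjust the fraction based on your model size and available memory: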

# Use 80% of GPU memory (leaves headroom for system)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.80

# For smaller VRAM GPUs, be conservative
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --gpu-memory-utilization 0.75 \
  --max-model-len 4096  # Limit context to reduce KV cache size

Reducing --max-model-len is often the most effective way to fit a model onto a smaller GPU. It does not shrink the weights, but it lowers the KV cache a single full-length sequence can claim, so you trade context length for more concurrent request capacity.
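
A back-of-the-envelope calculation shows why context length matters so much. The sketch below is an estimate, not a vLLM API: it ignores activations and runtime overhead, the 24 GiB card and 15 GiB weight figure are assumptions, and the architecture values are taken from the published Llama 3 8B config as I understand it.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: each layer stores one key and one value vector per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def full_length_sequences_that_fit(gpu_gib, utilization, weights_gib,
                                   max_model_len, bytes_per_token):
    """Rough count of max-length sequences the leftover budget can hold."""
    kv_budget_bytes = (gpu_gib * utilization - weights_gib) * 2**30
    return int(kv_budget_bytes // (bytes_per_token * max_model_len))


# Assumed architecture: 32 layers, 8 KV heads, head dimension 128, FP16.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(per_token)  # 131072 bytes, i.e. 128 KiB per token

# Hypothetical 24 GiB GPU at 0.90 utilization with ~15 GiB of FP16 weights.
for ctx in (8192, 4096):
    print(ctx, full_length_sequences_that_fit(24, 0.90, 15, ctx, per_token))
```

Under these assumptions, halving the context length roughly doubles the number of worst-case sequences that fit. Real capacity is usually higher, because most requests never reach the maximum length.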

Tensor Parallelism for Multi-GPU

For models that don’t fit on a single GPU, vLLM supports tensor parallelism — splitting the model across multiple GPUs:

# Split across 2 GPUs. Note: a 70B model needs roughly 140 GB for FP16 weights
# alone, so 2x RTX 4090 (48 GB total) only works with a quantized checkpoint.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 2 \
  --host 0.0.0.0 \
  --port 8000

# Split across 4 GPUs
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 4

Tensor parallelism performs best over NVLink. PCIe-connected GPUs work, but inter-GPU communication overhead is higher. The tensor-parallel size must also divide the model's number of attention heads evenly.
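
Before picking a --tensor-parallel-size, it helps to estimate the weight footprint per GPU and to check which sizes split the attention heads evenly. The helper names below are hypothetical, the 64-head figure is assumed from the published Llama 3 70B config, and the estimate covers weights only (no KV cache or activations).

```python
def weights_gib_per_gpu(num_params_billion, dtype_bytes, tp_size):
    """Approximate weight memory each GPU holds under tensor parallelism."""
    return num_params_billion * 1e9 * dtype_bytes / tp_size / 2**30


def valid_tp_sizes(num_attention_heads, max_gpus):
    """Tensor-parallel sizes that split the attention heads evenly."""
    return [n for n in range(1, max_gpus + 1) if num_attention_heads % n == 0]


# Assumed: 70B parameters, FP16 (2 bytes per parameter), 64 attention heads.
print(valid_tp_sizes(64, 8))  # [1, 2, 4, 8]
for tp in (2, 4, 8):
    print(tp, round(weights_gib_per_gpu(70, 2, tp), 1))
```

With these numbers, two GPUs each need about 65 GiB for weights alone, which is why unquantized 70B models are typically split across 80 GB cards or across four or more smaller ones.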

Connecting to Open WebUI

Open WebUI provides a ChatGPT-like frontend that can connect to any OpenAI-compatible backend:

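# Run Open WebUI pointing at vLLM
# (--add-host lets the container resolve host.docker.internal on Linux)
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
  -e OPENAI_API_KEY=not-required \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main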

Navigate to http://localhost:3000 for a full-featured chat interface backed by your vLLM server.

Connecting to LangChain

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-required",
    model="meta-llama/Meta-Llama-3-8B-Instruct"
)

result = llm.invoke("What are the three laws of robotics?")
print(result.content)

Throughput Benchmarks

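The vLLM repository includes benchmarking scripts to measure your setup’s performance:

# Download the benchmark script (paths and flags on main change between
# releases; prefer the copy that matches your installed vLLM version)
wget https://raw.githubusercontent.com/vllm-project/vllm/main/benchmarks/benchmark_throughput.py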

# Run against a dataset
python benchmark_throughput.py \
  --backend vllm \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --dataset-name sharegpt \
  --num-prompts 1000

Representative throughput numbers on RTX 4090 (single GPU unless noted):

Model                    Concurrent Requests    Tokens/Second
Llama 3 8B               1                      ~80 tok/s
Llama 3 8B               16                     ~450 tok/s
Llama 3 8B               64                     ~800 tok/s
Llama 3 70B (2x 4090)    8                      ~120 tok/s

Higher concurrency produces dramatically higher aggregate throughput — the core value proposition of PagedAttention and continuous batching.
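
For a quick sanity check against your own server, a small load generator is enough. The sketch below uses only the Python standard library and assumes the server from this guide is listening on localhost:8000. The prompts, the model name, and the function names are placeholders, and it measures end-to-end aggregate output tokens per second rather than reproducing the official benchmark.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost:8000/v1"          # assumed server address
MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # must match the served model


def one_request(prompt, max_tokens=128):
    """Send one chat completion and return the number of generated tokens."""
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        payload = json.load(resp)
    # OpenAI-style responses report token counts under "usage".
    return payload["usage"]["completion_tokens"]


def aggregate_throughput(token_counts, elapsed_seconds):
    """Total generated tokens divided by wall-clock time."""
    return sum(token_counts) / elapsed_seconds


def run(concurrency, num_requests=64):
    """Fire num_requests prompts with the given number of parallel workers."""
    prompts = [f"Write one sentence about the number {i}."
               for i in range(num_requests)]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        counts = list(pool.map(one_request, prompts))
    return aggregate_throughput(counts, time.perf_counter() - start)
```

With the server running, calling run(1), run(16), and run(64) in turn should show aggregate tokens per second rising with concurrency, although the absolute numbers will differ from the table above.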

When to Use vLLM

vLLM is the right tool for:

  • Internal APIs serving multiple team members
  • AI products with concurrent users
  • Batch processing large numbers of prompts
  • Any scenario where GPU utilization matters

For single-user local inference, Ollama or llamafile is simpler. But the moment you’re serving more than one person, vLLM’s architecture pays off rapidly.
