vLLM
High-throughput LLM inference and serving engine. PagedAttention rethinks KV-cache memory management, delivering up to 24× higher throughput than HuggingFace Transformers.
What is vLLM?
A high-throughput and memory-efficient inference engine that makes LLM serving fast, affordable, and scalable.
The Memory Management Problem
Large language models accumulate an attention key-value (KV) cache during generation. Traditional systems pre-allocate this cache for the maximum sequence length, wasting 60-80% of GPU memory on sequences that never reach that length.
vLLM's PagedAttention solves this by storing KV-cache in non-contiguous memory blocks. Memory is allocated on-demand as tokens are generated, eliminating waste and enabling much higher batch sizes.
The result? 2-4× more concurrent requests per GPU, translating directly to 2-4× lower cost per token in production.
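The on-demand block allocation can be sketched in a few lines of Python. This is a toy allocator, not vLLM's actual code; the class and method names are invented for illustration, though 16 tokens per block matches vLLM's default block size.

```python
# Toy sketch of PagedAttention-style block allocation (not vLLM's real code).
# KV-cache for each sequence lives in fixed-size blocks; a per-sequence block
# table maps logical block indices to physical blocks allocated on demand.

BLOCK_SIZE = 16  # tokens per KV block (vLLM's default block size is 16)

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}               # seq_id -> [physical block ids]

    def append_tokens(self, seq_id, num_tokens):
        """Ensure the sequence has enough blocks for num_tokens tokens."""
        table = self.block_tables.setdefault(seq_id, [])
        needed = -(-num_tokens // BLOCK_SIZE)  # ceil division
        while len(table) < needed:
            table.append(self.free.pop())      # allocate on demand
        return table

alloc = BlockAllocator(num_blocks=64)
table = alloc.append_tokens("seq-0", num_tokens=40)  # 40 tokens -> 3 blocks
print(len(table), 64 - len(alloc.free))  # → 3 3
```

Because blocks are only claimed as tokens are actually generated, memory that a pre-allocating system would reserve for the worst case stays free for other sequences in the batch.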
API Request
OpenAI-compatible REST API
Scheduler
Continuous batching, priorities
LLM Engine
PagedAttention, model execution
Token Streaming
Real-time SSE response
Under the Hood
A deep dive into vLLM's modular, high-performance architecture
OpenAI-Compatible API
REST endpoints for drop-in replacement
AsyncLLMEngine
Async interface for direct integration
Streaming Support
SSE for real-time streaming
Multi-Model Serving
Serve multiple models from single deployment
Execution Loop
Main step() loop coordinating iterations
Model Workers
GPU processes holding model weights
Tokenizer
HuggingFace tokenizers
Sampling
Temperature, top-p, top-k, beam search
Continuous Batching
Dynamic request addition/removal
Preemption
Pause low-priority requests
Priority Queues
High/low priority for SLAs
Prefix Caching
Reuse KV-cache for common prefixes
PagedAttention
Non-contiguous KV-cache in fixed blocks
Block Tables
Logical-to-physical block mapping
Copy-on-Write
Efficient sharing for beam search
GPU↔CPU Swapping
Move blocks between GPU/CPU
Custom Attention Kernels
Optimized CUDA for paged KV-cache
Flash Attention
Memory-efficient tiled attention
Tensor Parallelism
Multi-GPU with NCCL
CUDA Graphs
Capture and replay
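The prefix caching listed above can be sketched with a toy hash-keyed block cache. This is illustrative Python, not vLLM's implementation; the block size and helper names are invented. The key idea: a full block is identified by a hash of the entire token prefix up to and including that block, so two requests sharing a system prompt map to the same physical blocks.

```python
# Toy sketch of hash-based prefix caching (illustrative, not vLLM's real code).
# Full blocks of prompt tokens are keyed by a hash of all tokens up to and
# including that block; identical prefixes across requests reuse the same block.

BLOCK_SIZE = 4  # small block size to keep the example readable

def block_hashes(tokens):
    """One hash per *full* block, covering the whole prefix up to that block."""
    return [hash(tuple(tokens[:end]))
            for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE)]

cache = {}  # prefix hash -> physical block id

def lookup_or_allocate(tokens):
    reused, allocated = 0, 0
    for h in block_hashes(tokens):
        if h in cache:
            reused += 1            # KV for this block was already computed
        else:
            cache[h] = len(cache)  # pretend to allocate and fill a new block
            allocated += 1
    return reused, allocated

system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]
print(lookup_or_allocate(system_prompt + [9, 10, 11, 12]))   # (0, 3)
print(lookup_or_allocate(system_prompt + [13, 14, 15, 16]))  # (2, 1)
```

The second request recomputes KV only for its final block; the two shared system-prompt blocks are served from cache, which is why long common prefixes (system prompts, few-shot examples) benefit most.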
The Magic Inside
Four key innovations that make vLLM one of the fastest LLM serving engines
PagedAttention
Virtual memory for KV-cache
Traditional LLM serving pre-allocates KV-cache for the maximum sequence length, wasting 60-80% of that memory to fragmentation. PagedAttention stores the KV-cache in fixed-size blocks allocated on demand.
Continuous Batching
Dynamic request scheduling
Traditional static batching waits for every request in a batch to complete before admitting new ones. Continuous batching adds and removes requests between decode iterations, keeping the GPU fully utilized.
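A toy simulation makes the difference concrete. This is not vLLM's scheduler, just a minimal sketch of the idea: each request needs a fixed number of decode steps, and a finished request's batch slot is refilled immediately rather than at the end of the batch.

```python
# Toy simulation of continuous batching (illustrative, not vLLM's scheduler).
# New requests join the running batch between decode iterations instead of
# waiting for the whole current batch to finish, so the GPU stays busy.
from collections import deque

waiting = deque([("req-0", 3), ("req-1", 1), ("req-2", 2)])  # (id, tokens left)
running, finished, max_batch = [], [], 2

while waiting or running:
    # Admit new requests up to the batch limit (between iterations).
    while waiting and len(running) < max_batch:
        running.append(list(waiting.popleft()))
    # One decode iteration: every running sequence generates one token.
    for seq in running:
        seq[1] -= 1
    # Completed sequences leave immediately, freeing a batch slot.
    for seq in [s for s in running if s[1] == 0]:
        running.remove(seq)
        finished.append(seq[0])

print(finished)  # → ['req-1', 'req-0', 'req-2']
```

Note that req-2 starts decoding as soon as req-1 finishes, in the middle of req-0's generation; with static batching it would have waited for the entire first batch to drain.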
CUDA Graphs
Eliminate kernel launch overhead
Each iteration launches hundreds of CUDA kernels. CUDA Graphs capture the entire sequence once, then replay with minimal CPU overhead, reducing latency by 10-20%.
Speculative Decoding
Parallel token verification
Large models are slow because tokens are generated one at a time. Speculative decoding uses a small draft model to propose several tokens, which the target model verifies in a single parallel pass. Accepted tokens skip the expensive sequential decode steps.
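The acceptance rule can be sketched for the greedy case (real implementations use probabilistic rejection sampling over both models' distributions; this simplified version just compares argmax tokens, and the function name is invented):

```python
# Toy sketch of speculative decoding acceptance (greedy variant, illustrative).
# A small draft model proposes k tokens; the target model scores all k+1
# positions in one parallel forward pass and keeps the longest matching prefix,
# replacing the first mismatch (or appending one bonus token) from its own output.

def verify(draft_tokens, target_tokens):
    """target_tokens[i] is what the target model would emit at position i,
    given the prompt plus draft_tokens[:i]. Greedy accept/reject."""
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            accepted.append(t)  # first mismatch: take the target's token, stop
            return accepted
        accepted.append(d)
    # All drafts accepted: the target's extra position yields one bonus token.
    accepted.append(target_tokens[len(draft_tokens)])
    return accepted

# Draft proposes 4 tokens; the target agrees on the first two.
print(verify([5, 9, 2, 7], [5, 9, 4, 7, 1]))  # → [5, 9, 4]
```

Either way the output is guaranteed to match what the target model alone would have produced; the speedup comes from emitting several tokens per expensive target forward pass whenever the draft guesses well.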
Request Lifecycle
Step-by-step journey of a request through vLLM
API Receipt
Parse request JSON. Validate parameters. Queue for processing.
Tokenization
Convert prompt text to token IDs using model tokenizer.
Scheduling
Allocate memory blocks. Join batch. Handle priorities.
Prefill
Process prompt tokens. Build KV-cache. Compute first token.
Decode
Generate iteratively. Update KV-cache. Sample next token.
Stream Output
Detokenize. Stream via SSE. Return completion.
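The final streaming step emits OpenAI-style server-sent events, one `data: {json}` line per chunk, terminated by `data: [DONE]`. A minimal client-side parser might look like this (the sample lines are illustrative, not captured server output; chat endpoints put text under `delta.content` instead of `text`):

```python
# Sketch of parsing an OpenAI-style SSE stream as emitted by a vLLM server.
# Each event is a "data: {json}" line; the stream ends with "data: [DONE]".
import json

def collect_text(sse_lines):
    text = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines
        payload = line[len("data: "):]
        if payload.strip() == "[DONE]":
            break
        chunk = json.loads(payload)
        text.append(chunk["choices"][0]["text"])
    return "".join(text)

stream = [
    'data: {"choices": [{"text": "Hello"}]}',
    "",
    'data: {"choices": [{"text": ", world"}]}',
    "data: [DONE]",
]
print(collect_text(stream))  # → Hello, world
```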
Real-World Results
Throughput comparison across LLM serving solutions
Optimized For Your Workload
Chat Applications
Interactive chatbots and assistants. Low latency streaming with high concurrency.
Batch Processing
Offline processing of large datasets. Maximum throughput for document analysis.
Production Serving
Scalable API endpoints. OpenAI-compatible for easy integration.
Docker Deployment
Containerized inference with GPU support and optimized configurations
Quick Start
Official Docker Image
docker run --gpus all \
  -d \
  --name vllm-server \
  -p 8000:8000 \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-2-7b-chat-hf \
  --max-model-len 4096
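Once the container is up, the OpenAI-compatible endpoint answers plain HTTP. A minimal Python client sketch (stdlib only; this builds the request but actually sending it requires the server running on localhost:8000):

```python
# Minimal client sketch for the OpenAI-compatible /v1/completions endpoint
# started by the docker run command above.
import json
import urllib.request

def completion_request(prompt, base_url="http://localhost:8000"):
    body = json.dumps({
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "prompt": prompt,
        "max_tokens": 64,
        "temperature": 0.7,
    }).encode()
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = completion_request("Explain PagedAttention in one sentence.")
# To actually send it: urllib.request.urlopen(req).read()
print(req.full_url)  # → http://localhost:8000/v1/completions
```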
Production Dockerfile
Optimized for Production
FROM vllm/vllm-openai:v0.4.0 AS base

WORKDIR /app

ENV VLLM_USAGE_STATS=0 \
    HF_HOME=/app/models \
    TRANSFORMERS_OFFLINE=0 \
    PYTHONUNBUFFERED=1

RUN useradd -m -u 1000 vllm && \
    mkdir -p /app/models && \
    chown -R vllm:vllm /app

USER vllm

HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

EXPOSE 8000

ENTRYPOINT ["python", "-m", "vllm.entrypoints.openai.api_server"]
CMD ["--host", "0.0.0.0", "--port", "8000"]
Docker Compose Stack
Production-Ready Compose File
version: '3.8'

services:
  vllm-1:
    image: vllm/vllm-openai:latest
    container_name: vllm-server-1
    restart: always
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=0
      - HF_TOKEN=${HF_TOKEN}
      - VLLM_USAGE_STATS=0
    volumes:
      - model-cache:/root/.cache/huggingface
      - ./logs:/app/logs
    ipc: host
    shm_size: '16gb'
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    command:
      - --model=meta-llama/Llama-2-7b-chat-hf
      - --max-model-len=4096
      - --gpu-memory-utilization=0.9
      - --max-num-seqs=256
      - --trust-remote-code
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 300s
    networks:
      - vllm-network

  vllm-2:
    image: vllm/vllm-openai:latest
    container_name: vllm-server-2
    restart: always
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=1
      - HF_TOKEN=${HF_TOKEN}
      - VLLM_USAGE_STATS=0
    volumes:
      - model-cache:/root/.cache/huggingface
      - ./logs:/app/logs
    ipc: host
    shm_size: '16gb'
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    command:
      - --model=meta-llama/Llama-2-7b-chat-hf
      - --max-model-len=4096
      - --gpu-memory-utilization=0.9
      - --max-num-seqs=256
      - --trust-remote-code
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 300s
    networks:
      - vllm-network

  nginx:
    image: nginx:alpine
    container_name: vllm-lb
    restart: always
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      vllm-1:
        condition: service_healthy
      vllm-2:
        condition: service_healthy
    networks:
      - vllm-network

volumes:
  model-cache:
    driver: local

networks:
  vllm-network:
    driver: bridge
Environment Variables
Configuration Reference
GPU Memory Guide
Memory Requirements by Model
Kubernetes Deployment
Enterprise-grade orchestration with GPU scheduling and auto-scaling
Deployment Manifest
Basic GPU Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment
  namespace: vllm
  labels:
    app: vllm
    env: production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: vllm
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '8000'
    spec:
      nodeSelector:
        nvidia.com/gpu: 'true'
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.4.0
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8000
              name: http
              protocol: TCP
          args:
            - --model=meta-llama/Llama-2-7b-chat-hf
            - --host=0.0.0.0
            - --port=8000
            - --gpu-memory-utilization=0.9
            - --max-num-seqs=256
            - --max-model-len=4096
            - --tensor-parallel-size=1
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: token
            - name: VLLM_USAGE_STATS
              value: '0'
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 32Gi
              cpu: '8'
            requests:
              nvidia.com/gpu: 1
              memory: 24Gi
              cpu: '4'
          livenessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 300
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 60
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          volumeMounts:
            - name: shm
              mountPath: /dev/shm
            - name: model-cache
              mountPath: /root/.cache/huggingface
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 16Gi
        - name: model-cache
          persistentVolumeClaim:
            claimName: vllm-model-cache
Service & Ingress
Network Configuration
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: vllm
spec:
  type: ClusterIP
  selector:
    app: vllm
  ports:
    - name: http
      port: 80
      targetPort: 8000
      protocol: TCP
  sessionAffinity: None
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-ingress
  namespace: vllm
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/ssl-redirect: 'true'
    nginx.ingress.kubernetes.io/proxy-read-timeout: '3600'
    nginx.ingress.kubernetes.io/proxy-send-timeout: '3600'
    nginx.ingress.kubernetes.io/proxy-body-size: 50m
    nginx.ingress.kubernetes.io/proxy-http-version: '1.1'
    nginx.ingress.kubernetes.io/limit-rps: '100'
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - llm-api.example.com
      secretName: vllm-tls-secret
  rules:
    - host: llm-api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vllm-service
                port:
                  number: 80
Auto-Scaling (HPA)
GPU-Aware Scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: vllm
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-deployment
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL
        target:
          type: AverageValue
          averageValue: '80'
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120
Production Helm Values
Enterprise Configuration
global:
  imagePullSecrets:
    - name: regcred

vllm:
  image:
    repository: vllm/vllm-openai
    tag: v0.4.0
    pullPolicy: IfNotPresent
  model:
    name: meta-llama/Llama-2-13b-chat-hf
    maxModelLen: 4096
    tensorParallelSize: 2
    pipelineParallelSize: 1
    dtype: float16
    quantization: null
  server:
    gpuMemoryUtilization: 0.9
    maxNumSeqs: 256
    maxNumBatchedTokens: 8192
    blockSize: 16
    swapSpace: 4
    trustRemoteCode: true
    enablePrefixCaching: true
  replicaCount: 4
  resources:
    limits:
      nvidia.com/gpu: 2
      memory: 64Gi
      cpu: '16'
    requests:
      nvidia.com/gpu: 2
      memory: 48Gi
      cpu: '8'
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/gpu.product
                operator: In
                values:
                  - NVIDIA-A100-SXM4-80GB
                  - NVIDIA-A100-SXM4-40GB
                  - NVIDIA-H100-80GB-HBM3
  podDisruptionBudget:
    minAvailable: 2
  priorityClassName: high-priority

service:
  type: ClusterIP
  port: 80
  annotations:
    prometheus.io/scrape: 'true'

ingress:
  enabled: true
  className: nginx
  hosts:
    - host: llm-api.example.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: vllm-tls
      hosts:
        - llm-api.example.com

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 8
  targetCPUUtilizationPercentage: 70

monitoring:
  serviceMonitor:
    enabled: true
    interval: 30s
  grafanaDashboard:
    enabled: true

persistence:
  enabled: true
  storageClass: fast-ssd
  size: 200Gi
  accessMode: ReadWriteMany
Production Topologies
Architecture patterns for every scale and use case
Scenario 1: Single GPU Deployment
One GPU, one model, direct access. Perfect for development and small-scale production.
Hardware
Performance
Models
docker run --gpus all -p 8000:8000 vllm/vllm-openai --model meta-llama/Llama-2-7b-chat-hf
Scenario 2: Tensor Parallel (Multi-GPU)
Multiple GPUs on one node. Essential for large models that don't fit on a single GPU.
Hardware
Performance
Config
docker run --gpus all -p 8000:8000 vllm/vllm-openai --model meta-llama/Llama-2-70b-chat-hf --tensor-parallel-size 4
Scenario 3: High Availability Cluster
Load-balanced replicas for fault tolerance and scale. No single point of failure.
Availability
Performance
Config
Scenario 4: Multi-Model Gateway
Single endpoint serving multiple models, routed by the model field in each request.
Resources
Routing
Use Cases
Scenario 5: Multi-Node Pipeline Parallel
Largest models across multiple nodes. Pipeline stages + tensor shards.
Hardware
Parallelism
Performance
ray start --head --port=6379
ray start --address=node1:6379
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3-405b --tensor-parallel-size 8 --pipeline-parallel-size 2
Scenario Comparison
* With INT4 quantization (AWQ/GPTQ)
Additional Features
Beyond PagedAttention, vLLM provides comprehensive features for production LLM deployment including broad model support, quantization, and distributed inference.
70+ Models
All popular open-source architectures
Quantization
GPTQ, AWQ, SqueezeLLM, INT8, FP8
Multi-GPU
Tensor parallelism for large models
Streaming
Server-Sent Events for real-time tokens
Prefix Caching
Automatic KV-cache reuse
Multimodal
LLaVA and vision-language models
LoRA Adapters
Efficient fine-tuned adapter serving
OpenAI API
Drop-in replacement for OpenAI