UC Berkeley Open Source

vLLM

High-throughput LLM inference and serving engine. PagedAttention rethinks KV-cache memory management, delivering up to 24× higher throughput than HuggingFace Transformers.

24×
Higher Throughput
~0%
Memory Waste
70+
Models Supported

What is vLLM?

A high-throughput and memory-efficient inference engine that makes LLM serving fast, affordable, and scalable.

The Memory Management Problem

Large language models build up an attention key-value (KV) cache during generation. Traditional serving systems pre-allocate contiguous memory for the maximum sequence length, wasting 60-80% of GPU memory on sequences that never reach that length.

vLLM's PagedAttention solves this by storing KV-cache in non-contiguous memory blocks. Memory is allocated on-demand as tokens are generated, eliminating waste and enabling much higher batch sizes.

The result? 2-4× more concurrent requests per GPU, translating directly to 2-4× lower cost per token in production.
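The block-table idea can be sketched in a few lines of Python. This is a toy allocator, not vLLM's actual block manager; the block size, block count, and sequence lengths are illustrative:

```python
# Toy model of paged KV-cache allocation (not vLLM's real block manager).
# Each sequence gets fixed-size blocks on demand; a block table maps the
# sequence's logical blocks to physical block IDs.

BLOCK_SIZE = 16  # tokens per KV-cache block

class PagedKVAllocator:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # seq_id -> physical blocks

    def reserve(self, seq_id: str, seq_len: int) -> None:
        """Allocate a new physical block only when the last one fills up."""
        table = self.block_tables.setdefault(seq_id, [])
        blocks_needed = -(-seq_len // BLOCK_SIZE)  # ceiling division
        while len(table) < blocks_needed:
            table.append(self.free_blocks.pop())

    def free(self, seq_id: str) -> None:
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

alloc = PagedKVAllocator(num_blocks=64)
alloc.reserve("seq1", seq_len=20)  # 20 tokens -> 2 blocks
alloc.reserve("seq2", seq_len=5)   # 5 tokens  -> 1 block

# Paged: 3 blocks in use. Pre-allocating a 4096-token maximum for each
# sequence would instead have reserved 512 blocks for these two sequences.
used = sum(len(t) for t in alloc.block_tables.values())
print(used)  # 3
```

Because memory is claimed one block at a time, the waste per sequence is bounded by a single partially filled block rather than the full padded maximum.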

📥

API Request

OpenAI-compatible REST API

📊

Scheduler

Continuous batching, priorities

LLM Engine

PagedAttention, model execution

📤

Token Streaming

Real-time SSE response

Under the Hood

A deep dive into vLLM's modular, high-performance architecture

🌐
API Server
Interface
OpenAI-Compatible API

REST endpoints for drop-in replacement

AsyncLLMEngine

Async interface for direct integration

Streaming Support

SSE for real-time streaming

Multi-Model Serving

Serve multiple models from single deployment

🔧
LLM Engine
Core
Execution Loop

Main step() loop coordinating iterations

Model Workers

GPU processes holding model weights

Tokenizer

HuggingFace tokenizers

Sampling

Temperature, top-p, top-k, beam search

📋
Scheduler
Orchestration
Continuous Batching

Dynamic request addition/removal

Preemption

Pause low-priority requests

Priority Queues

High/low priority for SLAs

Prefix Caching

Reuse KV-cache for common prefixes
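Prefix caching works by content-addressing full KV blocks: a block is keyed by all tokens up to and including it, so requests that share a prompt prefix map to the same physical block. A toy sketch (illustrative only; vLLM's real implementation lives in its block manager):

```python
# Toy prefix cache: full blocks of tokens are keyed by the entire token
# prefix ending at that block, so identical prefixes share physical blocks.

BLOCK = 4  # tokens per block (small for illustration)

class PrefixCache:
    def __init__(self):
        self.cache: dict[tuple, int] = {}  # token prefix -> physical block id
        self.next_block = 0

    def map_sequence(self, tokens: list[int]) -> tuple[list[int], int]:
        """Return (block_ids, num_cache_hits) for a token sequence."""
        blocks, hits = [], 0
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            key = tuple(tokens[:end])  # content-addressed by full prefix
            if key in self.cache:
                hits += 1
            else:
                self.cache[key] = self.next_block
                self.next_block += 1
            blocks.append(self.cache[key])
        return blocks, hits

pc = PrefixCache()
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]  # shared two-block prefix
blocks_a, hits_a = pc.map_sequence(system_prompt + [9, 10, 11, 12])
blocks_b, hits_b = pc.map_sequence(system_prompt + [20, 21, 22, 23])
print(hits_a, hits_b)  # 0 2 -- the second request reuses both prefix blocks
```

This is why a long shared system prompt costs its prefill only once across many concurrent chats.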

💾
Block Manager
Memory
PagedAttention

Non-contiguous KV-cache in fixed blocks

Block Tables

Logical-to-physical block mapping

Copy-on-Write

Efficient sharing for beam search

GPU↔CPU Swapping

Move blocks between GPU/CPU

Execution Layer
Compute
Custom Attention Kernels

Optimized CUDA for paged KV-cache

Flash Attention

Memory-efficient tiled attention

Tensor Parallelism

Multi-GPU with NCCL

CUDA Graphs

Capture and replay

The Magic Inside

Four key innovations that make vLLM the fastest LLM serving engine

📄

PagedAttention

Virtual memory for KV-cache

Traditional LLM serving pre-allocates KV-cache for the maximum sequence length, wasting 60-80% of that memory to fragmentation. PagedAttention stores the KV-cache in fixed-size blocks allocated on demand.

Traditional (contiguous pre-allocation):
[Seq1][Seq1][Pad][Pad]  [Seq2][Pad][Pad][Pad]  [Free][Free]  [Seq3][Pad][Free][Free][Free][Free]

PagedAttention (fixed-size blocks, allocated on demand):
[S1-B0][S1-B1][S2-B0][S2-B1]  [S3-B0][Shared][S4-B0][S4-B1]  [S5-B0][S3-B1][S6-B0][S6-B1]  [Free][Free][Free][Free]
~0%
Memory Waste
2-4×
Batch Capacity
📊

Continuous Batching

Dynamic request scheduling

Traditional static batching waits for every request in a batch to finish before admitting new ones. Continuous batching adds and removes requests between iterations, keeping the GPU saturated.

Req 1  ●●●
Req 2  ●●●●●
Req 3  ●●●●
Req 4  ●●●●●●
3-5×
Throughput Gain
95%+
GPU Utilization
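The scheduling loop can be simulated in a few lines: new requests join the running batch at iteration boundaries instead of waiting for the whole batch to drain. This is an illustrative toy loop, not vLLM's Scheduler class; request IDs and lengths are made up:

```python
# Toy simulation of continuous batching: each iteration, every running
# request emits one token, finished requests leave, and waiting requests
# are admitted into freed batch slots.
from collections import deque

def continuous_batching(requests, max_batch=3):
    """requests: list of (req_id, tokens_to_generate, arrival_step)."""
    waiting = deque(sorted(requests, key=lambda r: r[2]))
    running, finished, step = [], [], 0
    while waiting or running:
        # Admit newly arrived requests into any free batch slots.
        while waiting and waiting[0][2] <= step and len(running) < max_batch:
            rid, n, _ = waiting.popleft()
            running.append([rid, n])
        # One model iteration: every running request generates one token.
        for req in running:
            req[1] -= 1
        finished += [(r[0], step) for r in running if r[1] == 0]
        running = [r for r in running if r[1] > 0]
        step += 1
    return dict(finished)  # req_id -> completion step

done = continuous_batching([("A", 3, 0), ("B", 5, 0), ("C", 2, 1)])
print(done)  # {'A': 2, 'C': 2, 'B': 4}
```

Note that request C, arriving mid-flight, starts at step 1 without waiting for A and B to finish; with static batching it would have idled until step 5.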

CUDA Graphs

Eliminate kernel launch overhead

Each iteration launches hundreds of CUDA kernels. CUDA Graphs capture the entire sequence once, then replay it with minimal CPU overhead, reducing latency by 10-20%.

Without Graphs
Launch K1 → Execute → Launch K2 → Execute → ...
500μs+ CPU overhead
With Graphs
Graph Launch → K1→K2→K3...
~5μs total overhead
10-20%
Latency Reduction
~0
CPU Overhead
🎯

Speculative Decoding

Parallel token verification

Large models are slow per token. Speculative decoding uses a small draft model to propose several tokens ahead, then verifies them all in a single forward pass of the target model. Accepted tokens skip the expensive sequential computation.

Draft: The → quick → brown → fox
↓ Verify all in ONE forward pass
Target: ✓The → ✓quick → ✓brown → ✗fox→dog
Result: 4 tokens emitted from one target forward pass (3 accepted + 1 corrected)
2-3×
Generation Speedup
0
Quality Loss

Request Lifecycle

Step-by-step journey of a request through vLLM

1
📥

API Receipt

Parse request JSON. Validate parameters. Queue for processing.

2
📝

Tokenization

Convert prompt text to token IDs using model tokenizer.

3
📊

Scheduling

Allocate memory blocks. Join batch. Handle priorities.

4

Prefill

Process prompt tokens. Build KV-cache. Compute first token.

5
🔄

Decode

Generate iteratively. Update KV-cache. Sample next token.

6
📤

Stream Output

Detokenize. Stream via SSE. Return completion.
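The six stages can be strung together as a minimal pipeline. The tokenizer, "model", and sampling rule below are trivial stand-ins chosen so the flow is visible, not anything vLLM actually does:

```python
# The request lifecycle as a toy pipeline: tokenize -> prefill (build KV
# cache, first token) -> iterative decode -> output.

def tokenize(text: str) -> list[int]:
    vocab = {w: i for i, w in enumerate(["hello", "world", "!"], start=1)}
    return [vocab.get(w, 0) for w in text.split()]

def prefill(prompt_ids: list[int]):
    kv_cache = list(prompt_ids)          # stand-in for the real KV cache
    return kv_cache, max(prompt_ids) + 1  # toy "sampling" of first token

def decode_step(kv_cache: list[int], last_token: int) -> int:
    kv_cache.append(last_token)          # update KV cache
    return last_token + 1                # toy next-token rule

def handle_request(text: str, max_new_tokens: int = 3) -> list[int]:
    prompt_ids = tokenize(text)          # stage 2: tokenization
    kv, tok = prefill(prompt_ids)        # stage 4: prefill
    out = [tok]
    for _ in range(max_new_tokens - 1):  # stage 5: iterative decode
        tok = decode_step(kv, tok)
        out.append(tok)
    return out                           # stage 6: detokenize and stream

print(handle_request("hello world"))  # [3, 4, 5]
```

The key structural point survives the toy: prefill touches the whole prompt once, while every later token is one cheap decode step against the growing KV cache.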

Real-World Results

Throughput comparison across LLM serving solutions

Relative throughput (higher is better):
PyTorch / HuggingFace Transformers — baseline (1×)
HuggingFace TGI — 2-3×
vLLM — 24×

Optimized For Your Workload

💬

Chat Applications

Interactive chatbots and assistants. Low latency streaming with high concurrency.

<100ms
TTFT
1000s
Concurrent
📦

Batch Processing

Offline processing of large datasets. Maximum throughput for document analysis.

24×
Throughput
90%
Cost Down
🚀

Production Serving

Scalable API endpoints. OpenAI-compatible for easy integration.

99.9%
Uptime
100%
API Compatible

Docker Deployment

Containerized inference with GPU support and optimized configurations

🐳

Quick Start

Official Docker Image

bash docker run
docker run --gpus all \
  -d \
  --name vllm-server \
  -p 8000:8000 \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-2-7b-chat-hf \
  --max-model-len 4096
⚠️ Requires NVIDIA Container Toolkit
📥 First run downloads model (~14GB for 7B)
🔑 HF_TOKEN needed for gated models
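Once the container is up, it can be queried with nothing but the standard library. The endpoint and payload follow the OpenAI chat-completions schema that vLLM exposes; the model name must match whatever the server loaded:

```python
# Minimal client for a local vLLM server (OpenAI-compatible API).
import json
import urllib.request

def build_chat_request(model: str, user_message: str) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": 64,
    }

def chat(base_url: str, payload: dict) -> dict:
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# e.g. with the container from the quick start running:
# chat("http://localhost:8000",
#      build_chat_request("meta-llama/Llama-2-7b-chat-hf", "Hello!"))
```

For production clients the official `openai` SDK also works unchanged; only the `base_url` needs to point at the vLLM server.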
📦

Production Dockerfile

Optimized for Production

Dockerfile Dockerfile.prod
FROM vllm/vllm-openai:v0.4.0 AS base
WORKDIR /app
ENV VLLM_USAGE_STATS=0 \
    HF_HOME=/app/models \
    TRANSFORMERS_OFFLINE=0 \
    PYTHONUNBUFFERED=1

RUN useradd -m -u 1000 vllm && \
    mkdir -p /app/models && \
    chown -R vllm:vllm /app

USER vllm

HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
  CMD curl -f http://localhost:8000/health || exit 1

EXPOSE 8000

ENTRYPOINT ["python", "-m", "vllm.entrypoints.openai.api_server"]
CMD ["--host", "0.0.0.0", "--port", "8000"]
🎼

Docker Compose Stack

Production-Ready Compose File

yaml docker-compose.yml
version: '3.8'

services:
  vllm-1:
    image: vllm/vllm-openai:latest
    container_name: vllm-server-1
    restart: always
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=0
      - HF_TOKEN=${HF_TOKEN}
      - VLLM_USAGE_STATS=0
    volumes:
      - model-cache:/root/.cache/huggingface
      - ./logs:/app/logs
    ipc: host
    shm_size: '16gb'
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    command:
      - --model=meta-llama/Llama-2-7b-chat-hf
      - --max-model-len=4096
      - --gpu-memory-utilization=0.9
      - --max-num-seqs=256
      - --trust-remote-code
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 300s
    networks:
      - vllm-network

  vllm-2:
    image: vllm/vllm-openai:latest
    container_name: vllm-server-2
    restart: always
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=1
      - HF_TOKEN=${HF_TOKEN}
      - VLLM_USAGE_STATS=0
    volumes:
      - model-cache:/root/.cache/huggingface
      - ./logs:/app/logs
    ipc: host
    shm_size: '16gb'
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    command:
      - --model=meta-llama/Llama-2-7b-chat-hf
      - --max-model-len=4096
      - --gpu-memory-utilization=0.9
      - --max-num-seqs=256
      - --trust-remote-code
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 300s
    networks:
      - vllm-network

  nginx:
    image: nginx:alpine
    container_name: vllm-lb
    restart: always
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      vllm-1:
        condition: service_healthy
      vllm-2:
        condition: service_healthy
    networks:
      - vllm-network

volumes:
  model-cache:
    driver: local

networks:
  vllm-network:
    driver: bridge
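The nginx service above mounts an `./nginx.conf` that is not shown. A minimal configuration for the two replicas might look like the following; the `least_conn` balancing choice and the timeouts are suggestions, not requirements:

```nginx
events {}

http {
    upstream vllm_backends {
        least_conn;                  # route to the least-busy replica
        server vllm-1:8000;
        server vllm-2:8000;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://vllm_backends;
            proxy_http_version 1.1;
            proxy_buffering off;     # required for SSE token streaming
            proxy_read_timeout 3600s;  # long-running generations
        }
    }
}
```

Disabling proxy buffering matters: with buffering on, nginx would hold streamed tokens until the response completes, defeating SSE.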
⚙️

Environment Variables

Configuration Reference

Variable Default Description
PYTORCH_CUDA_ALLOC_CONF - CUDA memory allocator config
VLLM_USAGE_STATS 1 Usage statistics (0=disabled)
HF_TOKEN - HuggingFace access token
CUDA_VISIBLE_DEVICES all GPU device selection
OMP_NUM_THREADS auto OpenMP thread count
NCCL_DEBUG WARN NCCL debug verbosity
RAY_ADDRESS - Ray cluster address
🧮

GPU Memory Guide

Memory Requirements by Model

Model FP16 INT8 INT4 GPU
7B ~16GB ~10GB ~6GB 24GB+
13B ~28GB ~16GB ~10GB 40GB+
34B ~72GB ~40GB ~22GB 80GB+
70B ~150GB ~80GB ~42GB 2×80GB
💡 KV cache adds ~2-4GB per 1000 concurrent tokens
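The table values can be sanity-checked with a back-of-envelope estimate: weights plus KV cache. The KV-cache formula (2 tensors × layers × KV heads × head dim × bytes per value) is standard; the Llama-2-7B shape numbers below are from its public config:

```python
# Back-of-envelope GPU memory estimate for an LLM deployment.

def weights_gb(params_b: float, bytes_per_param: float) -> float:
    """Model weights in GB (params in billions)."""
    return params_b * 1e9 * bytes_per_param / 1e9

def kv_cache_gb(tokens: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_val: int = 2) -> float:
    """KV cache for `tokens` concurrent tokens, FP16 by default."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_val  # K and V
    return tokens * per_token / 1e9

# Llama-2-7B: 32 layers, 32 KV heads, head_dim 128, FP16 weights.
w = weights_gb(7, 2)
kv = kv_cache_gb(tokens=4096, layers=32, kv_heads=32, head_dim=128)
print(f"weights ~{w:.0f} GB, 4096-token KV cache ~{kv:.2f} GB")
# weights ~14 GB, 4096-token KV cache ~2.15 GB
```

This matches the table's ~16GB FP16 figure for a 7B model once activation workspace is added, and shows why the KV cache, not the weights, is what limits concurrency. Models using grouped-query attention (fewer KV heads) shrink the second term substantially.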

Kubernetes Deployment

Enterprise-grade orchestration with GPU scheduling and auto-scaling

☸️

Deployment Manifest

Basic GPU Deployment

yaml deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment
  namespace: vllm
  labels:
    app: vllm
    env: production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: vllm
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '8000'
    spec:
      nodeSelector:
        nvidia.com/gpu: 'true'
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.4.0
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8000
              name: http
              protocol: TCP
          args:
            - --model=meta-llama/Llama-2-7b-chat-hf
            - --host=0.0.0.0
            - --port=8000
            - --gpu-memory-utilization=0.9
            - --max-num-seqs=256
            - --max-model-len=4096
            - --tensor-parallel-size=1
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: token
            - name: VLLM_USAGE_STATS
              value: '0'
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 32Gi
              cpu: '8'
            requests:
              nvidia.com/gpu: 1
              memory: 24Gi
              cpu: '4'
          livenessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 300
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 60
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          volumeMounts:
            - name: shm
              mountPath: /dev/shm
            - name: model-cache
              mountPath: /root/.cache/huggingface
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 16Gi
        - name: model-cache
          persistentVolumeClaim:
            claimName: vllm-model-cache
🌐

Service & Ingress

Network Configuration

yaml service.yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: vllm
spec:
  type: ClusterIP
  selector:
    app: vllm
  ports:
    - name: http
      port: 80
      targetPort: 8000
      protocol: TCP
  sessionAffinity: None
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-ingress
  namespace: vllm
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/ssl-redirect: 'true'
    nginx.ingress.kubernetes.io/proxy-read-timeout: '3600'
    nginx.ingress.kubernetes.io/proxy-send-timeout: '3600'
    nginx.ingress.kubernetes.io/proxy-body-size: 50m
    nginx.ingress.kubernetes.io/proxy-http-version: '1.1'
    nginx.ingress.kubernetes.io/limit-rps: '100'
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - llm-api.example.com
      secretName: vllm-tls-secret
  rules:
    - host: llm-api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vllm-service
                port:
                  number: 80
📈

Auto-Scaling (HPA)

GPU-Aware Scaling

yaml hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: vllm
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-deployment
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL
        target:
          type: AverageValue
          averageValue: '80'
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120

Production Helm Values

Enterprise Configuration

yaml values.yaml
global:
  imagePullSecrets:
    - name: regcred

vllm:
  image:
    repository: vllm/vllm-openai
    tag: v0.4.0
    pullPolicy: IfNotPresent

  model:
    name: meta-llama/Llama-2-13b-chat-hf
    maxModelLen: 4096
    tensorParallelSize: 2
    pipelineParallelSize: 1
    dtype: float16
    quantization: null

  server:
    gpuMemoryUtilization: 0.9
    maxNumSeqs: 256
    maxNumBatchedTokens: 8192
    blockSize: 16
    swapSpace: 4
    trustRemoteCode: true
    enablePrefixCaching: true

  replicaCount: 4

  resources:
    limits:
      nvidia.com/gpu: 2
      memory: 64Gi
      cpu: '16'
    requests:
      nvidia.com/gpu: 2
      memory: 48Gi
      cpu: '8'

  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/gpu.product
                operator: In
                values:
                  - NVIDIA-A100-SXM4-80GB
                  - NVIDIA-A100-SXM4-40GB
                  - NVIDIA-H100-80GB-HBM3

  podDisruptionBudget:
    minAvailable: 2

  priorityClassName: high-priority

service:
  type: ClusterIP
  port: 80
  annotations:
    prometheus.io/scrape: 'true'

ingress:
  enabled: true
  className: nginx
  hosts:
    - host: llm-api.example.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: vllm-tls
      hosts:
        - llm-api.example.com

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 8
  targetCPUUtilizationPercentage: 70

monitoring:
  serviceMonitor:
    enabled: true
    interval: 30s
  grafanaDashboard:
    enabled: true

persistence:
  enabled: true
  storageClass: fast-ssd
  size: 200Gi
  accessMode: ReadWriteMany

Production Topologies

Architecture patterns for every scale and use case

Development

Scenario 1: Single GPU Deployment

One GPU, one model, direct access. Perfect for development and small-scale production.

👥
Clients
HTTP/REST
:8000
🖥️
vLLM Server
API + Scheduler
CUDA
🎮
GPU 0
LLaMA-7B (FP16)

Hardware

GPU 1× A100-40GB
RAM 64GB+
CPU 8+ cores

Performance

Throughput ~2000 tok/s
TTFT <50ms
Concurrency 50-100

Models

FP16 ≤13B
INT4 ≤70B
Docker docker run --gpus all -p 8000:8000 vllm/vllm-openai --model meta-llama/Llama-2-7b-chat-hf
Production

Scenario 2: Tensor Parallel (Multi-GPU)

Multiple GPUs on one node. Essential for large models that don't fit on a single GPU.

👥
Clients
HTTP/REST
:8000
🖥️
vLLM Server (TP=4)
Tensor Parallel Coordinator
NCCL All-Reduce
🎮
GPU 0
Shard 0
🎮
GPU 1
Shard 1
🎮
GPU 2
Shard 2
🎮
GPU 3
Shard 3
LLaMA-70B (FP16)
Distributed across 4× A100-80GB

Hardware

GPUs 4× A100-80GB
Interconnect NVLink
RAM 256GB+

Performance

Throughput ~1500 tok/s
TTFT ~100ms
Concurrency 200-400

Config

TP Size 4
Mem Util 0.9
Docker docker run --gpus all -p 8000:8000 vllm/vllm-openai --model meta-llama/Llama-2-70b-chat-hf --tensor-parallel-size 4
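The core of tensor parallelism is sharding each weight matrix across GPUs. A pure-Python sketch of a column-parallel layer, with plain lists standing in for CUDA tensors and the NCCL collectives:

```python
# Column-parallel matmul: each "GPU" holds a vertical slice of the weight
# matrix, computes its slice of the output independently, and the slices
# are gathered by concatenation. (Row-parallel layers instead combine
# partial sums with an all-reduce.)

def matmul(x, W):
    """x: input vector; W: list of weight columns."""
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in W]

def column_parallel(x, W, tp_size):
    shard = len(W) // tp_size
    shards = [W[i * shard:(i + 1) * shard] for i in range(tp_size)]
    partials = [matmul(x, s) for s in shards]  # one per "GPU", no sync needed
    return [v for p in partials for v in p]    # all-gather (concatenate)

x = [1.0, 2.0]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]  # 4 output columns

# Sharded result is identical to the unsharded one.
assert column_parallel(x, W, tp_size=4) == matmul(x, W)
print(column_parallel(x, W, tp_size=2))  # [1.0, 2.0, 3.0, 0.0]
```

Because each shard's compute is independent, per-GPU memory and FLOPs both drop by the TP factor; the price is the NCCL communication at layer boundaries, which is why NVLink matters at TP=4 and above.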
Enterprise

Scenario 3: High Availability Cluster

Load-balanced replicas for fault tolerance and scale. No single point of failure.

👥
Clients
1000s req/s
HTTPS
⚖️
Load Balancer
Round Robin / Least Conn
Distribute
🖥️
Replica 1
Node 1: 2× GPU
🖥️
Replica 2
Node 2: 2× GPU
🖥️
Replica 3
Node 3: 2× GPU
💾
Model Storage
Shared Model Cache

Availability

Uptime SLA 99.9%
Fault Tolerance N-1
Recovery ~5 min

Performance

Throughput ~6000 tok/s
Latency Same
Concurrency 600-1200

Config

Replicas 3+
PDB 2
Enterprise

Scenario 4: Multi-Model Gateway

Single endpoint serving multiple models. Route by request parameter.

👥
Clients
"model": "llama-70b"
:8000
🔀
API Gateway
Model Router
Route
🦙
LLaMA-70B
4× GPU (TP=4)
Premium
🌪️
Mistral-7B
1× GPU
Standard
💻
CodeLlama-34B
2× GPU (TP=2)
Code

Resources

Total GPUs 7× A100
Models 3

Routing

Model Field "model"
Default Mistral-7B

Use Cases

Analysis LLaMA-70B
Chat Mistral-7B
Code CodeLlama
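The routing layer of this scenario reduces to a lookup on the request's `"model"` field. A minimal sketch; the backend hostnames are hypothetical:

```python
# Minimal model router for a multi-model gateway: pick a backend from the
# request's "model" field, falling back to the smallest model.

BACKENDS = {
    "llama-70b": "http://llama70b:8000",       # 4x GPU (TP=4), premium
    "mistral-7b": "http://mistral7b:8000",     # 1x GPU, standard
    "codellama-34b": "http://codellama:8000",  # 2x GPU (TP=2), code
}
DEFAULT = "mistral-7b"

def route(request: dict) -> str:
    """Return the backend URL for an OpenAI-style request body."""
    model = request.get("model", DEFAULT)
    return BACKENDS.get(model, BACKENDS[DEFAULT])

print(route({"model": "llama-70b"}))  # http://llama70b:8000
print(route({}))                      # default: http://mistral7b:8000
```

In practice this logic lives in the API gateway (or an nginx `map` block), so clients see one endpoint while each model keeps its own dedicated vLLM deployment.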
Hyperscale

Scenario 5: Multi-Node Pipeline Parallel

Largest models across multiple nodes. Pipeline stages + tensor shards.

👥
Clients
API
🎯
Ray Coordinator
Head Node
Pipeline Flow
Node 1 (PP=0) Layers 0-39
G0
G1
G2
G3
G4
G5
G6
G7
IB 400G
Node 2 (PP=1) Layers 40-79
G0
G1
G2
G3
G4
G5
G6
G7
LLaMA-405B / Frontier Model
PP=2, TP=8 → 16× H100-80GB

Hardware

GPUs 16× H100
Network 400Gb/s IB
RAM 4TB+

Parallelism

PP Size 2
TP Size 8
World Size 16

Performance

Throughput ~500 tok/s
TTFT ~500ms
Node 1 ray start --head --port=6379
Node 2 ray start --address=node1:6379
vLLM python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-405B --tensor-parallel-size 8 --pipeline-parallel-size 2

Scenario Comparison

Scenario GPUs Max Model Throughput Availability Complexity Best For
Single GPU 1 13B/70B* 2K None Low Dev/Test
Tensor Parallel 2-8 70B 1.5-3K None Medium Large Models
HA Cluster 6+ 70B 6K+ 99.9% Medium Production
Multi-Model 7+ Mixed Varies Optional High A/B Testing
Pipeline Parallel 16+ 405B+ 500+ Optional High Research

* With INT4 quantization (AWQ/GPTQ)

Production Ready

Additional Features

Beyond PagedAttention, vLLM provides comprehensive features for production LLM deployment including broad model support, quantization, and distributed inference.

🤖

70+ Models

All popular open-source architectures

📉

Quantization

GPTQ, AWQ, SqueezeLLM, INT8, FP8

🔀

Multi-GPU

Tensor parallelism for large models

📡

Streaming

Server-Sent Events for real-time tokens

💾

Prefix Caching

Automatic KV-cache reuse

🖼️

Multimodal

LLaVA and vision-language models

🔧

LoRA Adapters

Efficient fine-tuned adapter serving

🔌

OpenAI API

Drop-in replacement for OpenAI