UC Berkeley Open Source

vLLM

High-throughput LLM inference and serving engine. PagedAttention rethinks KV-cache memory management, delivering up to 24× higher throughput than HuggingFace Transformers.

24×
Higher Throughput
~0%
Memory Waste
70+
Models Supported

What is vLLM?

A high-throughput and memory-efficient inference engine that makes LLM serving fast, affordable, and scalable.

The Memory Management Problem

Large language models build up an attention key-value (KV) cache during generation. Traditional serving systems pre-allocate contiguous memory for the maximum sequence length, wasting 60-80% of GPU memory on sequences that never reach that length.

vLLM's PagedAttention solves this by storing KV-cache in non-contiguous memory blocks. Memory is allocated on-demand as tokens are generated, eliminating waste and enabling much higher batch sizes.

The result? 2-4× more concurrent requests per GPU, translating directly to 2-4× lower cost per token in production.
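The block-table idea can be sketched in a few lines of Python. This is a toy allocator, not vLLM's actual block manager; the block size, block count, and sequence lengths are illustrative:

```python
# Toy model of paged KV-cache allocation (not vLLM's real block manager).
# Each sequence gets fixed-size blocks on demand; a block table maps the
# sequence's logical blocks to physical block IDs.

BLOCK_SIZE = 16  # tokens per KV-cache block

class PagedKVAllocator:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # seq_id -> physical blocks

    def reserve(self, seq_id: str, seq_len: int) -> None:
        """Allocate a new physical block only when the last one fills up."""
        table = self.block_tables.setdefault(seq_id, [])
        blocks_needed = -(-seq_len // BLOCK_SIZE)  # ceiling division
        while len(table) < blocks_needed:
            table.append(self.free_blocks.pop())

    def free(self, seq_id: str) -> None:
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

alloc = PagedKVAllocator(num_blocks=64)
alloc.reserve("seq1", seq_len=20)  # 20 tokens -> 2 blocks
alloc.reserve("seq2", seq_len=5)   # 5 tokens  -> 1 block

# Paged: 3 blocks in use. Pre-allocating a 4096-token maximum for each
# sequence would instead have reserved 512 blocks for these two sequences.
used = sum(len(t) for t in alloc.block_tables.values())
print(used)  # 3
```

Because memory is claimed one block at a time, the waste per sequence is bounded by a single partially filled block rather than the full padded maximum.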

📥

API Request

OpenAI-compatible REST API

📊

Scheduler

Continuous batching, priorities

LLM Engine

PagedAttention, model execution

📤

Token Streaming

Real-time SSE response

Under the Hood

A deep dive into vLLM's modular, high-performance architecture

🌐
API Server
Interface
OpenAI-Compatible API

REST endpoints for drop-in replacement

AsyncLLMEngine

Async interface for direct integration

Streaming Support

SSE for real-time streaming

Multi-Model Serving

Serve multiple models from single deployment

🔧
LLM Engine
Core
Execution Loop

Main step() loop coordinating iterations

Model Workers

GPU processes holding model weights

Tokenizer

HuggingFace tokenizers

Sampling

Temperature, top-p, top-k, beam search

📋
Scheduler
Orchestration
Continuous Batching

Dynamic request addition/removal

Preemption

Pause low-priority requests

Priority Queues

High/low priority for SLAs

Prefix Caching

Reuse KV-cache for common prefixes
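Prefix caching works by content-addressing full KV blocks: a block is keyed by all tokens up to and including it, so requests that share a prompt prefix map to the same physical block. A toy sketch (illustrative only; vLLM's real implementation lives in its block manager):

```python
# Toy prefix cache: full blocks of tokens are keyed by the entire token
# prefix ending at that block, so identical prefixes share physical blocks.

BLOCK = 4  # tokens per block (small for illustration)

class PrefixCache:
    def __init__(self):
        self.cache: dict[tuple, int] = {}  # token prefix -> physical block id
        self.next_block = 0

    def map_sequence(self, tokens: list[int]) -> tuple[list[int], int]:
        """Return (block_ids, num_cache_hits) for a token sequence."""
        blocks, hits = [], 0
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            key = tuple(tokens[:end])  # content-addressed by full prefix
            if key in self.cache:
                hits += 1
            else:
                self.cache[key] = self.next_block
                self.next_block += 1
            blocks.append(self.cache[key])
        return blocks, hits

pc = PrefixCache()
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]  # shared two-block prefix
blocks_a, hits_a = pc.map_sequence(system_prompt + [9, 10, 11, 12])
blocks_b, hits_b = pc.map_sequence(system_prompt + [20, 21, 22, 23])
print(hits_a, hits_b)  # 0 2 -- the second request reuses both prefix blocks
```

This is why a long shared system prompt costs its prefill only once across many concurrent chats.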

💾
Block Manager
Memory
PagedAttention

Non-contiguous KV-cache in fixed blocks

Block Tables

Logical-to-physical block mapping

Copy-on-Write

Efficient sharing for beam search

GPU↔CPU Swapping

Move blocks between GPU/CPU

Execution Layer
Compute
Custom Attention Kernels

Optimized CUDA for paged KV-cache

Flash Attention

Memory-efficient tiled attention

Tensor Parallelism

Multi-GPU with NCCL

CUDA Graphs

Capture and replay

The Magic Inside

Four key innovations that make vLLM the fastest LLM serving engine

📄

PagedAttention

Virtual memory for KV-cache

Traditional LLM serving pre-allocates KV-cache for the maximum sequence length, wasting 60-80% of that memory to fragmentation. PagedAttention stores the KV-cache in fixed-size blocks allocated on demand.

Traditional (contiguous pre-allocation):
[Seq1][Seq1][Pad][Pad]  [Seq2][Pad][Pad][Pad]  [Free][Free]  [Seq3][Pad][Free][Free][Free][Free]

PagedAttention (fixed-size blocks, allocated on demand):
[S1-B0][S1-B1][S2-B0][S2-B1]  [S3-B0][Shared][S4-B0][S4-B1]  [S5-B0][S3-B1][S6-B0][S6-B1]  [Free][Free][Free][Free]
~0%
Memory Waste
2-4×
Batch Capacity
📊

Continuous Batching

Dynamic request scheduling

Traditional static batching waits for every request in a batch to finish before admitting new ones. Continuous batching adds and removes requests between iterations, keeping the GPU saturated.

Req 1  ●●●
Req 2  ●●●●●
Req 3  ●●●●
Req 4  ●●●●●●
3-5×
Throughput Gain
95%+
GPU Utilization
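The scheduling loop can be simulated in a few lines: new requests join the running batch at iteration boundaries instead of waiting for the whole batch to drain. This is an illustrative toy loop, not vLLM's Scheduler class; request IDs and lengths are made up:

```python
# Toy simulation of continuous batching: each iteration, every running
# request emits one token, finished requests leave, and waiting requests
# are admitted into freed batch slots.
from collections import deque

def continuous_batching(requests, max_batch=3):
    """requests: list of (req_id, tokens_to_generate, arrival_step)."""
    waiting = deque(sorted(requests, key=lambda r: r[2]))
    running, finished, step = [], [], 0
    while waiting or running:
        # Admit newly arrived requests into any free batch slots.
        while waiting and waiting[0][2] <= step and len(running) < max_batch:
            rid, n, _ = waiting.popleft()
            running.append([rid, n])
        # One model iteration: every running request generates one token.
        for req in running:
            req[1] -= 1
        finished += [(r[0], step) for r in running if r[1] == 0]
        running = [r for r in running if r[1] > 0]
        step += 1
    return dict(finished)  # req_id -> completion step

done = continuous_batching([("A", 3, 0), ("B", 5, 0), ("C", 2, 1)])
print(done)  # {'A': 2, 'C': 2, 'B': 4}
```

Note that request C, arriving mid-flight, starts at step 1 without waiting for A and B to finish; with static batching it would have idled until step 5.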

CUDA Graphs

Eliminate kernel launch overhead

Each iteration launches hundreds of CUDA kernels. CUDA Graphs capture the entire sequence once, then replay it with minimal CPU overhead, reducing latency by 10-20%.

Without Graphs
Launch K1 → Execute → Launch K2 → Execute → ...
500μs+ CPU overhead
With Graphs
Graph Launch → K1→K2→K3...
~5μs total overhead
10-20%
Latency Reduction
~0
CPU Overhead
🎯

Speculative Decoding

Parallel token verification

Large models are slow per token. Speculative decoding uses a small draft model to propose several tokens ahead, then verifies them all in a single forward pass of the target model. Accepted tokens skip the expensive sequential computation.

Draft: The → quick → brown → fox
↓ Verify all in ONE forward pass
Target: ✓The → ✓quick → ✓brown → ✗fox→dog
Result: 4 tokens emitted from one target forward pass (3 accepted + 1 corrected)
2-3×
Generation Speedup
0
Quality Loss

Request Lifecycle

Step-by-step journey of a request through vLLM

1
📥

API Receipt

Parse request JSON. Validate parameters. Queue for processing.

2
📝

Tokenization

Convert prompt text to token IDs using model tokenizer.

3
📊

Scheduling

Allocate memory blocks. Join batch. Handle priorities.

4

Prefill

Process prompt tokens. Build KV-cache. Compute first token.

5
🔄

Decode

Generate iteratively. Update KV-cache. Sample next token.

6
📤

Stream Output

Detokenize. Stream via SSE. Return completion.
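The six stages can be strung together as a minimal pipeline. The tokenizer, "model", and sampling rule below are trivial stand-ins chosen so the flow is visible, not anything vLLM actually does:

```python
# The request lifecycle as a toy pipeline: tokenize -> prefill (build KV
# cache, first token) -> iterative decode -> output.

def tokenize(text: str) -> list[int]:
    vocab = {w: i for i, w in enumerate(["hello", "world", "!"], start=1)}
    return [vocab.get(w, 0) for w in text.split()]

def prefill(prompt_ids: list[int]):
    kv_cache = list(prompt_ids)          # stand-in for the real KV cache
    return kv_cache, max(prompt_ids) + 1  # toy "sampling" of first token

def decode_step(kv_cache: list[int], last_token: int) -> int:
    kv_cache.append(last_token)          # update KV cache
    return last_token + 1                # toy next-token rule

def handle_request(text: str, max_new_tokens: int = 3) -> list[int]:
    prompt_ids = tokenize(text)          # stage 2: tokenization
    kv, tok = prefill(prompt_ids)        # stage 4: prefill
    out = [tok]
    for _ in range(max_new_tokens - 1):  # stage 5: iterative decode
        tok = decode_step(kv, tok)
        out.append(tok)
    return out                           # stage 6: detokenize and stream

print(handle_request("hello world"))  # [3, 4, 5]
```

The key structural point survives the toy: prefill touches the whole prompt once, while every later token is one cheap decode step against the growing KV cache.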

Real-World Results

Throughput comparison across LLM serving solutions

Relative throughput (higher is better):
PyTorch / HuggingFace Transformers — baseline (1×)
HuggingFace TGI — 2-3×
vLLM — 24×

Optimized For Your Workload

💬

Chat Applications

Interactive chatbots and assistants. Low latency streaming with high concurrency.

<100ms
TTFT
1000s
Concurrent
📦

Batch Processing

Offline processing of large datasets. Maximum throughput for document analysis.

24×
Throughput
90%
Cost Down
🚀

Production Serving

Scalable API endpoints. OpenAI-compatible for easy integration.

99.9%
Uptime
100%
API Compatible

Docker Deployment

Containerized inference with GPU support and optimized configurations

🐳

Quick Start

Official Docker Image

bash docker run
docker run --gpus all \
  -d \
  --name vllm-server \
  -p 8000:8000 \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-2-7b-chat-hf \
  --max-model-len 4096
⚠️ Requires NVIDIA Container Toolkit
📥 First run downloads model (~14GB for 7B)
🔑 HF_TOKEN needed for gated models
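Once the container is up, it can be queried with nothing but the standard library. The endpoint and payload follow the OpenAI chat-completions schema that vLLM exposes; the model name must match whatever the server loaded:

```python
# Minimal client for a local vLLM server (OpenAI-compatible API).
import json
import urllib.request

def build_chat_request(model: str, user_message: str) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": 64,
    }

def chat(base_url: str, payload: dict) -> dict:
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# e.g. with the container from the quick start running:
# chat("http://localhost:8000",
#      build_chat_request("meta-llama/Llama-2-7b-chat-hf", "Hello!"))
```

For production clients the official `openai` SDK also works unchanged; only the `base_url` needs to point at the vLLM server.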
📦

Production Dockerfile

Optimized for Production

Dockerfile Dockerfile.prod
FROM vllm/vllm-openai:v0.4.0 AS base
WORKDIR /app
ENV VLLM_USAGE_STATS=0 \
    HF_HOME=/app/models \
    TRANSFORMERS_OFFLINE=0 \
    PYTHONUNBUFFERED=1

RUN useradd -m -u 1000 vllm && \
    mkdir -p /app/models && \
    chown -R vllm:vllm /app

USER vllm

HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
  CMD curl -f http://localhost:8000/health || exit 1

EXPOSE 8000

ENTRYPOINT ["python", "-m", "vllm.entrypoints.openai.api_server"]
CMD ["--host", "0.0.0.0", "--port", "8000"]
🎼

Docker Compose Stack

Production-Ready Compose File

yaml docker-compose.yml
version: '3.8'

services:
  vllm-1:
    image: vllm/vllm-openai:latest
    container_name: vllm-server-1
    restart: always
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=0
      - HF_TOKEN=${HF_TOKEN}
      - VLLM_USAGE_STATS=0
    volumes:
      - model-cache:/root/.cache/huggingface
      - ./logs:/app/logs
    ipc: host
    shm_size: '16gb'
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    command:
      - --model=meta-llama/Llama-2-7b-chat-hf
      - --max-model-len=4096
      - --gpu-memory-utilization=0.9
      - --max-num-seqs=256
      - --trust-remote-code
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 300s
    networks:
      - vllm-network

  vllm-2:
    image: vllm/vllm-openai:latest
    container_name: vllm-server-2
    restart: always
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=1
      - HF_TOKEN=${HF_TOKEN}
      - VLLM_USAGE_STATS=0
    volumes:
      - model-cache:/root/.cache/huggingface
      - ./logs:/app/logs
    ipc: host
    shm_size: '16gb'
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    command:
      - --model=meta-llama/Llama-2-7b-chat-hf
      - --max-model-len=4096
      - --gpu-memory-utilization=0.9
      - --max-num-seqs=256
      - --trust-remote-code
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 300s
    networks:
      - vllm-network

  nginx:
    image: nginx:alpine
    container_name: vllm-lb
    restart: always
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      vllm-1:
        condition: service_healthy
      vllm-2:
        condition: service_healthy
    networks:
      - vllm-network

volumes:
  model-cache:
    driver: local

networks:
  vllm-network:
    driver: bridge
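The nginx service above mounts an `./nginx.conf` that is not shown. A minimal configuration for the two replicas might look like the following; the `least_conn` balancing choice and the timeouts are suggestions, not requirements:

```nginx
events {}

http {
    upstream vllm_backends {
        least_conn;                  # route to the least-busy replica
        server vllm-1:8000;
        server vllm-2:8000;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://vllm_backends;
            proxy_http_version 1.1;
            proxy_buffering off;     # required for SSE token streaming
            proxy_read_timeout 3600s;  # long-running generations
        }
    }
}
```

Disabling proxy buffering matters: with buffering on, nginx would hold streamed tokens until the response completes, defeating SSE.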
⚙️

Environment Variables

Configuration Reference

Variable Default Description
PYTORCH_CUDA_ALLOC_CONF - CUDA memory allocator config
VLLM_USAGE_STATS 1 Usage statistics (0=disabled)
HF_TOKEN - HuggingFace access token
CUDA_VISIBLE_DEVICES all GPU device selection
OMP_NUM_THREADS auto OpenMP thread count
NCCL_DEBUG WARN NCCL debug verbosity
RAY_ADDRESS - Ray cluster address
🧮

GPU Memory Guide

Memory Requirements by Model

Model FP16 INT8 INT4 GPU
7B ~16GB ~10GB ~6GB 24GB+
13B ~28GB ~16GB ~10GB 40GB+
34B ~72GB ~40GB ~22GB 80GB+
70B ~150GB ~80GB ~42GB 2×80GB
💡 KV cache adds ~2-4GB per 1000 concurrent tokens
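The table values can be sanity-checked with a back-of-envelope estimate: weights plus KV cache. The KV-cache formula (2 tensors × layers × KV heads × head dim × bytes per value) is standard; the Llama-2-7B shape numbers below are from its public config:

```python
# Back-of-envelope GPU memory estimate for an LLM deployment.

def weights_gb(params_b: float, bytes_per_param: float) -> float:
    """Model weights in GB (params in billions)."""
    return params_b * 1e9 * bytes_per_param / 1e9

def kv_cache_gb(tokens: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_val: int = 2) -> float:
    """KV cache for `tokens` concurrent tokens, FP16 by default."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_val  # K and V
    return tokens * per_token / 1e9

# Llama-2-7B: 32 layers, 32 KV heads, head_dim 128, FP16 weights.
w = weights_gb(7, 2)
kv = kv_cache_gb(tokens=4096, layers=32, kv_heads=32, head_dim=128)
print(f"weights ~{w:.0f} GB, 4096-token KV cache ~{kv:.2f} GB")
# weights ~14 GB, 4096-token KV cache ~2.15 GB
```

This matches the table's ~16GB FP16 figure for a 7B model once activation workspace is added, and shows why the KV cache, not the weights, is what limits concurrency. Models using grouped-query attention (fewer KV heads) shrink the second term substantially.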

Kubernetes Deployment

Enterprise-grade orchestration with GPU scheduling and auto-scaling

☸️

Deployment Manifest

Basic GPU Deployment

yaml deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment
  namespace: vllm
  labels:
    app: vllm
    env: production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: vllm
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '8000'
    spec:
      nodeSelector:
        nvidia.com/gpu: 'true'
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.4.0
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8000
              name: http
              protocol: TCP
          args:
            - --model=meta-llama/Llama-2-7b-chat-hf
            - --host=0.0.0.0
            - --port=8000
            - --gpu-memory-utilization=0.9
            - --max-num-seqs=256
            - --max-model-len=4096
            - --tensor-parallel-size=1
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: token
            - name: VLLM_USAGE_STATS
              value: '0'
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 32Gi
              cpu: '8'
            requests:
              nvidia.com/gpu: 1
              memory: 24Gi
              cpu: '4'
          livenessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 300
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 60
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          volumeMounts:
            - name: shm
              mountPath: /dev/shm
            - name: model-cache
              mountPath: /root/.cache/huggingface
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 16Gi
        - name: model-cache
          persistentVolumeClaim:
            claimName: vllm-model-cache
🌐

Service & Ingress

Network Configuration

yaml service.yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: vllm
spec:
  type: ClusterIP
  selector:
    app: vllm
  ports:
    - name: http
      port: 80
      targetPort: 8000
      protocol: TCP
  sessionAffinity: None
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-ingress
  namespace: vllm
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/ssl-redirect: 'true'
    nginx.ingress.kubernetes.io/proxy-read-timeout: '3600'
    nginx.ingress.kubernetes.io/proxy-send-timeout: '3600'
    nginx.ingress.kubernetes.io/proxy-body-size: 50m
    nginx.ingress.kubernetes.io/proxy-http-version: '1.1'
    nginx.ingress.kubernetes.io/limit-rps: '100'
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - llm-api.example.com
      secretName: vllm-tls-secret
  rules:
    - host: llm-api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vllm-service
                port:
                  number: 80
📈

Auto-Scaling (HPA)

GPU-Aware Scaling

yaml hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: vllm
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-deployment
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL
        target:
          type: AverageValue
          averageValue: '80'
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120

Production Helm Values

Enterprise Configuration

yaml values.yaml
global:
  imagePullSecrets:
    - name: regcred

vllm:
  image:
    repository: vllm/vllm-openai
    tag: v0.4.0
    pullPolicy: IfNotPresent

  model:
    name: meta-llama/Llama-2-13b-chat-hf
    maxModelLen: 4096
    tensorParallelSize: 2
    pipelineParallelSize: 1
    dtype: float16
    quantization: null

  server:
    gpuMemoryUtilization: 0.9
    maxNumSeqs: 256
    maxNumBatchedTokens: 8192
    blockSize: 16
    swapSpace: 4
    trustRemoteCode: true
    enablePrefixCaching: true

  replicaCount: 4

  resources:
    limits:
      nvidia.com/gpu: 2
      memory: 64Gi
      cpu: '16'
    requests:
      nvidia.com/gpu: 2
      memory: 48Gi
      cpu: '8'

  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/gpu.product
                operator: In
                values:
                  - NVIDIA-A100-SXM4-80GB
                  - NVIDIA-A100-SXM4-40GB
                  - NVIDIA-H100-80GB-HBM3

  podDisruptionBudget:
    minAvailable: 2

  priorityClassName: high-priority

service:
  type: ClusterIP
  port: 80
  annotations:
    prometheus.io/scrape: 'true'

ingress:
  enabled: true
  className: nginx
  hosts:
    - host: llm-api.example.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: vllm-tls
      hosts:
        - llm-api.example.com

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 8
  targetCPUUtilizationPercentage: 70

monitoring:
  serviceMonitor:
    enabled: true
    interval: 30s
  grafanaDashboard:
    enabled: true

persistence:
  enabled: true
  storageClass: fast-ssd
  size: 200Gi
  accessMode: ReadWriteMany

Production Topologies

Architecture patterns for every scale and use case

Development

Scenario 1: Single GPU Deployment

One GPU, one model, direct access. Perfect for development and small-scale production.

👥
Clients
HTTP/REST
:8000
🖥️
vLLM Server
API + Scheduler
CUDA
🎮
GPU 0
LLaMA-7B (FP16)

Hardware

GPU 1× A100-40GB
RAM 64GB+
CPU 8+ cores

Performance

Throughput ~2000 tok/s
TTFT <50ms
Concurrency 50-100

Models

FP16 ≤13B
INT4 ≤70B
Docker docker run --gpus all -p 8000:8000 vllm/vllm-openai --model meta-llama/Llama-2-7b-chat-hf
Production

Scenario 2: Tensor Parallel (Multi-GPU)

Multiple GPUs on one node. Essential for large models that don't fit on a single GPU.

👥
Clients
HTTP/REST
:8000
🖥️
vLLM Server (TP=4)
Tensor Parallel Coordinator
NCCL All-Reduce
🎮
GPU 0
Shard 0
🎮
GPU 1
Shard 1
🎮
GPU 2
Shard 2
🎮
GPU 3
Shard 3
LLaMA-70B (FP16)
Distributed across 4× A100-80GB

Hardware

GPUs 4× A100-80GB
Interconnect NVLink
RAM 256GB+

Performance

Throughput ~1500 tok/s
TTFT ~100ms
Concurrency 200-400

Config

TP Size 4
Mem Util 0.9
Docker docker run --gpus all -p 8000:8000 vllm/vllm-openai --model meta-llama/Llama-2-70b-chat-hf --tensor-parallel-size 4
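The core of tensor parallelism is sharding each weight matrix across GPUs. A pure-Python sketch of a column-parallel layer, with plain lists standing in for CUDA tensors and the NCCL collectives:

```python
# Column-parallel matmul: each "GPU" holds a vertical slice of the weight
# matrix, computes its slice of the output independently, and the slices
# are gathered by concatenation. (Row-parallel layers instead combine
# partial sums with an all-reduce.)

def matmul(x, W):
    """x: input vector; W: list of weight columns."""
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in W]

def column_parallel(x, W, tp_size):
    shard = len(W) // tp_size
    shards = [W[i * shard:(i + 1) * shard] for i in range(tp_size)]
    partials = [matmul(x, s) for s in shards]  # one per "GPU", no sync needed
    return [v for p in partials for v in p]    # all-gather (concatenate)

x = [1.0, 2.0]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]  # 4 output columns

# Sharded result is identical to the unsharded one.
assert column_parallel(x, W, tp_size=4) == matmul(x, W)
print(column_parallel(x, W, tp_size=2))  # [1.0, 2.0, 3.0, 0.0]
```

Because each shard's compute is independent, per-GPU memory and FLOPs both drop by the TP factor; the price is the NCCL communication at layer boundaries, which is why NVLink matters at TP=4 and above.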
Enterprise

Scenario 3: High Availability Cluster

Load-balanced replicas for fault tolerance and scale. No single point of failure.

👥
Clients
1000s req/s
HTTPS
⚖️
Load Balancer
Round Robin / Least Conn
Distribute
🖥️
Replica 1
Node 1: 2× GPU
🖥️
Replica 2
Node 2: 2× GPU
🖥️
Replica 3
Node 3: 2× GPU
💾
Model Storage
Shared Model Cache

Availability

Uptime SLA 99.9%
Fault Tolerance N-1
Recovery ~5 min

Performance

Throughput ~6000 tok/s
Latency Same
Concurrency 600-1200

Config

Replicas 3+
PDB 2
Enterprise

Scenario 4: Multi-Model Gateway

Single endpoint serving multiple models. Route by request parameter.

👥
Clients
"model": "llama-70b"
:8000
🔀
API Gateway
Model Router
Route
🦙
LLaMA-70B
4× GPU (TP=4)
Premium
🌪️
Mistral-7B
1× GPU
Standard
💻
CodeLlama-34B
2× GPU (TP=2)
Code

Resources

Total GPUs 7× A100
Models 3

Routing

Model Field "model"
Default Mistral-7B

Use Cases

Analysis LLaMA-70B
Chat Mistral-7B
Code CodeLlama
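The routing layer of this scenario reduces to a lookup on the request's `"model"` field. A minimal sketch; the backend hostnames are hypothetical:

```python
# Minimal model router for a multi-model gateway: pick a backend from the
# request's "model" field, falling back to the smallest model.

BACKENDS = {
    "llama-70b": "http://llama70b:8000",       # 4x GPU (TP=4), premium
    "mistral-7b": "http://mistral7b:8000",     # 1x GPU, standard
    "codellama-34b": "http://codellama:8000",  # 2x GPU (TP=2), code
}
DEFAULT = "mistral-7b"

def route(request: dict) -> str:
    """Return the backend URL for an OpenAI-style request body."""
    model = request.get("model", DEFAULT)
    return BACKENDS.get(model, BACKENDS[DEFAULT])

print(route({"model": "llama-70b"}))  # http://llama70b:8000
print(route({}))                      # default: http://mistral7b:8000
```

In practice this logic lives in the API gateway (or an nginx `map` block), so clients see one endpoint while each model keeps its own dedicated vLLM deployment.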
Hyperscale

Scenario 5: Multi-Node Pipeline Parallel

Largest models across multiple nodes. Pipeline stages + tensor shards.

👥
Clients
API
🎯
Ray Coordinator
Head Node
Pipeline Flow
Node 1 (PP=0) Layers 0-39
G0
G1
G2
G3
G4
G5
G6
G7
IB 400G
Node 2 (PP=1) Layers 40-79
G0
G1
G2
G3
G4
G5
G6
G7
LLaMA-405B / Frontier Model
PP=2, TP=8 → 16× H100-80GB

Hardware

GPUs 16× H100
Network 400Gb/s IB
RAM 4TB+

Parallelism

PP Size 2
TP Size 8
World Size 16

Performance

Throughput ~500 tok/s
TTFT ~500ms
Node 1 ray start --head --port=6379
Node 2 ray start --address=node1:6379
vLLM python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-405B --tensor-parallel-size 8 --pipeline-parallel-size 2

Scenario Comparison

Scenario GPUs Max Model Throughput Availability Complexity Best For
Single GPU 1 13B/70B* 2K None Low Dev/Test
Tensor Parallel 2-8 70B 1.5-3K None Medium Large Models
HA Cluster 6+ 70B 6K+ 99.9% Medium Production
Multi-Model 7+ Mixed Varies Optional High A/B Testing
Pipeline Parallel 16+ 405B+ 500+ Optional High Research

* With INT4 quantization (AWQ/GPTQ)

Production Ready

Additional Features

Beyond PagedAttention, vLLM provides comprehensive features for production LLM deployment including broad model support, quantization, and distributed inference.

🤖

70+ Models

All popular open-source architectures

📉

Quantization

GPTQ, AWQ, SqueezeLLM, INT8, FP8

🔀

Multi-GPU

Tensor parallelism for large models

📡

Streaming

Server-Sent Events for real-time tokens

💾

Prefix Caching

Automatic KV-cache reuse

🖼️

Multimodal

LLaVA and vision-language models

🔧

LoRA Adapters

Efficient fine-tuned adapter serving

🔌

OpenAI API

Drop-in replacement for OpenAI