Why the KV-cache is the critical bottleneck in LLM inference and what it costs the industry.
Modern LLMs face a fundamental constraint: the KV-cache stores key and value activations for every layer, attention head, and token, so it grows linearly with context length and batch size, rapidly exhausting GPU memory.
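To make the scaling concrete, here is a rough back-of-the-envelope sketch of KV-cache size; the model dimensions (32 layers, 32 KV heads, head dimension 128, fp16) are illustrative assumptions for a 7B-class decoder, not figures from this article:

```python
# Back-of-the-envelope KV-cache size estimate.
# All model dimensions below are assumed for illustration; substitute
# your model's actual config (layers, KV heads, head dim, dtype).

def kv_cache_bytes(
    num_layers: int = 32,     # transformer layers (assumed)
    num_kv_heads: int = 32,   # key/value heads (no GQA assumed)
    head_dim: int = 128,      # per-head dimension (assumed)
    seq_len: int = 4096,      # tokens currently held in the cache
    batch_size: int = 1,
    bytes_per_elem: int = 2,  # fp16 / bf16
) -> int:
    # Factor of 2 covers one K tensor and one V tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

if __name__ == "__main__":
    for ctx in (4_096, 32_768, 131_072):
        gib = kv_cache_bytes(seq_len=ctx) / 2**30
        print(f"{ctx:>7} tokens -> {gib:6.1f} GiB per sequence")
```

Under these assumptions a single sequence costs about 2 GiB of cache at 4K tokens and roughly 64 GiB at 128K tokens, which illustrates the linear growth: a handful of long-context requests in a batch can exceed the memory of an 80 GB accelerator before weights are even counted.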