A comprehensive guide to memory-efficient large language model serving using CXL-based intelligent memory endpoints with hardware-accelerated cache management.
High-level overview of the memory wall problem in LLM inference and the proposed intelligent endpoint architecture solution.
The von Neumann bottleneck, why the KV-cache matters for LLM serving, and the scope of this guide.
LLM inference fundamentals, the bandwidth-compute gap, and limitations of current approaches like vLLM and tensor parallelism.
Core architecture concepts including memory controller offloading, CXL 3.0, and system topology for single and multi-node deployments.
Two-tier cache model analysis, latency formulas, and the projected 65× latency improvement over traditional PCIe data paths.
CXL switch topologies for linear bandwidth scaling, layer-prefetch strategies, and practical bandwidth calculations.
Offloading tokenization, image preprocessing, and format conversion to ARM cores in the endpoints for a 5–10× latency reduction.
Per-head tracking, EMA-based attention scoring, RoPE-aware prefetching, and an eviction priority function achieving 97% cache hit rates.
Handling Mixture-of-Experts models with routing histograms, activation tracking, and adaptive caching strategies.
Memory mapping, hint interfaces, driver-to-firmware translation, and fault handling for seamless GPU access.
Analysis of commercial CXL products, software frameworks, recent research, and competitive differentiation.
TTFT improvements, compute-bound vs. I/O-bound analysis, the impact of continuous batching, and asymptotic speedup models.
Hardware requirements, the software stack, driver modifications, firmware development, and deployment scenarios.
Summary of contributions, key takeaways, and future research directions for next-generation LLM infrastructure.