© 2025 Subramaniyam (Sam) Pooni
All Rights Reserved
Proprietary & Confidential
Technical Reference Documentation

Distributed Endpoint Architecture for
KV-Cache Offloading in LLM Inference

A comprehensive guide to memory-efficient large language model serving using CXL-based intelligent memory endpoints with hardware-accelerated cache management.

Memory Expansion: 16× User Capacity · 97% HBM Hit Rate · 36% Cost Reduction
💡 The Innovation Gap: What Nobody Has Built Yet
Per-head tracking • Attention-aware eviction • RoPE prefetch • Controller intelligence → Read the Pitch
📖 Chapters
CH 0 Executive Summary

High-level overview of the memory wall problem in LLM inference and the proposed intelligent endpoint architecture solution.

Problem Statement Key Results Architecture Overview
CH 1 Introduction

The von Neumann bottleneck, why KV-cache matters for LLM serving, and the scope of this technical documentation.

Memory Bottleneck KV-Cache Importance Document Scope
CH 2 Background

LLM inference fundamentals, the bandwidth-compute gap, and limitations of current approaches like vLLM and tensor parallelism.

Prefill vs Decode Memory Wall Current Solutions
CH 3 Distributed Endpoint Architecture

Core architecture concepts including memory controller offloading, CXL 3.0, and system topology for single and multi-node deployments.

CXL 3.0 Endpoint Design System Topology
CH 4 Effective Latency Analysis

Two-tier cache model analysis, latency formulas, and the 65× improvement over traditional PCIe paths.

Cache Tiers Latency Formula PCIe Comparison
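The two-tier model Chapter 4 analyzes can be sketched as an expected-latency calculation. The latencies below (HBM ~100 ns, CXL endpoint ~600 ns, PCIe/NVMe fallback ~50 µs) are illustrative assumptions for this sketch, not the chapter's measured values; the 97% hit rate is the document's headline metric.

```python
def effective_latency(hit_rate, t_fast_ns, t_slow_ns):
    """Expected access latency for a two-tier cache:
    E[t] = h * t_fast + (1 - h) * t_slow."""
    return hit_rate * t_fast_ns + (1.0 - hit_rate) * t_slow_ns

# Illustrative (assumed) tier latencies in nanoseconds.
cxl_path  = effective_latency(0.97, 100, 600)     # ~115 ns effective
pcie_path = effective_latency(0.97, 100, 50_000)  # ~1597 ns effective
```

The same formula with the chapter's own tier numbers is what yields its headline improvement over the traditional PCIe path.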
CH 5 Bandwidth Aggregation

CXL switch topology for linear bandwidth scaling, layer prefetch strategies, and practical bandwidth calculations.

Switch Topology Prefetch Strategy Bandwidth Math
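The bandwidth math Chapter 5 works through reduces to two lines: aggregate bandwidth scales linearly with endpoint count behind a non-blocking switch, and layer prefetch time is layer size over that aggregate. The endpoint count, per-link rate, and layer size below are assumed for illustration.

```python
def aggregate_bandwidth_gbs(num_endpoints, per_link_gbs):
    # Linear-scaling assumption: a non-blocking CXL switch lets all
    # endpoints stream concurrently with no shared bottleneck.
    return num_endpoints * per_link_gbs

def layer_prefetch_time_ms(layer_bytes, bandwidth_gbs):
    # Time to stage one transformer layer's weights/KV blocks.
    return layer_bytes / (bandwidth_gbs * 1e9) * 1e3

# Assumed: 4 endpoints at 64 GB/s each, prefetching a 2 GiB layer.
bw = aggregate_bandwidth_gbs(4, 64.0)       # 256 GB/s aggregate
t  = layer_prefetch_time_ms(2 * 2**30, bw)  # ~8.4 ms per layer
```

Overlapping this prefetch with the compute of the preceding layer is what hides the transfer entirely.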
CH 6 Preprocessing Offload

Offloading tokenization, image processing, and format conversion to ARM cores in the endpoints, for a 5–10× reduction in preprocessing latency.

Tokenization ARM Cores Data Conversion
CH 7 Intelligent KV-Cache Management

Per-head tracking, EMA-based attention scoring, RoPE-aware prefetch, and the eviction priority function achieving 97% hit rates.

Per-Head Tracking EMA Scoring RoPE Prefetch
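Chapter 7's per-head tracking and EMA-based attention scoring can be sketched as follows. The decay factor and the evict-lowest-score policy are assumptions for illustration; the chapter's actual eviction priority function also folds in recency and prefetch signals.

```python
class HeadCacheTracker:
    """Per-head KV-block tracking with EMA attention scoring (sketch)."""

    def __init__(self, alpha=0.9):
        self.alpha = alpha   # assumed EMA decay factor
        self.score = {}      # (layer, head, block) -> EMA of attention mass

    def observe(self, key, attention_mass):
        # Exponential moving average: recent attention dominates,
        # but history is not forgotten outright.
        prev = self.score.get(key, 0.0)
        self.score[key] = self.alpha * prev + (1 - self.alpha) * attention_mass

    def eviction_victim(self):
        # Lowest EMA score = least-attended KV block = first candidate
        # to demote from HBM to the CXL endpoint tier.
        return min(self.score, key=self.score.get)
```

A controller maintaining this state per head, rather than per sequence, is what lets eviction decisions follow the attention pattern at fine granularity.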
CH 8 MoE Routing Support

Handling Mixture-of-Experts models with routing histograms, activation tracking, and adaptive caching strategies.

Expert Selection Histogram Tracking Cache Strategy
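The routing-histogram idea in Chapter 8 amounts to counting expert activations and pinning the hot set in the fast tier. This sketch uses a simple top-k policy; the chapter's adaptive strategy is more involved, and the names here are hypothetical.

```python
from collections import Counter

class ExpertRouterStats:
    """Routing histogram over MoE expert activations (sketch)."""

    def __init__(self):
        self.hist = Counter()  # expert id -> activation count

    def record(self, expert_ids):
        # Called once per token batch with the router's selected experts.
        self.hist.update(expert_ids)

    def hot_experts(self, k):
        # Most frequently routed experts: candidates to keep resident
        # in HBM, while cold experts live on the endpoint.
        return [e for e, _ in self.hist.most_common(k)]
```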
CH 9 GPU Integration

Memory mapping, hint interfaces, driver-to-firmware translation, and fault handling for seamless GPU access.

Memory Mapping Hint API Fault Handling
CH 10 Market Landscape

Analysis of commercial CXL products, software frameworks, recent research, and competitive differentiation.

Samsung CMM vLLM/Mooncake Differentiation
CH 11 Performance Analysis

TTFT improvements, compute vs IO-bound analysis, continuous batching impact, and asymptotic speedup models.

TTFT Analysis Crossover Points Speedup Models
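The asymptotic speedup models in Chapter 11 follow the usual Amdahl-style form: only the IO-bound fraction of decode benefits from faster memory. The 60% IO-bound fraction and 10× IO speedup below are assumed numbers for illustration, not the chapter's results.

```python
def asymptotic_speedup(io_fraction, io_speedup):
    """Amdahl-style model: speedup = 1 / ((1 - f) + f / s).
    As io_speedup -> infinity, speedup -> 1 / (1 - io_fraction),
    which sets the crossover where decode becomes compute-bound."""
    return 1.0 / ((1.0 - io_fraction) + io_fraction / io_speedup)

# Assumed: decode is 60% IO-bound, endpoint accelerates IO 10x.
overall = asymptotic_speedup(0.6, 10)  # ~2.17x end-to-end
```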
CH 12 Implementation Considerations

Hardware requirements, software stack, driver modifications, firmware development, and deployment scenarios.

Hardware Reqs Software Stack Deployment
CH 13 Conclusion

Summary of contributions, key takeaways, and future research directions for next-generation LLM infrastructure.

Contributions Key Takeaways Future Work
📚 Technical Appendix