Section 4: Effective Latency Analysis

Two-tier cache: endpoint DRAM and endpoint flash

Effective Latency Formula
L_eff = hit_rate_dram × L_dram + hit_rate_flash × L_flash + miss_rate × L_recompute
L_dram: DRAM access latency via the CXL.mem path (~150–300 ns)
L_flash: Flash access latency, NVMe behind the endpoint controller (~10–20 μs)
L_recompute: Cost of regenerating evicted KV entries (~50–200 ms)
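The formula can be sketched as a small Python helper; the default latencies below are midpoints of the ranges quoted above, used purely for illustration:

```python
def effective_latency_ns(hit_dram, hit_flash, miss,
                         l_dram_ns=250.0,        # CXL.mem DRAM access (~150-300 ns)
                         l_flash_ns=15_000.0,    # endpoint NVMe flash (~10-20 us)
                         l_recompute_ns=100e6):  # KV recompute (~50-200 ms)
    """Hit-rate-weighted average latency across the three tiers."""
    assert abs(hit_dram + hit_flash + miss - 1.0) < 1e-9, "rates must sum to 1"
    return (hit_dram * l_dram_ns
            + hit_flash * l_flash_ns
            + miss * l_recompute_ns)
```

For example, `effective_latency_ns(0.85, 0.14, 0.01)` evaluates to 1,002,312.5 ns, about 1 ms.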
Comparison to PCIe Baseline

PCIe swap path latency components:

🚨 Page fault
🔧 Driver intervention
📋 DMA setup
📡 PCIe transfer
🔓 Completion interrupt
PCIe Path
~13 μs
fault + driver + DMA setup + transfer + IRQ
CXL.mem Direct
~250 ns
load/store memory semantics
≈52× Lower Latency (13 μs / 250 ns ≈ 52)
Under Load, PCIe Latency Degrades Further
Queue depth: +50–200%
Interrupt coalescing: +10–50 μs
OS scheduling: +100 μs jitter
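These load effects can be folded into a rough model of the loaded PCIe path (the function name and the additive/multiplicative treatment of each factor are assumptions for illustration, not from the source):

```python
def pcie_loaded_latency_us(base_us=13.0, queue_factor=0.5,
                           coalescing_us=10.0, sched_jitter_us=100.0):
    # base_us: unloaded PCIe swap path (~13 us)
    # queue_factor: fractional inflation from queue depth (0.5-2.0, i.e. +50-200%)
    # coalescing_us: added interrupt-coalescing delay (10-50 us)
    # sched_jitter_us: worst-case OS scheduling jitter (~100 us)
    return base_us * (1.0 + queue_factor) + coalescing_us + sched_jitter_us
```

Even the mild case (13 × 1.5 + 10 + 100 = 129.5 μs) sits three orders of magnitude above the ~250 ns CXL.mem path; the heavy case reaches 189 μs.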
🧮
Example Calculation: Llama-70B, 128K context
DRAM hit rate: 85%
Flash hit rate: 14%
Miss rate: 1%
Substituting Values
L_eff = (0.85 × 250 ns) + (0.14 × 15 μs) + (0.01 × 100 ms)
L_eff = 212.5 ns + 2,100 ns + 1,000,000 ns
L_eff ≈ 1,002,312 ns
Effective Latency
≈ 1.0 ms
💡
Recompute dominates. Even at just 1% miss rate, recompute contributes 99.8% of total latency. This is why intelligent caching that prevents misses is critical.
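The dominance claim checks out numerically; a quick sketch, using the same illustrative midpoint latencies as the calculation above:

```python
# Per-tier contributions to L_eff, in nanoseconds.
terms = {
    "dram":      0.85 * 250,      # 212.5 ns
    "flash":     0.14 * 15_000,   # 2,100 ns
    "recompute": 0.01 * 100e6,    # 1,000,000 ns
}
total = sum(terms.values())
shares = {tier: contrib / total for tier, contrib in terms.items()}
print(round(shares["recompute"], 4))  # ~0.9977, i.e. ~99.8% of total latency
```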
Intelligent Caching Prevents Thrashing

The endpoint's intelligent caching prevents thrashing: high-value entries are never evicted, eliminating the miss cascades that cause queue buildup. EMA-based scoring keeps frequently accessed KV entries in the fast DRAM tier while cold entries gracefully demote to flash.
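The source doesn't specify the scoring mechanism beyond "EMA-based"; a minimal sketch of how such a scorer might drive tier placement follows (the class name, alpha, and threshold are all hypothetical choices, not the endpoint's actual parameters):

```python
class EMAScorer:
    """Track per-entry access frequency as an exponential moving average.
    Entries whose score stays above the threshold remain in DRAM; the
    rest become candidates for demotion to flash. Illustrative only."""

    def __init__(self, alpha=0.2, dram_threshold=0.5):
        self.alpha = alpha                  # EMA smoothing factor (hypothetical)
        self.dram_threshold = dram_threshold
        self.scores = {}                    # entry_id -> EMA of access indicator

    def record_tick(self, accessed, tracked):
        # Each scoring interval: accessed entries move toward 1.0,
        # idle entries decay toward 0.0.
        for eid in tracked:
            hit = 1.0 if eid in accessed else 0.0
            prev = self.scores.get(eid, 0.0)
            self.scores[eid] = (1 - self.alpha) * prev + self.alpha * hit

    def tier(self, eid):
        return "dram" if self.scores.get(eid, 0.0) >= self.dram_threshold else "flash"
```

After ten intervals in which only one entry is touched, its score reaches 1 − 0.8^10 ≈ 0.89 and it stays in DRAM, while an untouched entry's score decays toward zero and it demotes to flash.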