Section 4: Effective Latency Analysis

Two-tier cache: endpoint DRAM and endpoint flash

Effective Latency Formula
L_eff = hit_rate_dram × L_dram + hit_rate_flash × L_flash + miss_rate × L_recompute
L_dram: DRAM access latency via the CXL.mem path (~150–300 ns)
L_flash: Flash access latency, NVMe behind the endpoint controller (~10–20 μs)
L_recompute: Cost of regenerating evicted KV entries (~50–200 ms)
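The formula can be sketched as a small Python helper; the default latencies below are midpoints of the ranges quoted above, used purely for illustration:

```python
def effective_latency_ns(hit_dram, hit_flash, miss,
                         l_dram_ns=250.0,        # CXL.mem DRAM access (~150-300 ns)
                         l_flash_ns=15_000.0,    # endpoint NVMe flash (~10-20 us)
                         l_recompute_ns=100e6):  # KV recompute (~50-200 ms)
    """Hit-rate-weighted average latency across the three tiers."""
    assert abs(hit_dram + hit_flash + miss - 1.0) < 1e-9, "rates must sum to 1"
    return (hit_dram * l_dram_ns
            + hit_flash * l_flash_ns
            + miss * l_recompute_ns)
```

For example, `effective_latency_ns(0.85, 0.14, 0.01)` evaluates to 1,002,312.5 ns, about 1 ms.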
Comparison to PCIe Baseline

PCIe swap path latency components:

🚨 Page fault
🔧 Driver intervention
📋 DMA setup
📡 PCIe transfer
🔓 Completion interrupt
PCIe Path
~13 μs
fault + driver + DMA setup + transfer + IRQ
CXL.mem Direct
~250 ns
load/store memory semantics
≈52× Lower Latency (13 μs / 250 ns ≈ 52)
Under Load, PCIe Latency Degrades Further
Queue depth: +50–200%
Interrupt coalescing: +10–50 μs
OS scheduling: +100 μs jitter
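These load effects can be folded into a rough model of the loaded PCIe path (the function name and the additive/multiplicative treatment of each factor are assumptions for illustration, not from the source):

```python
def pcie_loaded_latency_us(base_us=13.0, queue_factor=0.5,
                           coalescing_us=10.0, sched_jitter_us=100.0):
    # base_us: unloaded PCIe swap path (~13 us)
    # queue_factor: fractional inflation from queue depth (0.5-2.0, i.e. +50-200%)
    # coalescing_us: added interrupt-coalescing delay (10-50 us)
    # sched_jitter_us: worst-case OS scheduling jitter (~100 us)
    return base_us * (1.0 + queue_factor) + coalescing_us + sched_jitter_us
```

Even the mild case (13 × 1.5 + 10 + 100 = 129.5 μs) sits three orders of magnitude above the ~250 ns CXL.mem path; the heavy case reaches 189 μs.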
🧮
Example Calculation: Llama-70B, 128K context
DRAM hit rate: 85%
Flash hit rate: 14%
Miss rate: 1%
Substituting Values
L_eff = (0.85 × 250 ns) + (0.14 × 15 μs) + (0.01 × 100 ms)
L_eff = 212.5 ns + 2,100 ns + 1,000,000 ns
L_eff ≈ 1,002,312 ns
Effective Latency
≈ 1.0 ms
💡
Recompute dominates. Even at just 1% miss rate, recompute contributes 99.8% of total latency. This is why intelligent caching that prevents misses is critical.
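The dominance claim checks out numerically; a quick sketch, using the same illustrative midpoint latencies as the calculation above:

```python
# Per-tier contributions to L_eff, in nanoseconds.
terms = {
    "dram":      0.85 * 250,      # 212.5 ns
    "flash":     0.14 * 15_000,   # 2,100 ns
    "recompute": 0.01 * 100e6,    # 1,000,000 ns
}
total = sum(terms.values())
shares = {tier: contrib / total for tier, contrib in terms.items()}
print(round(shares["recompute"], 4))  # ~0.9977, i.e. ~99.8% of total latency
```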
Intelligent Caching Prevents Thrashing

The endpoint's intelligent caching prevents thrashing: high-value entries are never evicted, eliminating the miss cascades that cause queue buildup. EMA-based scoring keeps frequently accessed KV entries in the fast DRAM tier while cold entries gracefully demote to flash.
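The source doesn't specify the scoring mechanism beyond "EMA-based"; a minimal sketch of how such a scorer might drive tier placement follows (the class name, alpha, and threshold are all hypothetical choices, not the endpoint's actual parameters):

```python
class EMAScorer:
    """Track per-entry access frequency as an exponential moving average.
    Entries whose score stays above the threshold remain in DRAM; the
    rest become candidates for demotion to flash. Illustrative only."""

    def __init__(self, alpha=0.2, dram_threshold=0.5):
        self.alpha = alpha                  # EMA smoothing factor (hypothetical)
        self.dram_threshold = dram_threshold
        self.scores = {}                    # entry_id -> EMA of access indicator

    def record_tick(self, accessed, tracked):
        # Each scoring interval: accessed entries move toward 1.0,
        # idle entries decay toward 0.0.
        for eid in tracked:
            hit = 1.0 if eid in accessed else 0.0
            prev = self.scores.get(eid, 0.0)
            self.scores[eid] = (1 - self.alpha) * prev + self.alpha * hit

    def tier(self, eid):
        return "dram" if self.scores.get(eid, 0.0) >= self.dram_threshold else "flash"
```

After ten intervals in which only one entry is touched, its score reaches 1 − 0.8^10 ≈ 0.89 and it stays in DRAM, while an untouched entry's score decays toward zero and it demotes to flash.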