Three-tier design (HBM, CXL DRAM, and flash, with HBM split into pinned and evictable regions), effective latency formulas, and hit rate analysis.

| Tier | Media | Capacity | Latency |
|---|---|---|---|
| Tier 0: HBM Pinned | GPU HBM | ~5 GB | 100 ns |
| Tier 1: HBM Evictable | GPU HBM | ~37 GB | 100 ns |
| Tier 2: CXL DRAM | Endpoint DDR5 | 1 TB | 250 ns |
| Tier 3: Flash | NVMe SSD | 16 TB | 25 μs |

With a 95% HBM hit rate, a 4.5% CXL hit rate, and 0.5% of accesses served from flash:

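
As a rough worked example, the effective latency for this access mix can be estimated as the hit-rate-weighted sum of the tier latencies in the tier table. The sketch below assumes each access is served by exactly one tier and that both HBM tiers share the 100 ns figure; the names `LATENCY_NS`, `HIT_MIX`, and `effective_latency_ns` are illustrative and do not come from the original design.

```python
# Tier latencies in nanoseconds, taken from the tier table (25 μs = 25,000 ns).
LATENCY_NS = {"hbm": 100.0, "cxl": 250.0, "flash": 25_000.0}

# Fraction of accesses served by each tier (assumed to sum to 1).
HIT_MIX = {"hbm": 0.95, "cxl": 0.045, "flash": 0.005}


def effective_latency_ns(mix, latency_ns):
    """Return the hit-rate-weighted average access latency in nanoseconds."""
    total = sum(mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"hit fractions must sum to 1, got {total}")
    return sum(mix[tier] * latency_ns[tier] for tier in mix)


if __name__ == "__main__":
    # Show how much each tier contributes to the average.
    for tier in HIT_MIX:
        print(f"{tier:>5}: {HIT_MIX[tier] * LATENCY_NS[tier]:8.2f} ns")
    print(f"total: {effective_latency_ns(HIT_MIX, LATENCY_NS):8.2f} ns")
```

Under these assumptions the average works out to about 231 ns. The 0.5% of accesses that reach flash contribute roughly 125 ns of that, more than the 95 ns from HBM hits, which is why the hit rate matters as much as the raw tier latencies.
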
| Technique | Contribution | Cumulative hit rate |
|---|---|---|
| LRU baseline | — | 70% |
| + Anchor pinning | +8% | 78% |
| + EMA scoring | +7% | 85% |
| + Per-head tracking | +6% | 91% |
| + RoPE prefetch | +4% | 95% |
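
To get a feel for what each row of this table is worth in latency terms, the sketch below rebuilds the cumulative hit rates from the listed contributions and estimates an effective latency at each stage. It assumes, hypothetically, that HBM misses always split 90% to CXL and 10% to flash, which is the ratio implied by the 4.5% / 0.5% mix at a 95% hit rate. The source does not say how misses are distributed at lower hit rates, so the latency figures are illustrative only, and all function and constant names are made up for this example.

```python
# Contributions in percentage points, copied from the table above.
BASELINE_PCT = 70
STEPS_PCT = [
    ("Anchor pinning", 8),
    ("EMA scoring", 7),
    ("Per-head tracking", 6),
    ("RoPE prefetch", 4),
]

# Tier latencies in nanoseconds from the tier table (25 μs = 25,000 ns).
HBM_NS, CXL_NS, FLASH_NS = 100.0, 250.0, 25_000.0

# ASSUMPTION (not from the source): 90% of HBM misses hit CXL, the rest go to flash.
CXL_SHARE_OF_MISSES = 0.9


def cumulative_hit_rates(baseline_pct, steps_pct):
    """Return [(technique, cumulative hit rate in percent), ...]."""
    rows = [("LRU baseline", baseline_pct)]
    total = baseline_pct
    for name, gain in steps_pct:
        total += gain
        rows.append((f"+ {name}", total))
    return rows


def estimated_latency_ns(hbm_hit_pct, cxl_share=CXL_SHARE_OF_MISSES):
    """Estimate average latency for a given HBM hit rate under the assumed miss split."""
    hit = hbm_hit_pct / 100.0
    miss = 1.0 - hit
    miss_cost = cxl_share * CXL_NS + (1.0 - cxl_share) * FLASH_NS
    return hit * HBM_NS + miss * miss_cost


if __name__ == "__main__":
    for name, pct in cumulative_hit_rates(BASELINE_PCT, STEPS_PCT):
        print(f"{name:<20} {pct:3d}%  ~{estimated_latency_ns(pct):7.1f} ns")
```

With this assumed miss split, going from the 70% LRU baseline to the full 95% takes the estimated latency from roughly 890 ns down to roughly 230 ns. A different split between CXL and flash would change the absolute numbers but not the direction of the trend.
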