Section 7

MoE Routing Support

Mixture-of-experts with intelligent endpoint prefetch

🧠 Mixture-of-Experts Architecture

[Diagram] Input Token → Learned Router (selects K experts per token) → experts E1✓ E2 E3✓ E4 E5 E6 E7 E8
N experts total, K activated per token | Example: N=8, K=2
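The routing step in the diagram can be sketched as follows. The softmax-then-top-K selection is the standard MoE routing pattern; the dimensions, the random router weights, and the function name `route_tokens` are illustrative (a real router matrix is learned during training).

```python
import numpy as np

def route_tokens(tokens, router_weights, k=2):
    """Select the top-k experts for each token via a linear router.

    tokens:         (T, d) token representations
    router_weights: (d, N) routing matrix, N = number of experts
    Returns (T, k) expert indices and (T, k) normalized gate weights.
    """
    logits = tokens @ router_weights            # (T, N) routing scores
    topk = np.argsort(logits, axis=1)[:, -k:]   # indices of the k largest scores
    gates = np.take_along_axis(logits, topk, axis=1)
    gates = np.exp(gates - gates.max(axis=1, keepdims=True))
    gates /= gates.sum(axis=1, keepdims=True)   # softmax over the selected experts
    return topk, gates

# Example: N=8 experts, K=2 activated per token, as in the diagram
rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 16))           # 8 tokens, d=16 (illustrative)
W = rng.standard_normal((16, 8))                # untrained stand-in for a learned router
experts, gates = route_tokens(tokens, W, k=2)
print(experts.shape, gates.shape)               # (8, 2) (8, 2)
```

Each row of `experts` is the data-dependent set of experts that must be resident before that token can be processed.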
⚠ The Challenge: Data-Dependent Access
Expert activation depends on input content. The router learns which experts handle which types of inputs, which means access patterns are:

• Data-dependent — can't predict without seeing input
• Irregular — no fixed pattern to exploit
• Sparse — only K of N experts active
[Figure] Expert Activation Pattern (8 tokens) — rows = tokens, columns = experts; orange cells = activated.
💡 Endpoint Approach: Routing Histogram

[Figure] Expert Activation Frequency histogram (E1–E8) with a P(act) threshold line.

Track activation frequency per token position or context type.
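One way to maintain such a histogram is an exponential moving average over observed routing decisions, so recent traffic dominates the estimate. This is a minimal sketch; the class name, the per-expert (rather than per-position) granularity, and the decay constant are illustrative choices, not a prescribed implementation.

```python
import numpy as np

class RoutingHistogram:
    """Estimate per-expert activation frequency from observed routing decisions."""

    def __init__(self, num_experts, decay=0.99):
        self.p_act = np.zeros(num_experts)  # estimated P(activation) per expert
        self.decay = decay                  # EMA decay; higher = slower adaptation

    def update(self, activated_experts):
        """activated_experts: iterable of expert indices chosen for one token."""
        hit = np.zeros_like(self.p_act)
        hit[list(activated_experts)] = 1.0
        self.p_act = self.decay * self.p_act + (1 - self.decay) * hit

# Simulate a workload where E1 and E3 (indices 0 and 2) are consistently hot
hist = RoutingHistogram(num_experts=8)
for _ in range(1000):
    hist.update([0, 2])
print(hist.p_act.round(2))  # hot experts approach 1.0, cold experts stay near 0
```

Because the estimate decays, an expert that stops being activated gradually falls below the prefetch threshold and can be evicted.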

Prefetch Strategy
if P(activation) > threshold:
→ Prefetch expert to endpoint DRAM
else:
→ Keep in flash, fetch on demand
Histogram updates continuously based on observed routing decisions. Hot experts stay resident, cold experts stay in flash.
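The threshold rule maps directly onto a placement decision over the histogram. In this sketch the 0.3 threshold and the example frequencies are hypothetical; the split itself is the rule stated above.

```python
def plan_placement(p_act, threshold=0.3):
    """Split experts into DRAM-resident (prefetch) and flash-resident (on-demand) sets."""
    prefetch = [e for e, p in enumerate(p_act) if p > threshold]
    on_demand = [e for e, p in enumerate(p_act) if p <= threshold]
    return prefetch, on_demand

# Hypothetical activation frequencies for E1..E8
p_act = [0.9, 0.05, 0.7, 0.1, 0.02, 0.4, 0.01, 0.15]
hot, cold = plan_placement(p_act)
print(hot)   # experts prefetched to endpoint DRAM -> [0, 2, 5]
print(cold)  # experts left in flash, fetched on demand
```

Re-running `plan_placement` as the histogram updates keeps hot experts resident while cold experts stay in flash.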
💾 Cache Sizing Strategy
✓ Optimal: Full Replication
If endpoint DRAM can hold all experts with replication, no prefetch needed. Route requests to endpoint with local expert copy.
Zero expert fetch latency
⚡ Fallback: Histogram Prefetch
If DRAM < all experts, use histogram to prefetch likely experts. Cold experts fetched from flash on demand.
Expected fetch latency = P(hit) × 0 + P(miss) × flash_latency
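The fallback's expected per-activation fetch latency follows directly from the histogram and the resident set. The flash latency value and the example numbers below are illustrative, not measured.

```python
def expected_fetch_latency(p_act, resident, flash_latency_us=50.0):
    """Expected fetch latency per activation: hits cost 0, misses cost flash_latency.

    p_act:    per-expert activation frequencies (need not sum to 1; normalized here)
    resident: set of expert indices held in endpoint DRAM
    """
    total = sum(p_act)
    p_miss = sum(p for e, p in enumerate(p_act) if e not in resident) / total
    return p_miss * flash_latency_us  # P(hit) * 0 + P(miss) * flash_latency

# Hypothetical frequencies for E1..E8, with E1/E3/E6 resident in DRAM
p_act = [0.9, 0.05, 0.7, 0.1, 0.02, 0.4, 0.01, 0.15]
lat = expected_fetch_latency(p_act, resident={0, 2, 5})
print(lat)  # average microseconds of flash fetch per expert activation
```

Under full replication the resident set covers every expert, so P(miss) = 0 and the expression collapses to zero latency, matching the optimal case above.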
Route to Endpoint with Local Expert

[Diagram] Request needing E1, E3 → Endpoint A (holds E1, E3, E5), not Endpoint B (holds E2, E4, E6).
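Under replication, routing a request to whichever endpoint best covers its required experts could look like the sketch below. The endpoint names and expert placements mirror the diagram; the coverage-maximizing tiebreak is an assumed policy, not a specified one.

```python
def pick_endpoint(needed, endpoints):
    """Choose the endpoint whose local DRAM covers the most of the needed experts.

    needed:    set of expert indices the request activates
    endpoints: dict of endpoint name -> set of locally resident expert indices
    """
    return max(endpoints, key=lambda name: len(needed & endpoints[name]))

endpoints = {
    "Endpoint A": {1, 3, 5},  # placements as in the diagram
    "Endpoint B": {2, 4, 6},
}
print(pick_endpoint({1, 3}, endpoints))  # Endpoint A holds both E1 and E3
```

If no endpoint fully covers the request, the chosen endpoint fetches the missing experts from flash, paying the miss latency derived above.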