Section 7

MoE Routing Support

Mixture-of-experts with intelligent endpoint prefetch

🧠 Mixture-of-Experts Architecture

[Diagram] Input Token → Learned Router (selects K experts per token) → experts E1✓ E2 E3✓ E4 E5 E6 E7 E8
N experts total, K activated per token | Example: N=8, K=2
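The routing step in the diagram can be sketched as follows. The softmax-then-top-K selection is the standard MoE routing pattern; the dimensions, the random router weights, and the function name `route_tokens` are illustrative (a real router matrix is learned during training).

```python
import numpy as np

def route_tokens(tokens, router_weights, k=2):
    """Select the top-k experts for each token via a linear router.

    tokens:         (T, d) token representations
    router_weights: (d, N) routing matrix, N = number of experts
    Returns (T, k) expert indices and (T, k) normalized gate weights.
    """
    logits = tokens @ router_weights            # (T, N) routing scores
    topk = np.argsort(logits, axis=1)[:, -k:]   # indices of the k largest scores
    gates = np.take_along_axis(logits, topk, axis=1)
    gates = np.exp(gates - gates.max(axis=1, keepdims=True))
    gates /= gates.sum(axis=1, keepdims=True)   # softmax over the selected experts
    return topk, gates

# Example: N=8 experts, K=2 activated per token, as in the diagram
rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 16))           # 8 tokens, d=16 (illustrative)
W = rng.standard_normal((16, 8))                # untrained stand-in for a learned router
experts, gates = route_tokens(tokens, W, k=2)
print(experts.shape, gates.shape)               # (8, 2) (8, 2)
```

Each row of `experts` is the data-dependent set of experts that must be resident before that token can be processed.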
⚠ The Challenge: Data-Dependent Access
Expert activation depends on input content. The router learns which experts handle which types of inputs, which means access patterns are:

• Data-dependent — can't predict without seeing input
• Irregular — no fixed pattern to exploit
• Sparse — only K of N experts active
[Figure] Expert Activation Pattern (8 tokens) — rows = tokens, columns = experts; orange cells = activated.
💡 Endpoint Approach: Routing Histogram

[Figure] Expert Activation Frequency histogram (E1–E8) with a P(act) threshold line.

Track activation frequency per token position or context type.
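One way to maintain such a histogram is an exponential moving average over observed routing decisions, so recent traffic dominates the estimate. This is a minimal sketch; the class name, the per-expert (rather than per-position) granularity, and the decay constant are illustrative choices, not a prescribed implementation.

```python
import numpy as np

class RoutingHistogram:
    """Estimate per-expert activation frequency from observed routing decisions."""

    def __init__(self, num_experts, decay=0.99):
        self.p_act = np.zeros(num_experts)  # estimated P(activation) per expert
        self.decay = decay                  # EMA decay; higher = slower adaptation

    def update(self, activated_experts):
        """activated_experts: iterable of expert indices chosen for one token."""
        hit = np.zeros_like(self.p_act)
        hit[list(activated_experts)] = 1.0
        self.p_act = self.decay * self.p_act + (1 - self.decay) * hit

# Simulate a workload where E1 and E3 (indices 0 and 2) are consistently hot
hist = RoutingHistogram(num_experts=8)
for _ in range(1000):
    hist.update([0, 2])
print(hist.p_act.round(2))  # hot experts approach 1.0, cold experts stay near 0
```

Because the estimate decays, an expert that stops being activated gradually falls below the prefetch threshold and can be evicted.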

Prefetch Strategy
if P(activation) > threshold:
→ Prefetch expert to endpoint DRAM
else:
→ Keep in flash, fetch on demand
Histogram updates continuously based on observed routing decisions. Hot experts stay resident, cold experts stay in flash.
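The threshold rule maps directly onto a placement decision over the histogram. In this sketch the 0.3 threshold and the example frequencies are hypothetical; the split itself is the rule stated above.

```python
def plan_placement(p_act, threshold=0.3):
    """Split experts into DRAM-resident (prefetch) and flash-resident (on-demand) sets."""
    prefetch = [e for e, p in enumerate(p_act) if p > threshold]
    on_demand = [e for e, p in enumerate(p_act) if p <= threshold]
    return prefetch, on_demand

# Hypothetical activation frequencies for E1..E8
p_act = [0.9, 0.05, 0.7, 0.1, 0.02, 0.4, 0.01, 0.15]
hot, cold = plan_placement(p_act)
print(hot)   # experts prefetched to endpoint DRAM -> [0, 2, 5]
print(cold)  # experts left in flash, fetched on demand
```

Re-running `plan_placement` as the histogram updates keeps hot experts resident while cold experts stay in flash.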
💾 Cache Sizing Strategy
✓ Optimal: Full Replication
If endpoint DRAM can hold all experts with replication, no prefetch needed. Route requests to endpoint with local expert copy.
Zero expert fetch latency
⚡ Fallback: Histogram Prefetch
If DRAM < all experts, use histogram to prefetch likely experts. Cold experts fetched from flash on demand.
Expected fetch latency = P(hit) × 0 + P(miss) × flash_latency
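The fallback's expected per-activation fetch latency follows directly from the histogram and the resident set. The flash latency value and the example numbers below are illustrative, not measured.

```python
def expected_fetch_latency(p_act, resident, flash_latency_us=50.0):
    """Expected fetch latency per activation: hits cost 0, misses cost flash_latency.

    p_act:    per-expert activation frequencies (need not sum to 1; normalized here)
    resident: set of expert indices held in endpoint DRAM
    """
    total = sum(p_act)
    p_miss = sum(p for e, p in enumerate(p_act) if e not in resident) / total
    return p_miss * flash_latency_us  # P(hit) * 0 + P(miss) * flash_latency

# Hypothetical frequencies for E1..E8, with E1/E3/E6 resident in DRAM
p_act = [0.9, 0.05, 0.7, 0.1, 0.02, 0.4, 0.01, 0.15]
lat = expected_fetch_latency(p_act, resident={0, 2, 5})
print(lat)  # average microseconds of flash fetch per expert activation
```

Under full replication the resident set covers every expert, so P(miss) = 0 and the expression collapses to zero latency, matching the optimal case above.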
Route to Endpoint with Local Expert

[Diagram] Request needing E1, E3 → Endpoint A (holds E1, E3, E5), not Endpoint B (holds E2, E4, E6).
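Under replication, routing a request to whichever endpoint best covers its required experts could look like the sketch below. The endpoint names and expert placements mirror the diagram; the coverage-maximizing tiebreak is an assumed policy, not a specified one.

```python
def pick_endpoint(needed, endpoints):
    """Choose the endpoint whose local DRAM covers the most of the needed experts.

    needed:    set of expert indices the request activates
    endpoints: dict of endpoint name -> set of locally resident expert indices
    """
    return max(endpoints, key=lambda name: len(needed & endpoints[name]))

endpoints = {
    "Endpoint A": {1, 3, 5},  # placements as in the diagram
    "Endpoint B": {2, 4, 6},
}
print(pick_endpoint({1, 3}, endpoints))  # Endpoint A holds both E1 and E3
```

If no endpoint fully covers the request, the chosen endpoint fetches the missing experts from flash, paying the miss latency derived above.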