Understanding why recent tokens receive more attention than distant ones
[Figure: attention weights when generating the 1000th token]
Since roughly 80% of attention mass goes to the most recent ~10% of tokens, we can keep recent tokens' KV entries in fast HBM and move older tokens' entries to slower CXL-attached memory. Most accesses then hit the fast tier, so the average access latency stays close to the HBM latency rather than the CXL latency.