This section covers scaled dot-product attention, the multi-head structure, and grouped-query attention (GQA).
Steps of scaled dot-product attention:
1. Project the input into queries Q, keys K, and values V.
2. Compute scores = QKᵀ / √d_head (the scaling keeps score magnitudes stable as d_head grows).
3. Apply softmax over the key dimension to get attention weights.
4. Output the weighted sum of the values V.
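The steps above can be sketched in NumPy; shapes and dimensions here are toy values chosen for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (seq_q, d_head), K and V: (seq_k, d_head)
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # (seq_q, seq_k) similarity scores
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # (seq_q, d_head) weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

A causal mask (setting scores above the diagonal to -inf before the softmax) would be added for autoregressive decoding; it is omitted here to keep the sketch minimal.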
Instead of a single attention head, multi-head attention (MHA) runs several attention heads in parallel:
For Llama-70B: 64 query heads, each with d_head = 128 (so d_model = 64 × 128 = 8192).
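The multi-head split is just a reshape of the model dimension; a small sketch with toy dimensions standing in for Llama-70B's 64 heads × d_head = 128:

```python
import numpy as np

# Toy stand-ins for n_heads=64, d_head=128, d_model=8192.
n_heads, d_head, seq = 4, 8, 5
d_model = n_heads * d_head

x = np.random.default_rng(1).standard_normal((seq, d_model))

# Split the model dimension into parallel heads:
# (seq, d_model) -> (n_heads, seq, d_head)
heads = x.reshape(seq, n_heads, d_head).transpose(1, 0, 2)

# Each head runs scaled dot-product attention independently on its
# (seq, d_head) slice; concatenating the head outputs restores d_model.
merged = heads.transpose(1, 0, 2).reshape(seq, d_model)
assert np.allclose(merged, x)
```

In a real layer, learned projection matrices produce Q, K, and V before the split and a final output projection follows the merge; the reshape pattern is the same.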
GQA reduces KV-cache by sharing K/V heads across multiple query heads:
| Type | Q heads | KV heads | KV cache per token per layer |
|---|---|---|---|
| MHA | 64 | 64 | 16 KB |
| GQA (Llama-70B) | 64 | 8 | 2 KB |
| MQA | 64 | 1 | 256 B |

Sizes are 2 × (KV heads) × d_head bytes, assuming 1 byte per K/V element (e.g. an 8-bit KV cache); double them for FP16.
Sharing 8 KV heads across 64 query heads gives GQA an 8× KV-cache reduction relative to MHA, with minimal quality loss.
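GQA can be implemented by broadcasting each K/V head to its group of query heads before attention (PyTorch implementations often call this `repeat_kv`); a minimal NumPy sketch using the Llama-70B head counts with a toy d_head:

```python
import numpy as np

# n_q query heads share n_kv K/V heads; Llama-70B uses n_q=64, n_kv=8.
n_q, n_kv, d_head, seq = 64, 8, 4, 3
group = n_q // n_kv  # 8 query heads per KV head

rng = np.random.default_rng(2)
q = rng.standard_normal((n_q, seq, d_head))
k = rng.standard_normal((n_kv, seq, d_head))  # only n_kv K/V heads are cached
v = rng.standard_normal((n_kv, seq, d_head))

# Broadcast each KV head to its group of query heads.
k_full = np.repeat(k, group, axis=0)  # (n_q, seq, d_head)
v_full = np.repeat(v, group, axis=0)

scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(d_head)
w = np.exp(scores - scores.max(-1, keepdims=True))
w /= w.sum(-1, keepdims=True)
out = w @ v_full  # (n_q, seq, d_head) -- full query-head output, 1/8 the cache

# KV-cache bytes per token per layer at 1 byte/element (matches the table):
d_head_llama = 128
print(2 * n_kv * d_head_llama)  # K+V: 2 * 8 * 128 = 2048 B = 2 KB
```

Only the `n_kv` K/V heads are stored in the cache; the repeat happens at compute time, which is where the 8× memory saving comes from.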