This section covers scaled dot-product attention, the multi-head structure, and grouped-query attention (GQA).
Steps of scaled dot-product attention:
1. Project the input into queries Q, keys K, and values V.
2. Compute scores = QKᵀ / √d_head (the scaling keeps score magnitudes stable as d_head grows).
3. Apply softmax over the key dimension to get attention weights.
4. Output the weighted sum of the values V.
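The steps above can be sketched in NumPy; shapes and dimensions here are toy values chosen for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (seq_q, d_head), K and V: (seq_k, d_head)
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # (seq_q, seq_k) similarity scores
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # (seq_q, d_head) weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

A causal mask (setting scores above the diagonal to -inf before the softmax) would be added for autoregressive decoding; it is omitted here to keep the sketch minimal.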
Instead of a single attention head, multi-head attention (MHA) runs several attention heads in parallel:
For Llama-70B: 64 query heads, each with d_head = 128 (so d_model = 64 × 128 = 8192).
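The multi-head split is just a reshape of the model dimension; a small sketch with toy dimensions standing in for Llama-70B's 64 heads × d_head = 128:

```python
import numpy as np

# Toy stand-ins for n_heads=64, d_head=128, d_model=8192.
n_heads, d_head, seq = 4, 8, 5
d_model = n_heads * d_head

x = np.random.default_rng(1).standard_normal((seq, d_model))

# Split the model dimension into parallel heads:
# (seq, d_model) -> (n_heads, seq, d_head)
heads = x.reshape(seq, n_heads, d_head).transpose(1, 0, 2)

# Each head runs scaled dot-product attention independently on its
# (seq, d_head) slice; concatenating the head outputs restores d_model.
merged = heads.transpose(1, 0, 2).reshape(seq, d_model)
assert np.allclose(merged, x)
```

In a real layer, learned projection matrices produce Q, K, and V before the split and a final output projection follows the merge; the reshape pattern is the same.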
GQA reduces KV-cache by sharing K/V heads across multiple query heads:
| Type | Q heads | KV heads | KV cache per token per layer |
|---|---|---|---|
| MHA | 64 | 64 | 16 KB |
| GQA (Llama-70B) | 64 | 8 | 2 KB |
| MQA | 64 | 1 | 256 B |

Sizes are 2 × (KV heads) × d_head bytes, assuming 1 byte per K/V element (e.g. an 8-bit KV cache); double them for FP16.
Sharing 8 KV heads across 64 query heads gives GQA an 8× KV-cache reduction relative to MHA, with minimal quality loss.
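GQA can be implemented by broadcasting each K/V head to its group of query heads before attention (PyTorch implementations often call this `repeat_kv`); a minimal NumPy sketch using the Llama-70B head counts with a toy d_head:

```python
import numpy as np

# n_q query heads share n_kv K/V heads; Llama-70B uses n_q=64, n_kv=8.
n_q, n_kv, d_head, seq = 64, 8, 4, 3
group = n_q // n_kv  # 8 query heads per KV head

rng = np.random.default_rng(2)
q = rng.standard_normal((n_q, seq, d_head))
k = rng.standard_normal((n_kv, seq, d_head))  # only n_kv K/V heads are cached
v = rng.standard_normal((n_kv, seq, d_head))

# Broadcast each KV head to its group of query heads.
k_full = np.repeat(k, group, axis=0)  # (n_q, seq, d_head)
v_full = np.repeat(v, group, axis=0)

scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(d_head)
w = np.exp(scores - scores.max(-1, keepdims=True))
w /= w.sum(-1, keepdims=True)
out = w @ v_full  # (n_q, seq, d_head) -- full query-head output, 1/8 the cache

# KV-cache bytes per token per layer at 1 byte/element (matches the table):
d_head_llama = 128
print(2 * n_kv * d_head_llama)  # K+V: 2 * 8 * 128 = 2048 B = 2 KB
```

Only the `n_kv` K/V heads are stored in the cache; the repeat happens at compute time, which is where the 8× memory saving comes from.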