© 2025 Subramaniyam (Sam) Pooni
All Rights Reserved
Proprietary & Confidential
Appendix B

Attention Mechanism Deep Dive

Scaled dot-product attention, multi-head structure, and grouped-query attention (GQA).

B.1 Scaled Dot-Product Attention

Attention(Q, K, V) = softmax(QK^T / √d_k) V

Steps:

  1. Score: QK^T computes the similarity between every query and every key
  2. Scale: Divide by √d_k so the logits don't grow with dimension and push softmax into saturated regions where gradients vanish
  3. Normalize: Softmax converts the scores into a probability distribution over keys
  4. Aggregate: Compute a weighted sum of the values using those probabilities
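The four steps above can be sketched directly in NumPy. This is a minimal, unbatched illustration; the shapes and random inputs are chosen only for demonstration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    # 1. Score: similarity between each query and each key
    scores = Q @ K.swapaxes(-2, -1)
    # 2. Scale: keep logits in a range where softmax gradients stay healthy
    scores = scores / np.sqrt(d_k)
    # 3. Normalize: numerically stable softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # 4. Aggregate: weighted sum of values
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 64))   # 4 query positions, d_k = 64
K = rng.normal(size=(6, 64))   # 6 key positions
V = rng.normal(size=(6, 64))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 64): one output vector per query position
```

Each row of `w` sums to 1, so every output is a convex combination of the value vectors.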

B.2 Multi-Head Attention

Instead of a single attention function, we run multiple attention heads in parallel, each operating in its own learned subspace:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O

For Llama-70B: 64 query heads, each with d_head = 128 (model dimension 8192 = 64 × 128).
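A minimal NumPy sketch of the split-attend-concat-project pattern. The weight matrices and toy dimensions below are illustrative (not Llama-70B scale); the per-head projections are realized here by reshaping one combined projection, a common equivalent formulation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    # x: (seq, d_model); Wq/Wk/Wv/Wo: (d_model, d_model)
    seq, d_model = x.shape
    d_head = d_model // n_heads

    def project_and_split(W):
        # Project, then split the model dimension into n_heads heads
        return (x @ W).reshape(seq, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = (project_and_split(W) for W in (Wq, Wk, Wv))
    weights = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_head))
    heads = weights @ V                              # (n_heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model)
    return concat @ Wo                               # Concat(heads) W^O

rng = np.random.default_rng(0)
seq, d_model, n_heads = 5, 32, 4
x = rng.normal(size=(seq, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
y = multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads)
print(y.shape)  # (5, 32): same shape as the input sequence
```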

B.3 Grouped-Query Attention (GQA)

GQA reduces the KV cache by sharing each K/V head across a group of query heads:

Type               Q Heads   KV Heads   KV per token
MHA                64        64         16 KB/layer
GQA (Llama-70B)    64        8          2 KB/layer
MQA                64        1          256 B/layer
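The table's figures can be reproduced with simple arithmetic. Following the table's convention, this counts one fp16 tensor of shape (kv_heads × d_head) per token per layer:

```python
def kv_bytes_per_token_per_layer(kv_heads, d_head=128, bytes_per_elem=2):
    # One fp16 tensor of shape (kv_heads, d_head), per token, per layer
    return kv_heads * d_head * bytes_per_elem

mha = kv_bytes_per_token_per_layer(64)  # 16384 B = 16 KB
gqa = kv_bytes_per_token_per_layer(8)   # 2048 B  = 2 KB
mqa = kv_bytes_per_token_per_layer(1)   # 256 B
print(mha // gqa)  # 8: the KV-cache reduction from MHA to GQA
```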

GQA achieves 8× KV-cache reduction with minimal quality loss.
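The sharing itself can be sketched by broadcasting each KV head across its group of query heads. Head counts below are toy values chosen for illustration, not Llama-70B's 64/8 configuration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def grouped_query_attention(Q, K, V):
    # Q: (q_heads, seq, d_head); K, V: (kv_heads, seq, d_head)
    q_heads, kv_heads = Q.shape[0], K.shape[0]
    group = q_heads // kv_heads      # query heads sharing each KV head
    # Replicate each KV head across its group of query heads
    K = np.repeat(K, group, axis=0)
    V = np.repeat(V, group, axis=0)
    d_head = Q.shape[-1]
    weights = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_head))
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(8, 5, 16))   # 8 query heads
K = rng.normal(size=(2, 5, 16))   # only 2 KV heads are cached
V = rng.normal(size=(2, 5, 16))
out = grouped_query_attention(Q, K, V)
print(out.shape)  # (8, 5, 16): full output for all 8 query heads
```

Only the 2 KV heads are stored in the cache; the replication happens at compute time, which is where the memory saving comes from.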