Grouped Query Attention

Balancing KV-cache memory efficiency with model quality

The three variants differ in how query (Q) heads map to key (K) and value (V) heads:

MHA (Multi-Head Attention)
Every query head (Q₀-Q₃) has its own key and value head (K₀-K₃, V₀-V₃).
KV cache: 8 heads (4 K + 4 V).

GQA (Grouped Query Attention)
Query heads share key/value heads in groups: Q₀-Q₁ use K₀/V₀, and Q₂-Q₃ use K₁/V₁.
KV cache: 4 heads (2 K + 2 V).

MQA (Multi-Query Attention)
All four query heads share a single key head and a single value head.
KV cache: 2 heads (1 K + 1 V).
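The three cache footprints above differ only in the number of K/V heads that must be stored per layer. A minimal sketch, using the diagram's 4-query-head configuration:

```python
def kv_cache_heads(n_kv_heads: int) -> int:
    """Cached heads per layer: one K head and one V head per KV head."""
    return 2 * n_kv_heads

print(kv_cache_heads(4))  # MHA: n_kv = n_heads = 4 → 8
print(kv_cache_heads(2))  # GQA: n_kv = 2          → 4
print(kv_cache_heads(1))  # MQA: n_kv = 1          → 2
```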
How GQA Works
The input hidden states are projected by WQ, WK, and WV into per-head queries, keys, and values:

  Q₀ Q₁ Q₂ Q₃   (n_heads = 4)
  K₀ K₁         (n_kv = 2)
  V₀ V₁         (n_kv = 2)

Query heads are then assigned to KV groups:

  Group 0: Q₀, Q₁ → K₀, V₀
  Group 1: Q₂, Q₃ → K₁, V₁

Each query head attends over its group's shared K/V, and the per-head outputs are concatenated to form the attention output.
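The grouping step above reduces to a simple index mapping: consecutive query heads share one KV head. A sketch in plain Python (assuming, as in most GQA implementations, that n_kv evenly divides n_heads):

```python
def kv_head_for(q_head: int, n_heads: int, n_kv: int) -> int:
    """Return the KV head index that query head `q_head` attends with."""
    group_size = n_heads // n_kv  # query heads per KV head
    return q_head // group_size

# GQA as in the diagram: 4 query heads, 2 KV heads
print([kv_head_for(q, 4, 2) for q in range(4)])  # → [0, 0, 1, 1]
# MHA is the special case n_kv = n_heads (one KV head per query head);
# MQA is the special case n_kv = 1 (all queries share KV head 0).
print([kv_head_for(q, 4, 4) for q in range(4)])  # → [0, 1, 2, 3]
print([kv_head_for(q, 4, 1) for q in range(4)])  # → [0, 0, 0, 0]
```

Viewed this way, GQA interpolates between MHA and MQA by choosing n_kv between 1 and n_heads.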
Why GQA Matters
💾
Smaller KV Cache
Shrinks the cache by a factor of n_heads / n_kv_heads, which is critical for long-context inference.
⚡
Faster Decoding
Less K/V data to load from HBM per generated token, which directly improves throughput in the memory-bound decode phase.
🎯
Quality Preserved
GQA tracks MHA quality far more closely than MQA does. Llama 2 70B uses GQA with 8 KV heads serving 64 query heads.
KV Cache Size (per sequence) = 2 × n_layers × n_kv_heads × seq_len × head_dim × dtype_size, where the factor of 2 accounts for storing both K and V.
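As a worked example of this formula, consider Llama 2 70B's published shape (80 layers, 8 KV heads, head_dim 128) in fp16 (2 bytes per element) at a 4096-token context:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, seq_len: int,
                   head_dim: int, dtype_size: int = 2) -> int:
    """KV cache size per sequence; 2x for K and V, dtype_size=2 assumes fp16."""
    return 2 * n_layers * n_kv_heads * seq_len * head_dim * dtype_size

gqa = kv_cache_bytes(80, 8, 4096, 128)    # Llama 2 70B's actual GQA config
mha = kv_cache_bytes(80, 64, 4096, 128)   # hypothetical MHA variant (64 KV heads)
print(gqa / 2**30, mha / 2**30)  # → 1.25 10.0 (GiB)
```

GQA with 8 KV heads cuts the cache from 10 GiB to 1.25 GiB per sequence, an 8x reduction matching the 64/8 grouping ratio.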