© 2025 Subramaniyam (Sam) Pooni
All Rights Reserved
Proprietary & Confidential
Appendix A

Transformer Architecture Fundamentals

Layer structure, parameter counts, and memory footprint analysis for Llama-class models.

A.1 Layer Structure

A transformer decoder block consists of two sublayers executed sequentially: grouped-query self-attention and a gated feed-forward network. In Llama-class models each sublayer is preceded by RMSNorm and wrapped in a residual connection:

Figure A.1 — Transformer Layer Structure

    Input (d_model = 8192)
        ↓
    Self-Attention
        ↓ + residual
    Feed-Forward Network
        ↓ + residual
    Output (d_model = 8192)

    × 80 layers for Llama-70B
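The dataflow in Figure A.1 can be traced at the shape level. A minimal sketch in pure Python, where shape tuples stand in for tensors (the 2048 sequence length is a hypothetical example; real implementations use a tensor library):

```python
# Shape-tracing sketch of the Llama-70B decoder stack (Figure A.1).
# Each "op" maps an activation shape (seq_len, features) to its output shape.

D_MODEL = 8192
N_LAYERS = 80

def self_attention(shape):
    seq, d = shape
    assert d == D_MODEL
    # Q/K/V projections and attention mix across positions;
    # W_O projects the result back to d_model, so shape is preserved.
    return (seq, D_MODEL)

def ffn(shape):
    seq, d = shape
    assert d == D_MODEL
    # gate/up expand to d_ffn = 28672, down projects back to d_model.
    return (seq, D_MODEL)

def decoder_layer(shape):
    shape = self_attention(shape)  # + residual (shape-preserving)
    shape = ffn(shape)             # + residual (shape-preserving)
    return shape

x = (2048, D_MODEL)                # hypothetical sequence length of 2048
for _ in range(N_LAYERS):          # 80 stacked layers
    x = decoder_layer(x)
print(x)                           # (2048, 8192)
```

Because every sublayer maps back to d_model, layers compose freely: depth changes the parameter count but never the activation width.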

A.2 Model Dimensions

Parameter               Llama-7B    Llama-70B
Layers (L)              32          80
Hidden dim (d_model)    4096        8192
FFN dim (d_ffn)         11008       28672
Query heads (n_heads)   32          64
KV heads (n_kv)         32          8
Head dim (d_head)       128         128
Vocab size              32000       32000
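The table can be expressed directly in code with a consistency check. A small sketch (values copied from Table A.2) verifying that d_model = n_heads × d_head holds for both models, and showing how grouped-query attention shrinks the KV projection width:

```python
# Table A.2 as Python dicts, with sanity checks on the head geometry.
CONFIGS = {
    "llama-7b":  dict(layers=32, d_model=4096, d_ffn=11008,
                      n_heads=32, n_kv=32, d_head=128, vocab=32000),
    "llama-70b": dict(layers=80, d_model=8192, d_ffn=28672,
                      n_heads=64, n_kv=8, d_head=128, vocab=32000),
}

for name, c in CONFIGS.items():
    # Query projection output width always equals the hidden dim.
    assert c["d_model"] == c["n_heads"] * c["d_head"]
    # KV projection width is n_kv * d_head; for 70B this is only 1024
    # (8 KV heads shared across 64 query heads).
    print(name, "KV projection width:", c["n_kv"] * c["d_head"])
```

Note that only the 70B model uses grouped-query attention (n_kv < n_heads); for 7B the KV width equals d_model.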

A.3 Parameter Count Derivation

Per-Layer Parameters

Attention:
  W_Q: d × (n_heads × d_head) = 8192 × 8192 ≈ 67M
  W_K: d × (n_kv × d_head) = 8192 × 1024 ≈ 8.4M
  W_V: d × (n_kv × d_head) = 8192 × 1024 ≈ 8.4M
  W_O: (n_heads × d_head) × d = 8192 × 8192 ≈ 67M
FFN:
  W_gate: d × d_ffn = 8192 × 28672 ≈ 235M
  W_up: d × d_ffn = 8192 × 28672 ≈ 235M
  W_down: d_ffn × d = 28672 × 8192 ≈ 235M
Total:
  Per layer: 67 + 8.4 + 8.4 + 67 + 235 + 235 + 235 ≈ 856M

Total Model Parameters

Embedding:
  vocab × d = 32000 × 8192 ≈ 262M
Output projection (Llama does not tie the LM head to the embedding):
  vocab × d = 32000 × 8192 ≈ 262M
Layers:
  80 × 856M ≈ 68.5B
Total:
  262M + 262M + 68.5B ≈ 69B parameters
  (RMSNorm weights add only ~1.3M in total and are omitted.)
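The full derivation can be re-run in a few lines. A sketch using the Llama-70B dimensions from Table A.2, counting the untied LM head alongside the input embedding:

```python
# Recompute the A.3 parameter count for Llama-70B.
d, d_ffn = 8192, 28672
n_heads, n_kv, d_head = 64, 8, 128
n_layers, vocab = 80, 32000

attn = d * (n_heads * d_head)        # W_Q
attn += 2 * d * (n_kv * d_head)      # W_K, W_V (grouped-query: 8 KV heads)
attn += (n_heads * d_head) * d       # W_O
ffn = 3 * d * d_ffn                  # W_gate, W_up, W_down
per_layer = attn + ffn               # ≈ 856M

total = 2 * vocab * d + n_layers * per_layer  # embed + LM head + layers
print(f"per layer: {per_layer / 1e6:.0f}M")   # per layer: 856M
print(f"total: {total / 1e9:.1f}B")           # total: 69.0B
```

Exact values are 855,638,016 per layer and 68,975,329,280 total, which round to the 856M and 69B figures above.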

A.4 Memory Footprint by Precision

Precision     Bytes/Param   Model Size   Notes
FP32          4             280 GB       Training
BF16/FP16     2             140 GB       Inference (typical)
INT8          1             70 GB        Quantized
INT4          0.5           35 GB        Aggressive quantization
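Each row is simply bytes-per-parameter times the nominal 70B parameter count (the table rounds up from 69B). A sketch that reproduces the Model Size column; note this covers weights only, since KV cache and activations add to the footprint at inference time:

```python
# Model weight footprint at each precision, using the nominal 70B count.
N_PARAMS = 70e9
PRECISIONS = {"FP32": 4, "BF16/FP16": 2, "INT8": 1, "INT4": 0.5}

sizes_gb = {name: N_PARAMS * bytes_per_param / 1e9   # decimal GB, as in table
            for name, bytes_per_param in PRECISIONS.items()}

for name, gb in sizes_gb.items():
    print(f"{name}: {gb:.0f} GB")   # FP32: 280 GB ... INT4: 35 GB
```

The same arithmetic explains why INT4 quantization is attractive for single-node serving: 35 GB fits on one 40 GB or 80 GB accelerator, while BF16 requires at least two 80 GB devices.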