© 2025 Subramaniyam (Sam) Pooni
All Rights Reserved
Proprietary & Confidential
Appendix A

Transformer Architecture Fundamentals

Layer structure, parameter counts, and memory footprint analysis for Llama-class models.

A.1 Layer Structure

A transformer decoder block consists of two sublayers executed sequentially: grouped-query self-attention and a gated feed-forward network. In Llama-class models each sublayer is preceded by RMSNorm and wrapped in a residual connection:

Figure A.1 — Transformer Layer Structure

    Input (d_model = 8192)
        ↓
    Self-Attention
        ↓ + residual
    Feed-Forward Network
        ↓ + residual
    Output (d_model = 8192)

    × 80 layers for Llama-70B
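The dataflow in Figure A.1 can be traced at the shape level. A minimal sketch in pure Python, where shape tuples stand in for tensors (the 2048 sequence length is a hypothetical example; real implementations use a tensor library):

```python
# Shape-tracing sketch of the Llama-70B decoder stack (Figure A.1).
# Each "op" maps an activation shape (seq_len, features) to its output shape.

D_MODEL = 8192
N_LAYERS = 80

def self_attention(shape):
    seq, d = shape
    assert d == D_MODEL
    # Q/K/V projections and attention mix across positions;
    # W_O projects the result back to d_model, so shape is preserved.
    return (seq, D_MODEL)

def ffn(shape):
    seq, d = shape
    assert d == D_MODEL
    # gate/up expand to d_ffn = 28672, down projects back to d_model.
    return (seq, D_MODEL)

def decoder_layer(shape):
    shape = self_attention(shape)  # + residual (shape-preserving)
    shape = ffn(shape)             # + residual (shape-preserving)
    return shape

x = (2048, D_MODEL)                # hypothetical sequence length of 2048
for _ in range(N_LAYERS):          # 80 stacked layers
    x = decoder_layer(x)
print(x)                           # (2048, 8192)
```

Because every sublayer maps back to d_model, layers compose freely: depth changes the parameter count but never the activation width.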

A.2 Model Dimensions

Parameter               Llama-7B    Llama-70B
Layers (L)              32          80
Hidden dim (d_model)    4096        8192
FFN dim (d_ffn)         11008       28672
Query heads (n_heads)   32          64
KV heads (n_kv)         32          8
Head dim (d_head)       128         128
Vocab size              32000       32000
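The table can be expressed directly in code with a consistency check. A small sketch (values copied from Table A.2) verifying that d_model = n_heads × d_head holds for both models, and showing how grouped-query attention shrinks the KV projection width:

```python
# Table A.2 as Python dicts, with sanity checks on the head geometry.
CONFIGS = {
    "llama-7b":  dict(layers=32, d_model=4096, d_ffn=11008,
                      n_heads=32, n_kv=32, d_head=128, vocab=32000),
    "llama-70b": dict(layers=80, d_model=8192, d_ffn=28672,
                      n_heads=64, n_kv=8, d_head=128, vocab=32000),
}

for name, c in CONFIGS.items():
    # Query projection output width always equals the hidden dim.
    assert c["d_model"] == c["n_heads"] * c["d_head"]
    # KV projection width is n_kv * d_head; for 70B this is only 1024
    # (8 KV heads shared across 64 query heads).
    print(name, "KV projection width:", c["n_kv"] * c["d_head"])
```

Note that only the 70B model uses grouped-query attention (n_kv < n_heads); for 7B the KV width equals d_model.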

A.3 Parameter Count Derivation

Per-Layer Parameters

Attention:
  W_Q: d × (n_heads × d_head) = 8192 × 8192 ≈ 67M
  W_K: d × (n_kv × d_head) = 8192 × 1024 ≈ 8.4M
  W_V: d × (n_kv × d_head) = 8192 × 1024 ≈ 8.4M
  W_O: (n_heads × d_head) × d = 8192 × 8192 ≈ 67M
FFN:
  W_gate: d × d_ffn = 8192 × 28672 ≈ 235M
  W_up: d × d_ffn = 8192 × 28672 ≈ 235M
  W_down: d_ffn × d = 28672 × 8192 ≈ 235M
Total:
  Per layer: 67 + 8.4 + 8.4 + 67 + 235 + 235 + 235 ≈ 856M

Total Model Parameters

Embedding:
  vocab × d = 32000 × 8192 ≈ 262M
Output projection (Llama does not tie the LM head to the embedding):
  vocab × d = 32000 × 8192 ≈ 262M
Layers:
  80 × 856M ≈ 68.5B
Total:
  262M + 262M + 68.5B ≈ 69B parameters
  (RMSNorm weights add only ~1.3M in total and are omitted.)
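The full derivation can be re-run in a few lines. A sketch using the Llama-70B dimensions from Table A.2, counting the untied LM head alongside the input embedding:

```python
# Recompute the A.3 parameter count for Llama-70B.
d, d_ffn = 8192, 28672
n_heads, n_kv, d_head = 64, 8, 128
n_layers, vocab = 80, 32000

attn = d * (n_heads * d_head)        # W_Q
attn += 2 * d * (n_kv * d_head)      # W_K, W_V (grouped-query: 8 KV heads)
attn += (n_heads * d_head) * d       # W_O
ffn = 3 * d * d_ffn                  # W_gate, W_up, W_down
per_layer = attn + ffn               # ≈ 856M

total = 2 * vocab * d + n_layers * per_layer  # embed + LM head + layers
print(f"per layer: {per_layer / 1e6:.0f}M")   # per layer: 856M
print(f"total: {total / 1e9:.1f}B")           # total: 69.0B
```

Exact values are 855,638,016 per layer and 68,975,329,280 total, which round to the 856M and 69B figures above.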

A.4 Memory Footprint by Precision

Precision     Bytes/Param   Model Size   Notes
FP32          4             280 GB       Training
BF16/FP16     2             140 GB       Inference (typical)
INT8          1             70 GB        Quantized
INT4          0.5           35 GB        Aggressive quantization
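Each row is simply bytes-per-parameter times the nominal 70B parameter count (the table rounds up from 69B). A sketch that reproduces the Model Size column; note this covers weights only, since KV cache and activations add to the footprint at inference time:

```python
# Model weight footprint at each precision, using the nominal 70B count.
N_PARAMS = 70e9
PRECISIONS = {"FP32": 4, "BF16/FP16": 2, "INT8": 1, "INT4": 0.5}

sizes_gb = {name: N_PARAMS * bytes_per_param / 1e9   # decimal GB, as in table
            for name, bytes_per_param in PRECISIONS.items()}

for name, gb in sizes_gb.items():
    print(f"{name}: {gb:.0f} GB")   # FP32: 280 GB ... INT4: 35 GB
```

The same arithmetic explains why INT4 quantization is attractive for single-node serving: 35 GB fits on one 40 GB or 80 GB accelerator, while BF16 requires at least two 80 GB devices.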