Layer structure, parameter counts, and memory footprint analysis for Llama-class models.
A transformer decoder block consists of two main components, executed sequentially:

1. A multi-head self-attention sub-layer (grouped-query attention in Llama-70B, where 64 query heads share 8 KV heads).
2. A position-wise feed-forward network (a gated SwiGLU MLP in Llama).

Each sub-layer is preceded by RMSNorm and wrapped in a residual connection.
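The sketch below, assuming PyTorch, illustrates how the two sub-layers compose in a single pre-norm block. It is a simplified illustration, not the exact Llama implementation: `torch.nn.MultiheadAttention` stands in for Llama's rotary-embedding, grouped-query attention, and KV caching and causal-mask construction are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization, as used in Llama (no mean subtraction, no bias)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class DecoderBlock(nn.Module):
    """One pre-norm decoder block: self-attention sub-layer, then FFN sub-layer."""
    def __init__(self, d_model: int = 4096, n_heads: int = 32, d_ffn: int = 11008):
        super().__init__()
        self.attn_norm = RMSNorm(d_model)
        # Simplification: standard MHA instead of Llama's RoPE + grouped-query attention.
        self.attn = nn.MultiheadAttention(d_model, n_heads, bias=False, batch_first=True)
        self.ffn_norm = RMSNorm(d_model)
        # Llama-style gated (SwiGLU) FFN: three weight matrices, no biases.
        self.w_gate = nn.Linear(d_model, d_ffn, bias=False)
        self.w_up = nn.Linear(d_model, d_ffn, bias=False)
        self.w_down = nn.Linear(d_ffn, d_model, bias=False)

    def forward(self, x, attn_mask=None):
        # Sub-layer 1: normalize, attend, add residual.
        h = self.attn_norm(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + attn_out
        # Sub-layer 2: normalize, gated FFN, add residual.
        h = self.ffn_norm(x)
        return x + self.w_down(F.silu(self.w_gate(h)) * self.w_up(h))

# Example: one block applied to a batch of one 16-token sequence at 7B width.
y = DecoderBlock()(torch.randn(1, 16, 4096))
```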
The key architectural hyperparameters of the two configurations:

| Parameter | Llama-7B | Llama-70B |
|---|---|---|
| Layers (L) | 32 | 80 |
| Hidden dim (d_model) | 4096 | 8192 |
| FFN dim (d_ffn) | 11008 | 28672 |
| Query heads (n_heads) | 32 | 64 |
| KV heads (n_kv) | 32 | 8 |
| Head dim (d_head) | 128 | 128 |
| Vocab size | 32000 | 32000 |
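As a sanity check, these hyperparameters can be turned into a parameter count. The helper below is a sketch assuming the standard Llama layout: untied input embedding and output head, bias-free projections, a three-matrix SwiGLU FFN, and two RMSNorms per block.

```python
def param_count(L, d_model, d_ffn, n_heads, n_kv, d_head, vocab):
    attn = d_model * (n_heads * d_head)     # query projection
    attn += 2 * d_model * (n_kv * d_head)   # key and value projections (smaller under GQA)
    attn += (n_heads * d_head) * d_model    # output projection
    ffn = 3 * d_model * d_ffn               # gate, up, and down projections
    norms = 2 * d_model                     # two RMSNorms per block
    embeddings = 2 * vocab * d_model        # input embedding + output head (untied)
    return L * (attn + ffn + norms) + embeddings + d_model  # + final RMSNorm

print(param_count(32, 4096, 11008, 32, 32, 128, 32000))  # 6,738,415,616  (~6.7B)
print(param_count(80, 8192, 28672, 64,  8, 128, 32000))  # 68,976,648,192 (~69B)
```

Note how grouped-query attention shrinks the 70B model's attention: with only 8 KV heads, the K and V projections are one eighth the size of the query projection.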
The weight-only memory footprint of Llama-70B at different numeric precisions:

| Precision | Bytes/Param | Model Size (70B) | Notes |
|---|---|---|---|
| FP32 | 4 | 280 GB | Training |
| BF16/FP16 | 2 | 140 GB | Inference (typical) |
| INT8 | 1 | 70 GB | Quantized |
| INT4 | 0.5 | 35 GB | Aggressive quantization |
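These sizes follow directly from parameter count times bytes per parameter. A quick check, assuming the nominal 70-billion parameter count and decimal gigabytes (1 GB = 1e9 bytes); activations and the KV cache add to this at inference time.

```python
params = 70e9
for name, bytes_per_param in [("FP32", 4), ("BF16/FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{name:10s} {params * bytes_per_param / 1e9:5.0f} GB")
```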