AI Model Quantization Techniques
Post-training quantization, QAT, and format options (INT8, INT4, FP8) for LLM inference.
Reference
Precision levels
| Format | Bits | Memory vs FP16 | Quality loss | Notes |
|---|---|---|---|---|
| FP32 | 32 | 2× | — | Training default; overkill for inference |
| FP16 / BF16 | 16 | 1× | Baseline | Standard inference precision |
| FP8 | 8 | 0.5× | Tiny | E4M3 (precision) / E5M2 (range) |
| INT8 | 8 | 0.5× | Small | Widely supported |
| INT4 | 4 | 0.25× | Noticeable on small models | GPTQ, AWQ |
| INT2 / binary | 1–2 | 0.06–0.125× | Large | Research / extreme edge |
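The precision trade-off above comes down to a scale factor: lower-bit formats map weights onto a coarser grid, so the round-trip error grows as the bit width shrinks. A minimal NumPy sketch of symmetric per-tensor INT8 quantization (illustrative only, not any library's API):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    # Symmetric quantization: scale so the largest-magnitude weight maps to 127.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max abs round-trip error:", np.abs(w - w_hat).max())  # bounded by ~scale / 2
print("bytes fp32:", w.nbytes, "int8:", q.nbytes)            # 4x smaller than FP32, 2x vs FP16
```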
Methods
| Method | How | Use |
|---|---|---|
| Post-training (PTQ) | Quantize weights after training | Fast, no retraining |
| Calibration | Use small dataset to pick scales | Improves PTQ accuracy |
| Quantization-aware training (QAT) | Simulate low precision during training | Best accuracy at low bits |
| GPTQ | One-shot PTQ with second-order info | Popular for INT4 LLM weights |
| AWQ | Activation-aware scaling | Preserves important weights |
| GGUF (llama.cpp) | File format with block-wise, mixed-precision quant types (e.g., Q4_K_M, Q8_0) | Default for on-device LLM inference |
| SmoothQuant | Shift activation outliers to weights | INT8 inference on LLMs |
| Mixed precision | Keep sensitive layers at FP16 | Balances speed and accuracy |
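Calibration is the step that most methods in this table share: run a small, representative dataset through the model, record the observed activation range per layer, and derive quantization scales from it. A hedged sketch of that loop, assuming a generic min/max observer (the names `MinMaxObserver` and `calibration_batches` are hypothetical, not from any specific library):

```python
import numpy as np

class MinMaxObserver:
    """Tracks the largest absolute activation seen during calibration."""
    def __init__(self):
        self.amax = 0.0

    def observe(self, x: np.ndarray):
        self.amax = max(self.amax, float(np.abs(x).max()))

    def scale(self, bits: int = 8) -> float:
        qmax = 2 ** (bits - 1) - 1          # 127 for INT8
        return self.amax / qmax if self.amax else 1.0

# Calibration loop over a handful of representative inputs
# (placeholder data; in practice this would be a hook on a layer's output).
observer = MinMaxObserver()
calibration_batches = [np.random.randn(16, 4096) for _ in range(8)]
for batch in calibration_batches:
    observer.observe(batch)

act_scale = observer.scale()
print("chosen INT8 activation scale:", act_scale)
```

GPTQ, AWQ, and SmoothQuant refine this basic idea: instead of a plain min/max scale, they use second-order weight information, activation magnitudes, or outlier migration to decide how each weight group is rounded.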
Rules of thumb
- Memory ≈ params × bits / 8: a 7B model in INT4 ≈ 3.5 GB, FP16 ≈ 14 GB (see the sketch after this list).
- Larger models tolerate aggressive quantization better (70B at INT4 is often fine, 7B at INT4 can degrade).
- Quantization affects perplexity; check benchmarks (MMLU, HumanEval) on your specific model + bits combo.
- KV-cache quantization (separate from weight quantization) saves runtime memory for long contexts.
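A quick check of the memory rule of thumb; this counts weights only, so KV cache and activations are extra:

```python
def weight_gb(params_billion: float, bits: int) -> float:
    # Weight memory in GB: params × bits / 8 bytes per parameter.
    return params_billion * 1e9 * bits / 8 / 1e9

for bits in (16, 8, 4):
    print(f"7B @ {bits}-bit ≈ {weight_gb(7, bits):.1f} GB")
# 16-bit ≈ 14.0 GB, 8-bit ≈ 7.0 GB, 4-bit ≈ 3.5 GB
```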
Last updated: