AI Model Quantization Techniques
Post-training quantization, QAT, and format options (INT8, INT4, FP8) for LLM inference.
Reference
Precision levels
| Format | Bits | Memory vs FP16 | Quality loss | Notes |
|---|---|---|---|---|
| FP32 | 32 | 2× | — | Training default; overkill for inference |
| FP16 / BF16 | 16 | 1× | Baseline | Standard inference precision |
| FP8 | 8 | 0.5× | Tiny | E4M3 (precision) / E5M2 (range) |
| INT8 | 8 | 0.5× | Small | Widely supported |
| INT4 | 4 | 0.25× | Noticeable on small models | GPTQ, AWQ |
| INT2 / binary | 1–2 | 0.06–0.125× | Large | Research / extreme edge |
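The precision trade-off above comes down to a scale factor: lower-bit formats map weights onto a coarser grid, so the round-trip error grows as the bit width shrinks. A minimal NumPy sketch of symmetric per-tensor INT8 quantization (illustrative only, not any library's API):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    # Symmetric quantization: scale so the largest-magnitude weight maps to 127.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max abs round-trip error:", np.abs(w - w_hat).max())  # bounded by ~scale / 2
print("bytes fp32:", w.nbytes, "int8:", q.nbytes)            # 4x smaller than FP32, 2x vs FP16
```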
Methods
| Method | How | Use |
|---|---|---|
| Post-training (PTQ) | Quantize weights after training | Fast, no retraining |
| Calibration | Use small dataset to pick scales | Improves PTQ accuracy |
| Quantization-aware training (QAT) | Simulate low precision during training | Best accuracy at low bits |
| GPTQ | One-shot PTQ with second-order info | Popular for INT4 LLM weights |
| AWQ | Activation-aware scaling | Preserves important weights |
| GGUF (llama.cpp) | File format with block-wise, mixed-precision quant types (e.g., Q4_K_M, Q8_0) | Default for on-device LLM inference |
| SmoothQuant | Shift activation outliers to weights | INT8 inference on LLMs |
| Mixed precision | Keep sensitive layers at FP16 | Balances speed and accuracy |
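Calibration is the step that most methods in this table share: run a small, representative dataset through the model, record the observed activation range per layer, and derive quantization scales from it. A hedged sketch of that loop, assuming a generic min/max observer (the names `MinMaxObserver` and `calibration_batches` are hypothetical, not from any specific library):

```python
import numpy as np

class MinMaxObserver:
    """Tracks the largest absolute activation seen during calibration."""
    def __init__(self):
        self.amax = 0.0

    def observe(self, x: np.ndarray):
        self.amax = max(self.amax, float(np.abs(x).max()))

    def scale(self, bits: int = 8) -> float:
        qmax = 2 ** (bits - 1) - 1          # 127 for INT8
        return self.amax / qmax if self.amax else 1.0

# Calibration loop over a handful of representative inputs
# (placeholder data; in practice this would be a hook on a layer's output).
observer = MinMaxObserver()
calibration_batches = [np.random.randn(16, 4096) for _ in range(8)]
for batch in calibration_batches:
    observer.observe(batch)

act_scale = observer.scale()
print("chosen INT8 activation scale:", act_scale)
```

GPTQ, AWQ, and SmoothQuant refine this basic idea: instead of a plain min/max scale, they use second-order weight information, activation magnitudes, or outlier migration to decide how each weight group is rounded.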
Rules of thumb
- Memory ≈ params × bits / 8: a 7B model in INT4 ≈ 3.5 GB, FP16 ≈ 14 GB (see the sketch after this list).
- Larger models tolerate aggressive quantization better (70B at INT4 is often fine, 7B at INT4 can degrade).
- Quantization affects perplexity; check benchmarks (MMLU, HumanEval) on your specific model + bits combo.
- KV-cache quantization (separate from weight quantization) saves runtime memory for long contexts.
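A quick check of the memory rule of thumb; this counts weights only, so KV cache and activations are extra:

```python
def weight_gb(params_billion: float, bits: int) -> float:
    # Weight memory in GB: params × bits / 8 bytes per parameter.
    return params_billion * 1e9 * bits / 8 / 1e9

for bits in (16, 8, 4):
    print(f"7B @ {bits}-bit ≈ {weight_gb(7, bits):.1f} GB")
# 16-bit ≈ 14.0 GB, 8-bit ≈ 7.0 GB, 4-bit ≈ 3.5 GB
```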
Last updated: