AI Model Quantization Techniques

Post-training quantization, QAT, and format options (int8, int4, fp8) for LLM inference.

Updated Apr 19, 2026

Precision levels

Format	Bits	Memory vs FP16	Quality loss	Notes
FP32	32	2×	None	Training default; overkill for inference
FP16 / BF16	16	1×	Baseline	Standard inference precision
FP8	8	0.5×	Tiny	Two variants: E4M3 (more precision) and E5M2 (more range)
INT8	8	0.5×	Small	Widely supported
INT4	4	0.25×	Noticeable on small models	Typical target for GPTQ and AWQ
INT2 / binary	1–2	0.06–0.125×	Large	Research / extreme edge
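
To make the bit widths in the table concrete, here is a minimal sketch of symmetric absmax quantization in plain Python. It assumes a single scale per tensor and toy weight values chosen for illustration; real toolchains usually use per-channel or per-group scales, and the function names are hypothetical rather than any library's API.

```python
def quantize_absmax(values, bits):
    """Symmetric quantization: map floats to signed integers in [-qmax, qmax]."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0
    # Round to the nearest integer level and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map integer levels back to approximate float values."""
    return [x * scale for x in q]


def max_abs_error(values, bits):
    """Worst-case round-trip error for one tensor at the given bit width."""
    q, scale = quantize_absmax(values, bits)
    restored = dequantize(q, scale)
    return max(abs(a - b) for a, b in zip(values, restored))


# Toy weights (illustrative values, not taken from a real model).
weights = [0.82, -0.31, 0.05, -1.27, 0.44, 0.009]
for bits in (8, 4):
    print(f"INT{bits}: max round-trip error = {max_abs_error(weights, bits):.4f}")
```

The round-trip error is bounded by half the scale, so halving the bit width from 8 to 4 makes the step size roughly 18 times coarser for the same tensor.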

Methods

Method	How it works	When to use
Post-training (PTQ)	Quantize a trained model's weights (and optionally activations)	Fast, no retraining
Calibration	Use a small dataset to pick scales	Improves PTQ accuracy
Quantization-aware training (QAT)	Simulate low precision during training	Best accuracy at low bits
GPTQ	One-shot PTQ using approximate second-order (Hessian) information	Popular for INT4 LLM weights
AWQ	Scale weight channels based on activation statistics	Protects the most salient weights at INT4
GGUF (llama.cpp)	File format with several block-quantized types (e.g. Q4_K_M, Q8_0); runs on CPU and GPU	Common choice for on-device LLM inference
SmoothQuant	Migrate activation outliers into weights via per-channel scaling	INT8 weight-and-activation (W8A8) inference on LLMs
Mixed precision	Keep sensitive layers at FP16	Balances speed and accuracy
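
The calibration and SmoothQuant rows can be illustrated together with a small sketch. It assumes a toy calibration batch and weight matrix (made-up values) and uses the per-channel smoothing factor s_j = max|X_j|^α / max|W_j|^(1−α) with α = 0.5; the helper names are hypothetical, and a real implementation would fold the scales into the preceding layer instead of rescaling activations at runtime.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def smoothing_scales(calib_acts, weights, alpha=0.5):
    """Per-input-channel scale s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)."""
    act_max = [max(abs(v) for v in col) for col in zip(*calib_acts)]
    w_max = [max(abs(v) for v in row) for row in weights]
    return [(a ** alpha) / (w ** (1 - alpha)) for a, w in zip(act_max, w_max)]


def smooth(acts, weights, scales):
    """Divide activation channel j by s_j and multiply weight row j by s_j."""
    acts_s = [[v / s for v, s in zip(row, scales)] for row in acts]
    weights_s = [[v * s for v in row] for row, s in zip(weights, scales)]
    return acts_s, weights_s


# Toy calibration batch (2 tokens x 3 channels); channel 1 has an outlier.
calib_acts = [[0.5, 40.0, -0.3],
              [-0.2, -35.0, 0.4]]
# Toy weight matrix (3 input channels x 2 output channels).
weights = [[0.10, -0.20],
           [0.05, 0.02],
           [-0.30, 0.25]]

scales = smoothing_scales(calib_acts, weights)
acts_s, weights_s = smooth(calib_acts, weights, scales)

# The layer output is mathematically unchanged, but the activation range shrinks.
print("original output:", matmul(calib_acts, weights))
print("smoothed output:", matmul(acts_s, weights_s))
print("activation absmax before:", max(abs(v) for r in calib_acts for v in r))
print("activation absmax after: ", max(abs(v) for r in acts_s for v in r))
```

Because the outlier channel is divided by a large scale and the matching weight row is multiplied by the same scale, both tensors end up with ranges that are easier to cover with INT8 levels.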

Rules of thumb

Weight memory in bytes ≈ params × bits / 8: a 7B model is ≈ 3.5 GB in INT4 and ≈ 14 GB in FP16, before counting KV cache, activations, and quantization scales.
Larger models tolerate aggressive quantization better (70B at INT4 is often fine, 7B at INT4 can degrade).
Quantization usually raises perplexity; also check task benchmarks (MMLU, HumanEval) on your specific model and bit-width combination.
KV-cache quantization (separate from weight quantization) saves runtime memory for long contexts.
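
The memory rule and the KV-cache point above can be checked with a few lines of arithmetic. This is a rough sketch: it counts weights and cached keys/values only, treats 1 GB as 10^9 bytes, ignores quantization scale overhead, and the model shape in the KV-cache example (32 layers, 32 KV heads, head dimension 128) is a hypothetical configuration, not a specific model.

```python
def weight_memory_gb(params, bits):
    """Weights only: params * bits / 8 bytes, reported in GB (1e9 bytes)."""
    return params * bits / 8 / 1e9


def kv_cache_memory_gb(layers, kv_heads, head_dim, seq_len, bits, batch=1):
    """Keys plus values cached for every layer and every token in the context."""
    elements = 2 * layers * kv_heads * head_dim * seq_len * batch
    return elements * bits / 8 / 1e9


# Weight memory for a 7B-parameter model at different precisions.
for name, bits in (("FP16", 16), ("INT8", 8), ("INT4", 4)):
    print(f"7B weights at {name}: {weight_memory_gb(7e9, bits):.1f} GB")

# KV cache for a hypothetical model shape at a 32k-token context.
for name, bits in (("FP16", 16), ("INT8", 8)):
    gb = kv_cache_memory_gb(layers=32, kv_heads=32, head_dim=128,
                            seq_len=32_768, bits=bits)
    print(f"KV cache at {name}: {gb:.2f} GB")
```

At long contexts the cache can rival or exceed the quantized weights, which is why it is worth quantizing separately.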
