AI Model Quantization Techniques

Post-training quantization, QAT, and format options (int8, int4, fp8) for LLM inference.


Precision levels

Format | Bits | Memory vs FP16 | Quality loss | Notes
FP32 | 32 | 2× | None | Training default; overkill for inference
FP16 / BF16 | 16 | 1× (baseline) | None (baseline) | Standard inference precision
FP8 | 8 | 0.5× | Tiny | E4M3 (precision) / E5M2 (range)
INT8 | 8 | 0.5× | Small | Widely supported
INT4 | 4 | 0.25× | Noticeable on small models | GPTQ, AWQ
INT2 / binary | 1–2 | 0.06–0.13× | Large | Research / extreme edge
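
As a concrete reference, here is a minimal NumPy sketch of symmetric per-tensor INT8 quantization, the simplest scheme behind the INT8 row above; the function names and tensor shapes are illustrative, not taken from any particular library.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8: map [-max|w|, +max|w|] onto [-127, 127]."""
    scale = np.abs(w).max() / 127.0             # one FP scale stored alongside the weights
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Recover a floating-point approximation of the original weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(1024, 1024).astype(np.float32)   # toy weight matrix
q, scale = quantize_int8(w)
err = np.abs(w - dequantize_int8(q, scale)).max()
print(f"max abs rounding error: {err:.5f} (bounded by scale/2 = {scale / 2:.5f})")
print(f"INT8 payload: {q.nbytes / 2**20:.1f} MiB vs {w.nbytes / 2**20:.1f} MiB in FP32")
```

Real deployments usually refine this with per-channel or per-group scales rather than a single per-tensor scale, which is where methods like GPTQ and AWQ come in.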

Methods

Method | How it works | When to use
Post-training quantization (PTQ) | Quantize weights after training | Fast, no retraining
Calibration | Use a small dataset to pick scales | Improves PTQ accuracy
Quantization-aware training (QAT) | Simulate low precision during training | Best accuracy at low bit widths
GPTQ | One-shot PTQ using second-order information | Popular for INT4 LLM weights
AWQ | Activation-aware weight scaling | Preserves the most important weights
GGUF (llama.cpp) | CPU/GPU file format with mixed precision | Default for on-device LLM inference
SmoothQuant | Shifts activation outliers into weights | INT8 inference on LLMs
Mixed precision | Keep sensitive layers at FP16 | Balances speed and accuracy
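
To make the PTQ, calibration, and QAT rows concrete, here is a rough numerical sketch; layer_fn, the identity "layer", and the calibration batch shapes are placeholders for illustration, not a real framework API.

```python
import numpy as np

def calibrate_activation_scale(layer_fn, calib_batches):
    """Static PTQ calibration: run a few batches, record the largest |activation|,
    and derive a fixed INT8 scale from it."""
    observed_max = 0.0
    for x in calib_batches:
        observed_max = max(observed_max, float(np.abs(layer_fn(x)).max()))
    return observed_max / 127.0

def fake_quant(x, scale):
    """Fake quantization, the core op in QAT: snap values to the INT8 grid but
    keep them in floating point so training can continue through the op."""
    return np.clip(np.round(x / scale), -127, 127) * scale

# Placeholder calibration set and a trivial identity "layer" for illustration.
calib = [np.random.randn(32, 768).astype(np.float32) for _ in range(8)]
scale = calibrate_activation_scale(lambda x: x, calib)
simulated = fake_quant(calib[0], scale)
print("activation scale:", scale, "max simulation error:", np.abs(simulated - calib[0]).max())
```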

Rules of thumb

  • Memory ≈ params × bits / 8 bytes: a 7B model in INT4 ≈ 3.5 GB, FP16 ≈ 14 GB (see the sketch after this list).
  • Larger models tolerate aggressive quantization better (70B at INT4 is often fine, 7B at INT4 can degrade).
  • Quantization affects perplexity; check benchmarks (MMLU, HumanEval) on your specific model + bits combo.
  • KV-cache quantization (separate from weight quantization) saves runtime memory for long contexts.
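
The first rule above reduces to a one-line sizing helper (weights only; KV cache, activations, and runtime overhead are extra), assuming decimal gigabytes:

```python
def weight_memory_gb(n_params: float, bits: int) -> float:
    """Memory ≈ params × bits / 8 bytes, reported in GB (decimal)."""
    return n_params * bits / 8 / 1e9

for bits in (16, 8, 4):
    print(f"7B parameters at {bits}-bit ≈ {weight_memory_gb(7e9, bits):.1f} GB")
# -> 14.0 GB, 7.0 GB, 3.5 GB, matching the rule of thumb above
```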
