
AI Model Quantization Techniques

Post-training quantization, QAT, and format options (int8, int4, fp8) for LLM inference.

Precision levels

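| Format | Bits | Memory vs FP16 | Quality loss | Notes |
|---|---|---|---|---|
| FP32 | 32 | 2× | - | Training default; overkill for inference |
| FP16 / BF16 | 16 | 1× (baseline) | - | Standard inference precision |
| FP8 | 8 | 0.5× | Tiny | E4M3 (more precision) / E5M2 (more range) |
| INT8 | 8 | 0.5× | Small | Widely supported |
| INT4 | 4 | 0.25× | Noticeable on small models | GPTQ, AWQ |
| INT2 / binary | 1–2 | ≈0.06–0.125× | Large | Research / extreme edge |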
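
To make the bits-versus-quality trade-off concrete, here is a minimal sketch of symmetric abs-max quantization, one simple scheme among several. The weight values are made up for illustration, and the helper names (`quantize_symmetric`, `dequantize`, `mean_abs_error`) are hypothetical rather than taken from any library.

```python
def quantize_symmetric(values, bits):
    """Map floats to signed integers in [-qmax, qmax] using one abs-max scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from integer levels."""
    return [x * scale for x in q]


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Made-up weights, for illustration only.
weights = [0.02, -0.51, 0.33, 1.27, -0.08, 0.75, -1.10, 0.41]

for bits in (8, 4):
    q, scale = quantize_symmetric(weights, bits)
    restored = dequantize(q, scale)
    err = mean_abs_error(weights, restored)
    print(f"INT{bits}: scale={scale:.4f} levels={q} mean_abs_error={err:.4f}")
```

With fewer bits the scale (step size) grows, so the round-trip error on the same weights grows with it. Real toolchains usually apply a separate scale per channel or per small group of weights rather than one scale for the whole tensor.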

Methods

| Method | How | Use |
|---|---|---|
| Post-training (PTQ) | Quantize weights after training | Fast, no retraining |
| Calibration | Use a small dataset to pick scales | Improves PTQ accuracy |
| Quantization-aware training (QAT) | Simulate low precision during training | Best accuracy at low bits |
| GPTQ | One-shot PTQ with second-order information | Popular for INT4 LLM weights |
| AWQ | Activation-aware scaling | Preserves important weights |
| GGUF (llama.cpp) | File format for CPU/GPU inference with mixed-precision quantization | Common choice for on-device LLM inference |
| SmoothQuant | Shift activation outliers into the weights | INT8 inference on LLMs |
| Mixed precision | Keep sensitive layers at FP16 | Balances speed and accuracy |
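
As an illustration of the calibration row, the sketch below picks an INT8 activation scale from a small calibration set in two ways: plain abs-max, and a percentile clip that ignores rare outliers. The data is synthetic (seeded Gaussian values plus one injected outlier), the 99th-percentile threshold is an arbitrary assumption, and the function names are hypothetical. It is not the procedure of any particular toolkit.

```python
import random


def absmax_scale(batches, bits=8):
    """Scale that covers the largest magnitude seen during calibration."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for batch in batches for v in batch)
    return peak / qmax


def percentile_scale(batches, bits=8, pct=0.99):
    """Scale that covers only the pct fraction of smallest magnitudes."""
    qmax = 2 ** (bits - 1) - 1
    mags = sorted(abs(v) for batch in batches for v in batch)
    idx = min(len(mags) - 1, int(pct * len(mags)))
    return mags[idx] / qmax


def fake_quantize(values, scale, bits=8):
    """Quantize then dequantize, clamping values outside the range."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
# Synthetic calibration set: 4 batches of 256 "activations" plus one outlier.
calibration_batches = [[rng.gauss(0.0, 1.0) for _ in range(256)] for _ in range(4)]
calibration_batches[0].append(40.0)  # injected outlier (assumption)

# Held-out typical activations to measure the effect of each scale.
held_out = [rng.gauss(0.0, 1.0) for _ in range(512)]

for name, fn in (("abs-max", absmax_scale), ("percentile", percentile_scale)):
    scale = fn(calibration_batches)
    err = mean_abs_error(held_out, fake_quantize(held_out, scale))
    print(f"{name}: scale={scale:.4f} mean_abs_error={err:.4f}")
```

A single outlier stretches the abs-max scale and wastes most integer levels on values that rarely occur. Clipping trades a large error on the outlier for a finer grid everywhere else, and whether that trade pays off depends on how much the model relies on those outliers.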

Rules of thumb

  • Weight memory (bytes) ≈ params × bits / 8: a 7B model in INT4 ≈ 3.5 GB, FP16 ≈ 14 GB. This excludes quantization scales, activations, and the KV cache.
  • Larger models tolerate aggressive quantization better (70B at INT4 is often fine, 7B at INT4 can degrade).
  • Quantization raises perplexity and can hurt task accuracy; run benchmarks (e.g. MMLU, HumanEval) on your specific model and bit-width combination.
  • KV-cache quantization (separate from weight quantization) saves runtime memory for long contexts.
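
The memory rules above can be sketched as a small calculator. The weight formula is the one from the first bullet; the KV-cache formula (keys plus values, per layer, per KV head, per token) is the standard estimate for a transformer decoder. The model shape used below (32 layers, 32 KV heads, head dimension 128, 4096-token context) is an assumed 7B-class configuration chosen for illustration, and both function names are hypothetical.

```python
def weight_memory_gb(params, bits):
    """Weight storage in GB (10^9 bytes): params × bits / 8."""
    return params * bits / 8 / 1e9


def kv_cache_gb(layers, kv_heads, head_dim, seq_len, batch=1, bits=16):
    """KV-cache storage in GB; the factor 2 covers one key and one value tensor."""
    elements = 2 * layers * kv_heads * head_dim * seq_len * batch
    return elements * bits / 8 / 1e9


# Weights: matches the 7B examples in the first rule of thumb.
print(f"7B FP16 weights: {weight_memory_gb(7e9, 16):.1f} GB")  # 14.0 GB
print(f"7B INT4 weights: {weight_memory_gb(7e9, 4):.1f} GB")   # 3.5 GB

# KV cache for an assumed 7B-class shape at a 4096-token context.
shape = dict(layers=32, kv_heads=32, head_dim=128, seq_len=4096)
print(f"KV cache FP16: {kv_cache_gb(**shape, bits=16):.2f} GB")  # 2.15 GB
print(f"KV cache INT8: {kv_cache_gb(**shape, bits=8):.2f} GB")   # 1.07 GB
```

The KV cache grows linearly with context length and batch size, while weight memory stays fixed, which is why KV-cache quantization matters most for long contexts. Actual usage will be somewhat higher than these estimates because of quantization scales, activations, and runtime overhead.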