Floating-Point Formats

IEEE 754 and ML-specific float formats — bit layout, range, and precision.

Formats

Format              Total bits  Sign bit  Exponent bits  Mantissa bits  Bias  Max       Min normal  Decimal digits
FP64 (double)       64          1         11             52             1023  ~1.8e308  ~2.2e-308   ~15.9
FP32 (float)        32          1         8              23             127   ~3.4e38   ~1.2e-38    ~7.2
FP16 (half)         16          1         5              10             15    65504     ~6.1e-5     ~3.3
BF16 (brain float)  16          1         8              7              127   ~3.4e38   ~1.2e-38    ~2.4
FP8 E4M3            8           1         4              3              7     448       ~1.56e-2    ~1
FP8 E5M2            8           1         5              2              15    57344     ~6.1e-5     ~0.8
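To make the bit fields concrete, here is a minimal sketch, assuming Python with NumPy (the helper name decompose_fp32 is just illustrative), that splits an FP32 value into the sign, exponent, and mantissa fields from the table above and reassembles the value using the bias of 127.

```python
import numpy as np

def decompose_fp32(x: float):
    """Split a value into its FP32 sign, exponent, and mantissa fields."""
    bits = np.float32(x).view(np.uint32)     # raw 32-bit pattern
    sign     = int(bits >> 31)               # 1 bit
    exponent = int((bits >> 23) & 0xFF)      # 8 bits, biased by 127
    mantissa = int(bits & 0x7FFFFF)          # 23 stored bits, implicit leading 1
    return sign, exponent, mantissa

sign, exp, man = decompose_fp32(-6.25)
# Reassemble a normal number: (-1)^sign * 1.mantissa * 2^(exponent - 127)
value = (-1) ** sign * (1 + man / 2**23) * 2 ** (exp - 127)
print(sign, exp, man, value)   # 1 129 4718592 -6.25
```

The same field arithmetic applies to the other formats once their widths and bias are substituted.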

Special values (IEEE 754)

  • ±0: sign bit set or clear, exponent = 0, mantissa = 0
  • ±∞: exponent all 1s, mantissa = 0
  • NaN: exponent all 1s, mantissa ≠ 0 (quiet/signaling variants)
  • Subnormals: exponent = 0, mantissa ≠ 0 (gradual underflow)
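These bit patterns can be built directly. A small sketch, again assuming Python with NumPy and an illustrative helper fp32_from_bits, constructs each special value for FP32 and shows how it behaves.

```python
import numpy as np

def fp32_from_bits(bits: int) -> np.float32:
    """Reinterpret a raw 32-bit pattern as an FP32 value."""
    return np.uint32(bits).view(np.float32)

neg_zero  = fp32_from_bits(0x8000_0000)  # sign=1, exponent=0,  mantissa=0
pos_inf   = fp32_from_bits(0x7F80_0000)  # exponent all 1s,     mantissa=0
quiet_nan = fp32_from_bits(0x7FC0_0000)  # exponent all 1s,     mantissa≠0
subnormal = fp32_from_bits(0x0000_0001)  # exponent=0,          mantissa≠0

print(neg_zero == 0.0)          # True  (+0 and -0 compare equal)
print(np.isinf(pos_inf))        # True
print(quiet_nan == quiet_nan)   # False (NaN never compares equal, even to itself)
print(subnormal)                # smallest positive FP32 value, 2**-149 ≈ 1.4e-45
```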

Notes

  • FP16 vs BF16: same 16 bits; BF16 trades precision (7 mantissa bits) for FP32-matching exponent range — preferred for ML training.
  • FP8 formats are used for quantized training/inference; E5M2 has more range, E4M3 more precision.
  • Integer representability: every integer with magnitude up to 2²⁴ is exactly representable in FP32, and up to 2⁵³ in FP64.
  • 0.1 + 0.2 ≠ 0.3: most decimal fractions (including 0.1 and 0.2) have no exact binary representation, so sums accumulate rounding error (see the snippet after this list).
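The last two notes are easy to reproduce. A minimal sketch, assuming Python with NumPy for the 16- and 32-bit types (BF16 is omitted because NumPy has no native bfloat16):

```python
import numpy as np

# FP16 stores 10 mantissa bits (11 significand bits with the implicit 1),
# so 2**11 + 1 = 2049 cannot be represented and rounds to the nearest FP16.
print(np.float16(2049))                    # 2048.0

# FP32 represents every integer up to 2**24 exactly, but not 2**24 + 1.
print(np.float32(2**24) == 2**24)          # True
print(np.float32(2**24 + 1) == 2**24 + 1)  # False (rounds back to 2**24)

# 0.1 and 0.2 have no exact binary representation, so their sum is not 0.3.
print(0.1 + 0.2)                           # 0.30000000000000004
print(0.1 + 0.2 == 0.3)                    # False
```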
