Numbers & Math

Floating-Point Formats

IEEE 754 and ML-specific float formats — bit layout, range, and precision.

Formats

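Format             Total bits  Sign  Exponent  Mantissa  Exponent bias  Max       Min normal  Decimal digits
FP64 (double)      64          1     11        52        1023           ~1.8e308  ~2.2e-308   ~15.9
FP32 (float)       32          1     8         23        127            ~3.4e38   ~1.2e-38    ~7.2
FP16 (half)        16          1     5         10        15             65504     ~6.1e-5     ~3.3
BF16 (brain float) 16          1     8         7         127            ~3.4e38   ~1.2e-38    ~2.4
FP8 E4M3           8           1     4         3         7              448       ~1.6e-2     ~1.2
FP8 E5M2           8           1     5         2         15             57344     ~6.1e-5     ~0.9

Min normal is 2^(1 - bias). For FP8 E4M3 that is 2^-6 ≈ 1.6e-2; the value ~1.95e-3 (2^-9) is its smallest subnormal. Decimal digits are (mantissa bits + 1) × log10(2).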
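The bias, range, and precision columns follow from the exponent and mantissa widths. The sketch below derives them under IEEE 754 conventions (all-ones exponent reserved for infinity/NaN, implicit leading 1 bit). The function name `ieee_style_params` is made up for illustration.

```python
import math

def ieee_style_params(exp_bits, man_bits):
    """Derive (bias, max finite, min normal, decimal digits) from field widths.

    Assumes IEEE 754 conventions: the all-ones exponent is reserved for
    infinity/NaN and normal numbers carry an implicit leading 1 bit.
    """
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 2) - bias        # largest non-reserved exponent
    max_finite = (2 - 2.0 ** -man_bits) * 2.0 ** max_exp
    min_normal = 2.0 ** (1 - bias)
    digits = (man_bits + 1) * math.log10(2)     # +1 for the implicit bit
    return bias, max_finite, min_normal, digits

# (exponent bits, mantissa bits) taken from the table above
formats = {"FP64": (11, 52), "FP32": (8, 23), "FP16": (5, 10),
           "BF16": (8, 7), "FP8 E5M2": (5, 2)}

for name, (e, m) in formats.items():
    bias, hi, lo, digits = ieee_style_params(e, m)
    print(f"{name:9s} bias={bias:<5d} max={hi:.3g} min_normal={lo:.3g} digits={digits:.1f}")
```

FP8 E4M3 is left out of the loop on purpose. Under strict IEEE conventions `ieee_style_params(4, 3)` gives a maximum of 240. The 448 in the table corresponds to the common FP8 variant that reuses the all-ones exponent for finite values, so its maximum is 1.75 × 2^8.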

Special values (IEEE 754)

±0          sign bit clear/set, exponent = 0, mantissa = 0
±∞          exponent all 1s, mantissa = 0
NaN         exponent all 1s, mantissa ≠ 0 (quiet/signaling variants)
Subnormals  exponent = 0, mantissa ≠ 0 (gradual underflow)
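These encoding rules can be checked by splitting a value's bit pattern into its fields. A minimal sketch for FP32, using Python's standard `struct` module; the helper name `classify_fp32` is hypothetical.

```python
import struct

def classify_fp32(x):
    """Return (sign, exponent, mantissa, kind) for x encoded as FP32."""
    # Reinterpret the 4 bytes of the float as an unsigned 32-bit integer.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF      # 8 exponent bits
    mantissa = bits & 0x7FFFFF          # 23 mantissa bits
    if exponent == 0xFF:                # all 1s: infinity or NaN
        kind = "inf" if mantissa == 0 else "nan"
    elif exponent == 0:                 # all 0s: zero or subnormal
        kind = "zero" if mantissa == 0 else "subnormal"
    else:
        kind = "normal"
    return sign, exponent, mantissa, kind

for value in (0.0, -0.0, 1.0, 1e-40, float("inf"), float("nan")):
    print(value, classify_fp32(value))
```

Here `1e-40` lies below FP32's smallest normal value (~1.2e-38), so it is classified as subnormal. `struct.pack("<f", x)` raises `OverflowError` for finite inputs beyond the FP32 range, so the sketch only handles values that fit.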

Notes

  • FP16 vs BF16: same 16 bits; BF16 trades precision (7 mantissa bits) for FP32-matching exponent range — preferred for ML training.
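  • FP8 formats are used for quantized training/inference; E5M2 has more range, E4M3 more precision. The E4M3 variant listed here (max 448) departs from IEEE 754: it has no infinities and reserves only the all-ones mantissa at the top exponent for NaN.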
  • Exact integers: every integer with magnitude up to 2²⁴ is exactly representable in FP32, and up to 2⁵³ in FP64. FP64 therefore holds any 32-bit int exactly; FP32 does not.
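  • 0.1 + 0.2 ≠ 0.3: binary floats can't represent most decimal fractions exactly (only sums of powers of two, such as 0.5 or 0.25, are exact), so compare with a tolerance rather than ==.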
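The notes above can be reproduced with the standard library alone. In this sketch the helper names are hypothetical, and BF16 is emulated by truncating an FP32 bit pattern to its top 16 bits. That truncation is an assumption made for simplicity; real hardware typically rounds to nearest.

```python
import math
import struct

def round_to_fp32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def round_to_fp16(x):
    """Round a Python float to the nearest FP16 value ('e' is IEEE half)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def truncate_to_bf16(x):
    """Emulate BF16 by keeping the top 16 bits of the FP32 encoding.

    This truncates rather than rounds, which is a simplification.
    """
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# Decimal fractions: exact comparison fails, tolerance comparison succeeds.
print(0.1 + 0.2 == 0.3)              # False
print(math.isclose(0.1 + 0.2, 0.3))  # True

# Exact integers: 2**24 + 1 has no FP32 encoding and rounds back to 2**24.
print(round_to_fp32(float(2**24 + 1)))   # 16777216.0
print(2.0**53 + 1 == 2.0**53)            # True: same limit for FP64 at 2**53

# FP16 vs BF16 on the same input: FP16 keeps more digits of 0.1.
print(round_to_fp16(0.1))      # 0.0999755859375
print(truncate_to_bf16(0.1))   # 0.099609375

# BF16 still covers values beyond FP16's maximum of 65504, but coarsely.
print(truncate_to_bf16(70000.0))  # 69632.0
```

The last line illustrates the trade-off: 70000 is out of range for FP16 but fits in BF16, at the cost of landing on a grid with spacing 512 at that magnitude.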