# Floating-Point Formats
IEEE 754 and ML-specific float formats — bit layout, range, and precision.
## Formats
| Format | Total bits | Sign | Exponent | Mantissa | Exponent bias | Max | Min normal | Decimal digits |
|---|---|---|---|---|---|---|---|---|
| FP64 (double) | 64 | 1 | 11 | 52 | 1023 | ~1.8e308 | ~2.2e-308 | ~15.9 |
| FP32 (float) | 32 | 1 | 8 | 23 | 127 | ~3.4e38 | ~1.2e-38 | ~7.2 |
| FP16 (half) | 16 | 1 | 5 | 10 | 15 | 65504 | ~6.1e-5 | ~3.3 |
| BF16 (brain float) | 16 | 1 | 8 | 7 | 127 | ~3.4e38 | ~1.2e-38 | ~2.4 |
| FP8 E4M3 | 8 | 1 | 4 | 3 | 7 | 448 | ~1.6e-2 | ~1.2 |
| FP8 E5M2 | 8 | 1 | 5 | 2 | 15 | 57344 | ~6.1e-5 | ~0.9 |
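
The bit layout in the table can be inspected directly. A minimal sketch in Python using the standard `struct` module; `fp32_fields` is a name invented here for illustration:

```python
import struct

def fp32_fields(x: float) -> tuple[int, int, int]:
    """Split a value (rounded to FP32) into its sign, exponent, and mantissa fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))  # raw 32-bit pattern
    sign = bits >> 31                  # 1 bit
    exponent = (bits >> 23) & 0xFF     # 8 bits, stored with bias 127
    mantissa = bits & 0x7F_FFFF        # 23 bits, implicit leading 1 for normals
    return sign, exponent, mantissa

sign, exp, man = fp32_fields(-6.25)    # -6.25 = -1.5625 * 2**2
print(sign, exp - 127, bin(man))       # 1 2 0b10010000000000000000000
# For a normal number: value = (-1)**sign * (1 + mantissa / 2**23) * 2**(exponent - 127)
```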
## Special values (IEEE 754)
- ±0: exponent = 0, mantissa = 0; sign bit distinguishes +0 from -0
- ±∞: exponent all 1s, mantissa = 0
- NaN: exponent all 1s, mantissa ≠ 0 (quiet/signaling variants)
- Subnormals: exponent = 0, mantissa ≠ 0 (gradual underflow)
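
These encodings can be verified by reinterpreting raw bit patterns. A sketch using Python's `struct` module; `bits_to_fp32` is an illustrative name, not a standard API:

```python
import math
import struct

def bits_to_fp32(bits: int) -> float:
    """Reinterpret a 32-bit integer pattern as an IEEE 754 single."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

print(bits_to_fp32(0x7F800000))              # exponent all 1s, mantissa 0 -> inf
print(bits_to_fp32(0xFF800000))              # same with sign bit set      -> -inf
print(math.isnan(bits_to_fp32(0x7FC00001)))  # exponent all 1s, mantissa != 0 -> True (quiet NaN)
print(bits_to_fp32(0x00000001))              # exponent 0, mantissa 1 -> 2**-149, smallest subnormal
print(bits_to_fp32(0x80000000))              # only sign bit set -> -0.0
```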
## Notes
- FP16 vs BF16: both are 16 bits, but BF16 trades mantissa precision (7 bits vs FP16's 10) for the full FP32 exponent range, so it overflows far less often; this is why it is preferred for ML training (see the truncation sketch after this list).
- FP8 formats are used for quantized training/inference; E5M2 has more range, E4M3 more precision. E4M3 reaches 448 by giving up ±∞ and reserving only the all-ones pattern for NaN.
- Integer exactness: FP32 represents every integer up to 2²⁴ exactly, FP64 up to 2⁵³; beyond that, consecutive integers round to the same value (demo after this list).
- 0.1 + 0.2 ≠ 0.3: binary floats cannot represent most decimal fractions exactly, so each literal is rounded before the addition even happens (demo after this list).
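
Because BF16 keeps FP32's sign and exponent bits, converting between the two amounts to truncating or padding the mantissa. A pure-Python sketch with round-to-nearest-even; the function names are chosen here for illustration, and NaN inputs are not special-cased:

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """Round an FP32 value to BF16 (round-to-nearest-even) and return its 16-bit pattern."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    # BF16 keeps the sign and all 8 exponent bits, but only the top 7 mantissa bits.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)  # ties round to the even low bit
    return ((bits + rounding_bias) >> 16) & 0xFFFF

def bf16_bits_to_fp32(bits16: int) -> float:
    """Widen a BF16 bit pattern back to FP32 by padding with zero mantissa bits."""
    return struct.unpack(">f", struct.pack(">I", bits16 << 16))[0]

# Pi survives with only ~2-3 decimal digits of precision:
print(bf16_bits_to_fp32(fp32_to_bf16_bits(3.141592653589793)))  # 3.140625
```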
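The 2²⁴ and 2⁵³ limits can be checked directly. Python's own `float` is FP64, so this sketch assumes numpy is available for the FP32 case:

```python
import numpy as np

print(np.float32(2**24) == np.float32(2**24 + 1))  # True: FP32 ulp is 2 above 2**24
print(float(2**53) == float(2**53 + 1))            # True: FP64 ulp is 2 above 2**53
print(float(2**53 - 1) == float(2**53 - 2))        # False: still exact below 2**53
```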
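The classic 0.1 + 0.2 result falls straight out of this rounding. `decimal.Decimal` reveals the FP64 values actually stored behind the literals:

```python
import math
from decimal import Decimal

print(0.1 + 0.2)        # 0.30000000000000004
print(0.1 + 0.2 == 0.3) # False
print(Decimal(0.1))     # 0.1000000000000000055511151231257827021181583404541015625
print(Decimal(0.3))     # 0.299999999999999988897769753748434595763683319091796875
# Compare with a tolerance instead of ==:
print(math.isclose(0.1 + 0.2, 0.3))  # True
```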