# Floating-Point Formats
IEEE 754 and ML-specific float formats — bit layout, range, and precision.
## Formats
| Format | Total bits | Sign | Exponent | Mantissa | Exponent bias | Max | Min normal | Decimal digits |
|---|---|---|---|---|---|---|---|---|
| FP64 (double) | 64 | 1 | 11 | 52 | 1023 | ~1.8e308 | ~2.2e-308 | ~15.9 |
| FP32 (float) | 32 | 1 | 8 | 23 | 127 | ~3.4e38 | ~1.2e-38 | ~7.2 |
| FP16 (half) | 16 | 1 | 5 | 10 | 15 | 65504 | ~6.1e-5 | ~3.3 |
| BF16 (brain float) | 16 | 1 | 8 | 7 | 127 | ~3.4e38 | ~1.2e-38 | ~2.4 |
| FP8 E4M3 | 8 | 1 | 4 | 3 | 7 | 448 | ~1.6e-2 | ~1.2 |
| FP8 E5M2 | 8 | 1 | 5 | 2 | 15 | 57344 | ~6.1e-5 | ~0.9 |
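
The bit layout in the table can be inspected directly. A minimal sketch in Python using the standard `struct` module; `fp32_fields` is a name invented here for illustration:

```python
import struct

def fp32_fields(x: float) -> tuple[int, int, int]:
    """Split a value (rounded to FP32) into its sign, exponent, and mantissa fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))  # raw 32-bit pattern
    sign = bits >> 31                  # 1 bit
    exponent = (bits >> 23) & 0xFF     # 8 bits, stored with bias 127
    mantissa = bits & 0x7F_FFFF        # 23 bits, implicit leading 1 for normals
    return sign, exponent, mantissa

sign, exp, man = fp32_fields(-6.25)    # -6.25 = -1.5625 * 2**2
print(sign, exp - 127, bin(man))       # 1 2 0b10010000000000000000000
# For a normal number: value = (-1)**sign * (1 + mantissa / 2**23) * 2**(exponent - 127)
```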
## Special values (IEEE 754)
- ±0: exponent = 0, mantissa = 0; sign bit distinguishes +0 from -0
- ±∞: exponent all 1s, mantissa = 0
- NaN: exponent all 1s, mantissa ≠ 0 (quiet/signaling variants)
- Subnormals: exponent = 0, mantissa ≠ 0 (gradual underflow)
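
These encodings can be verified by reinterpreting raw bit patterns. A sketch using Python's `struct` module; `bits_to_fp32` is an illustrative name, not a standard API:

```python
import math
import struct

def bits_to_fp32(bits: int) -> float:
    """Reinterpret a 32-bit integer pattern as an IEEE 754 single."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

print(bits_to_fp32(0x7F800000))              # exponent all 1s, mantissa 0 -> inf
print(bits_to_fp32(0xFF800000))              # same with sign bit set      -> -inf
print(math.isnan(bits_to_fp32(0x7FC00001)))  # exponent all 1s, mantissa != 0 -> True (quiet NaN)
print(bits_to_fp32(0x00000001))              # exponent 0, mantissa 1 -> 2**-149, smallest subnormal
print(bits_to_fp32(0x80000000))              # only sign bit set -> -0.0
```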
## Notes
- FP16 vs BF16: both are 16 bits, but BF16 trades mantissa precision (7 bits vs FP16's 10) for the full FP32 exponent range, so it overflows far less often; this is why it is preferred for ML training (see the truncation sketch after this list).
- FP8 formats are used for quantized training/inference; E5M2 has more range, E4M3 more precision. E4M3 reaches 448 by giving up ±∞ and reserving only the all-ones pattern for NaN.
- Integer exactness: FP32 represents every integer up to 2²⁴ exactly, FP64 up to 2⁵³; beyond that, consecutive integers round to the same value (demo after this list).
- 0.1 + 0.2 ≠ 0.3: binary floats cannot represent most decimal fractions exactly, so each literal is rounded before the addition even happens (demo after this list).
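
Because BF16 keeps FP32's sign and exponent bits, converting between the two amounts to truncating or padding the mantissa. A pure-Python sketch with round-to-nearest-even; the function names are chosen here for illustration, and NaN inputs are not special-cased:

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """Round an FP32 value to BF16 (round-to-nearest-even) and return its 16-bit pattern."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    # BF16 keeps the sign and all 8 exponent bits, but only the top 7 mantissa bits.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)  # ties round to the even low bit
    return ((bits + rounding_bias) >> 16) & 0xFFFF

def bf16_bits_to_fp32(bits16: int) -> float:
    """Widen a BF16 bit pattern back to FP32 by padding with zero mantissa bits."""
    return struct.unpack(">f", struct.pack(">I", bits16 << 16))[0]

# Pi survives with only ~2-3 decimal digits of precision:
print(bf16_bits_to_fp32(fp32_to_bf16_bits(3.141592653589793)))  # 3.140625
```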
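The 2²⁴ and 2⁵³ limits can be checked directly. Python's own `float` is FP64, so this sketch assumes numpy is available for the FP32 case:

```python
import numpy as np

print(np.float32(2**24) == np.float32(2**24 + 1))  # True: FP32 ulp is 2 above 2**24
print(float(2**53) == float(2**53 + 1))            # True: FP64 ulp is 2 above 2**53
print(float(2**53 - 1) == float(2**53 - 2))        # False: still exact below 2**53
```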
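The classic 0.1 + 0.2 result falls straight out of this rounding. `decimal.Decimal` reveals the FP64 values actually stored behind the literals:

```python
import math
from decimal import Decimal

print(0.1 + 0.2)        # 0.30000000000000004
print(0.1 + 0.2 == 0.3) # False
print(Decimal(0.1))     # 0.1000000000000000055511151231257827021181583404541015625
print(Decimal(0.3))     # 0.299999999999999988897769753748434595763683319091796875
# Compare with a tolerance instead of ==:
print(math.isclose(0.1 + 0.2, 0.3))  # True
```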