Floating-Point Errors
Where IEEE 754 bites — rounding, cancellation, infinities, NaN propagation, accumulation.
Reference
Classic gotchas
- 0.1 + 0.2 = 0.30000000000000004: 0.1 and 0.2 have no exact binary representation.
- x = x + 1 eventually stops working: once x ≥ 2⁵³, adding 1 has no effect in FP64 (the gap between adjacent doubles exceeds 1).
- Associativity is lost: (a + b) + c may differ from a + (b + c).
- Comparing floats with ==: use |a − b| < ε or a relative tolerance instead.
- Catastrophic cancellation: subtracting nearly-equal values loses precision.
- Accumulation drift: summing many small values gradually loses precision — use Kahan summation.
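Several of these gotchas can be reproduced directly; a minimal Python sketch (FP64 doubles):

```python
import math

# 0.1 and 0.2 each round to a nearby binary fraction, so their
# rounded sum is not the double closest to 0.3.
print(0.1 + 0.2)          # 0.30000000000000004
assert 0.1 + 0.2 != 0.3

# Above 2**53 the gap between adjacent doubles exceeds 1,
# so adding 1 rounds straight back to x.
x = 2.0 ** 53
assert x + 1 == x

# Compare with a tolerance instead of ==.
assert math.isclose(0.1 + 0.2, 0.3, rel_tol=1e-9)

# Catastrophic cancellation: almost all significant digits vanish.
a = 1.0 + 1e-15
print(a - 1.0)            # close to, but not exactly, 1e-15
```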
Special values
- +0 and −0
  - Distinct bit patterns, but they compare equal with ==
- +∞ / −∞
  - Result of overflow, or of dividing a nonzero value by zero
- NaN
  - Not-a-Number — result of 0/0, ∞ − ∞, sqrt(−1)
- NaN != NaN
  - Always true: NaN compares unequal to everything, including itself (NaN == NaN is false); use isnan(x) to test
- Subnormal
  - Very small numbers below ~2.2e-308 (FP64's smallest normal value) — slower on most CPUs
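The special values are easy to poke at interactively; a small Python sketch of the behaviors above:

```python
import math

# Signed zeros: distinct bit patterns, equal under ==.
pz, nz = 0.0, -0.0
assert pz == nz
assert math.copysign(1.0, nz) == -1.0   # the sign bit survives

# Infinity from overflow; arithmetic with it is well defined.
inf = float("inf")
assert 1e308 * 10 == inf
assert 1.0 / inf == 0.0

# NaN: unequal to everything, including itself.
# (Python raises on 0.0 / 0.0, so construct NaN directly.)
nan = float("nan")
assert nan != nan
assert math.isnan(nan)

# Smallest positive subnormal double, ~5e-324.
tiny = 5e-324
assert 0.0 < tiny < 2.3e-308
```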
Precision tips
- Kahan summation: carries a running error term to recover lost precision.
- Sort before summing: when terms share a sign, add them in order of increasing magnitude so small values aren't swamped by a large running total.
- Use log-sum-exp: log(Σ e^x) = max(x) + log(Σ e^(x−max)) to avoid overflow.
- Fused multiply-add (FMA): computes a·b + c with one rounding instead of two; available via std::fma or hardware instructions.
- Higher precision temporaries: do intermediate math in FP64 even if inputs/outputs are FP32.
- Compensated algorithms: Dekker, Neumaier, etc. trade speed for accuracy.
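Two of these tips sketched in Python (kahan_sum and logsumexp are illustrative names, not standard functions; the stdlib's math.fsum already provides compensated summation):

```python
import math

def kahan_sum(values):
    """Kahan compensated summation: track lost low-order bits in c."""
    total = 0.0
    c = 0.0                   # running compensation term
    for v in values:
        y = v - c             # re-inject previously lost bits
        t = total + y         # big + small: low bits of y may be lost...
        c = (t - total) - y   # ...recover them algebraically
        total = t
    return total

def logsumexp(xs):
    """log(sum(exp(x))) without overflow, by factoring out max(x)."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

# Naive summation drifts; compensated summation recovers the
# correctly rounded result.
naive = sum([0.1] * 10)        # 0.9999999999999999
exact = kahan_sum([0.1] * 10)  # 1.0

# exp(1000) overflows on its own, but logsumexp stays finite.
lse = logsumexp([1000.0, 1000.0])   # 1000 + log(2)
```

For production code, math.fsum (Python) gives a correctly rounded sum without hand-rolling the compensation loop.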
When to avoid floats
- Money: use fixed-point or decimal types (BigDecimal, money libs).
- Exact comparisons: integers are exact; use scaled integers for deterministic arithmetic.
- Cryptographic code: constant-time integer operations only.
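For money, Python's decimal module illustrates the idea; the price, quantity, and 8.25% tax rate below are made-up example values:

```python
from decimal import Decimal, ROUND_HALF_UP

# Binary floats cannot represent most decimal cent values exactly.
assert 0.1 + 0.2 != 0.3

# Decimal arithmetic is exact for decimal amounts.
price = Decimal("19.99")
qty = 3
total = price * qty
assert total == Decimal("59.97")

# Rounding is explicit and well defined (here: half-up to cents).
tax = (total * Decimal("0.0825")).quantize(
    Decimal("0.01"), rounding=ROUND_HALF_UP
)
assert tax == Decimal("4.95")
```

Construct Decimals from strings (not floats), or the binary representation error is baked in before the decimal math starts.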