Floating-Point Errors

Where IEEE 754 bites — rounding, cancellation, infinities, NaN propagation, accumulation.

Reference · Updated Apr 19, 2026

Classic gotchas

  • 0.1 + 0.2 = 0.30000000000000004: 0.1 and 0.2 have no exact binary representation.
  • x = x + 1 eventually stops working: once x ≥ 2⁵³, the gap between adjacent FP64 values exceeds 1, so adding 1 rounds back to x.
  • Associativity is lost: (a + b) + c may differ from a + (b + c).
  • Comparing floats with ==: use |a − b| < ε or relative tolerance.
  • Catastrophic cancellation: subtracting nearly-equal values loses precision.
  • Accumulation drift: summing many small values gradually loses precision — use Kahan summation.
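Each of the gotchas above is easy to reproduce. A minimal Python sketch (the specific values, like 1e16, are illustrative choices):

```python
import math

# 0.1 and 0.2 each round to the nearest binary64 value,
# so their sum is not exactly 0.3.
assert 0.1 + 0.2 != 0.3
print(0.1 + 0.2)  # 0.30000000000000004

# At 2**53 the spacing between adjacent doubles is 2,
# so adding 1 rounds back to the same value.
x = 2.0 ** 53
assert x + 1 == x

# Associativity is lost: grouping changes which bits are rounded away.
a, b, c = 1e16, -1e16, 1.0
assert (a + b) + c != a + (b + c)

# Compare with a tolerance instead of ==; math.isclose applies a
# relative tolerance by default.
assert math.isclose(0.1 + 0.2, 0.3)
```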

Special values

  • +0 and −0: distinct bit patterns, but they compare equal with ==.
  • +∞ / −∞: produced by overflow or by dividing a nonzero value by zero.
  • NaN: Not-a-Number; produced by invalid operations such as 0/0, ∞ − ∞, and sqrt(−1).
  • NaN comparisons: NaN == NaN is always false (and NaN != NaN is always true), so use isnan(x) to test for it.
  • Subnormals: nonzero values below ~2.2 × 10⁻³⁰⁸ (the smallest normal FP64 number); they fill the gap down to zero but are slower on most CPUs.
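These special values can all be observed directly; a short Python sketch:

```python
import math

# Signed zeros: distinct bit patterns that compare equal.
assert 0.0 == -0.0
assert math.copysign(1.0, -0.0) == -1.0  # the sign is still observable

# Overflow produces an infinity.
assert 1e308 * 10 == math.inf
assert math.inf + math.inf == math.inf

# NaN propagates and never compares equal to anything, itself included.
nan = float("nan")
assert nan != nan            # always true
assert not (nan == nan)      # always false
assert math.isnan(nan)       # the reliable test

# Subnormals: nonzero values below the smallest normal double.
tiny = 5e-324                # smallest positive FP64 subnormal
assert 0.0 < tiny < 2.3e-308
```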

Precision tips

  • Kahan summation: carries a running error term to recover lost precision.
  • Sort before summing: add values in order of increasing magnitude, so small terms are not absorbed by a large partial sum.
  • Use log-sum-exp: log(Σ e^x) = max(x) + log(Σ e^(x−max)) to avoid overflow.
  • Fused multiply-add (FMA): one rounding instead of two; available via std::fma or hardware.
  • Higher precision temporaries: do intermediate math in FP64 even if inputs/outputs are FP32.
  • Compensated algorithms: Dekker, Neumaier, etc. trade speed for accuracy.
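Two of these tips, Kahan summation and log-sum-exp, fit in a few lines each. A Python sketch (the test data is illustrative; math.fsum serves as the correctly rounded reference):

```python
import math

def kahan_sum(xs):
    """Compensated summation: carry the rounding error of each add."""
    total = 0.0
    c = 0.0                  # running compensation for lost low-order bits
    for x in xs:
        y = x - c            # apply the correction to the next term
        t = total + y        # big + small: low-order bits of y may be lost
        c = (t - total) - y  # recover exactly what was lost
        total = t
    return total

def log_sum_exp(xs):
    """log(sum(exp(x))) without overflowing exp for large x."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

# Naive summation of 0.1 ten million times drifts; Kahan stays close
# to math.fsum, which returns the correctly rounded sum.
data = [0.1] * 10_000_000
exact = math.fsum(data)
print(abs(sum(data) - exact), abs(kahan_sum(data) - exact))

# exp(1000) overflows a double, but the shifted form is fine:
# log(e^1000 + e^1000) = 1000 + log(2).
print(log_sum_exp([1000.0, 1000.0]))
```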

When to avoid floats

  • Money: use fixed-point or decimal types (BigDecimal, money libs).
  • Exact comparisons: integers are exact; use scaled integers for deterministic arithmetic.
  • Cryptographic code: constant-time integer operations only.
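For the money case, Python's decimal module shows the idea; the prices and tax rate below are made-up examples:

```python
from decimal import Decimal, ROUND_HALF_UP

# Binary floats cannot represent 0.10 exactly, so cents drift:
assert 0.1 + 0.1 + 0.1 != 0.3

# Decimal works in base 10, so currency amounts stay exact.
price = Decimal("19.99")        # illustrative price
subtotal = price * 3            # exactly 59.97
tax = (subtotal * Decimal("0.0825")).quantize(  # illustrative tax rate
    Decimal("0.01"), rounding=ROUND_HALF_UP     # round to whole cents
)
assert subtotal == Decimal("59.97")
print(subtotal + tax)
```

Scaled integers (e.g. storing amounts as integer cents) achieve the same exactness without a decimal library.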