Floating-Point Errors
Where IEEE 754 bites — rounding, cancellation, infinities, NaN propagation, accumulation.
Reference
Classic gotchas
- 0.1 + 0.2 = 0.30000000000000004: 0.1 and 0.2 have no exact binary representation.
- x = x + 1 eventually stops working: once x ≥ 2⁵³, adding 1 has no effect in FP64 (the gap between adjacent doubles exceeds 1).
- Associativity is lost: (a + b) + c may differ from a + (b + c).
- Comparing floats with ==: use |a − b| < ε or a relative tolerance instead.
- Catastrophic cancellation: subtracting nearly-equal values loses precision.
- Accumulation drift: summing many small values gradually loses precision — use Kahan summation.
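Several of these gotchas can be reproduced directly; a minimal Python sketch (FP64 doubles):

```python
import math

# 0.1 and 0.2 each round to a nearby binary fraction, so their
# rounded sum is not the double closest to 0.3.
print(0.1 + 0.2)          # 0.30000000000000004
assert 0.1 + 0.2 != 0.3

# Above 2**53 the gap between adjacent doubles exceeds 1,
# so adding 1 rounds straight back to x.
x = 2.0 ** 53
assert x + 1 == x

# Compare with a tolerance instead of ==.
assert math.isclose(0.1 + 0.2, 0.3, rel_tol=1e-9)

# Catastrophic cancellation: almost all significant digits vanish.
a = 1.0 + 1e-15
print(a - 1.0)            # close to, but not exactly, 1e-15
```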
Special values
- +0 and −0
  - Distinct bit patterns, but they compare equal with ==
- +∞ / −∞
  - Result of overflow, or of dividing a nonzero value by zero
- NaN
  - Not-a-Number — result of 0/0, ∞ − ∞, sqrt(−1)
- NaN != NaN
  - Always true: NaN compares unequal to everything, including itself (NaN == NaN is false); use isnan(x) to test
- Subnormal
  - Very small numbers below ~2.2e-308 (FP64's smallest normal value) — slower on most CPUs
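The special values are easy to poke at interactively; a small Python sketch of the behaviors above:

```python
import math

# Signed zeros: distinct bit patterns, equal under ==.
pz, nz = 0.0, -0.0
assert pz == nz
assert math.copysign(1.0, nz) == -1.0   # the sign bit survives

# Infinity from overflow; arithmetic with it is well defined.
inf = float("inf")
assert 1e308 * 10 == inf
assert 1.0 / inf == 0.0

# NaN: unequal to everything, including itself.
# (Python raises on 0.0 / 0.0, so construct NaN directly.)
nan = float("nan")
assert nan != nan
assert math.isnan(nan)

# Smallest positive subnormal double, ~5e-324.
tiny = 5e-324
assert 0.0 < tiny < 2.3e-308
```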
Precision tips
- Kahan summation: carries a running error term to recover lost precision.
- Sort before summing: when terms share a sign, add them in order of increasing magnitude so small values aren't swamped by a large running total.
- Use log-sum-exp: log(Σ e^x) = max(x) + log(Σ e^(x−max)) to avoid overflow.
- Fused multiply-add (FMA): computes a·b + c with one rounding instead of two; available via std::fma or hardware instructions.
- Higher precision temporaries: do intermediate math in FP64 even if inputs/outputs are FP32.
- Compensated algorithms: Dekker, Neumaier, etc. trade speed for accuracy.
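Two of these tips sketched in Python (kahan_sum and logsumexp are illustrative names, not standard functions; the stdlib's math.fsum already provides compensated summation):

```python
import math

def kahan_sum(values):
    """Kahan compensated summation: track lost low-order bits in c."""
    total = 0.0
    c = 0.0                   # running compensation term
    for v in values:
        y = v - c             # re-inject previously lost bits
        t = total + y         # big + small: low bits of y may be lost...
        c = (t - total) - y   # ...recover them algebraically
        total = t
    return total

def logsumexp(xs):
    """log(sum(exp(x))) without overflow, by factoring out max(x)."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

# Naive summation drifts; compensated summation recovers the
# correctly rounded result.
naive = sum([0.1] * 10)        # 0.9999999999999999
exact = kahan_sum([0.1] * 10)  # 1.0

# exp(1000) overflows on its own, but logsumexp stays finite.
lse = logsumexp([1000.0, 1000.0])   # 1000 + log(2)
```

For production code, math.fsum (Python) gives a correctly rounded sum without hand-rolling the compensation loop.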
When to avoid floats
- Money: use fixed-point or decimal types (BigDecimal, money libs).
- Exact comparisons: integers are exact; use scaled integers for deterministic arithmetic.
- Cryptographic code: constant-time integer operations only.
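For money, Python's decimal module illustrates the idea; the price, quantity, and 8.25% tax rate below are made-up example values:

```python
from decimal import Decimal, ROUND_HALF_UP

# Binary floats cannot represent most decimal cent values exactly.
assert 0.1 + 0.2 != 0.3

# Decimal arithmetic is exact for decimal amounts.
price = Decimal("19.99")
qty = 3
total = price * qty
assert total == Decimal("59.97")

# Rounding is explicit and well defined (here: half-up to cents).
tax = (total * Decimal("0.0825")).quantize(
    Decimal("0.01"), rounding=ROUND_HALF_UP
)
assert tax == Decimal("4.95")
```

Construct Decimals from strings (not floats), or the binary representation error is baked in before the decimal math starts.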