When is round-trip floating point radix conversion exact?

Suppose you store a floating point number in memory, print it out in human-readable base 10, and read it back in.

When can the original number be recovered exactly?

D. W. Matula answered this question more generally in 1968 [1].

Suppose we start with base β with p places of precision and convert to base γ with q places of precision, rounding to nearest, then convert back to the original base β.

Matula’s theorem says that if there are no positive integers i and j such that β^i = γ^j, then a necessary and sufficient condition for the round-trip to be exact (assuming no overflow or underflow) is that γ^(q-1) > β^p.

In the case of floating point numbers (type double in C) we have β = 2 and p = 53.

(See Anatomy of a floating point number.)

We’re printing to base γ = 10.

No positive power of 10 is also a power of 2, so Matula’s condition on the two bases holds.

If we print out q = 17 decimal places, then 10^16 > 2^53, and so round-trip conversion will be exact if both conversions round to nearest.

If q is any smaller, some round-trip conversions will not be exact.
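
To see the difference between 16 and 17 digits concretely, here is a minimal C sketch (my own, not from the original post). It uses 0.1 + 0.2 as the test value because that particular double needs all 17 significant digits; on an IEEE 754 system where printf and strtod round to nearest, the 17-digit round trip is exact while the 16-digit one is not for this value.

```c
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    double x = 0.1 + 0.2;   /* a double whose shortest round-tripping decimal has 17 digits */
    char buf[64];

    /* q = 17: print 17 significant decimal digits, then read the string back. */
    snprintf(buf, sizeof buf, "%.17g", x);
    double y = strtod(buf, NULL);
    printf("17 digits: %-22s -> %s\n", buf, x == y ? "exact" : "inexact");

    /* q = 16: one digit fewer; for this particular value the round trip fails. */
    snprintf(buf, sizeof buf, "%.16g", x);
    double z = strtod(buf, NULL);
    printf("16 digits: %-22s -> %s\n", buf, x == z ? "exact" : "inexact");

    return 0;
}
```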

You can also verify that for a single precision floating point number (p = 24 bits precision) you need q = 9 decimal digits, and for a quad precision number (p = 113 bits precision) you need q = 36 decimal digits [2].
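
These digit counts follow directly from the inequality: the smallest q with 10^(q-1) > 2^p is floor(p·log10 2) + 2. Here is a small illustrative C snippet (again my own, not from the post) that evaluates this for single, double, and quad precision.

```c
#include <stdio.h>
#include <math.h>

/* Smallest q satisfying 10^(q-1) > 2^p, i.e. q - 1 > p*log10(2).
   Since no power of 2 is a power of 10, p*log10(2) is never an integer,
   so the answer is floor(p*log10(2)) + 2. */
static int digits_needed(int p) {
    return (int)floor(p * log10(2.0)) + 2;
}

int main(void) {
    int precisions[] = { 24, 53, 113 };   /* single, double, quad precision */
    for (int i = 0; i < 3; i++)
        printf("p = %3d bits -> q = %2d decimal digits\n",
               precisions[i], digits_needed(precisions[i]));
    return 0;
}
```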

Looking back at Matula’s theorem, clearly we need γ^q ≥ β^p.

Why? Because the right side is the number of base β fractions and the left side is the number of base γ fractions.

You can’t have a one-to-one map from a larger space into a smaller space.

So the inequality above is necessary, but not sufficient.

However, it’s almost sufficient.

We just need one more base γ figure, i.e. Matula tells us γ^(q-1) > β^p is sufficient.

In terms of base 2 and base 10, we need at least 16 decimals to represent 53 bits, since 2^53 = 9,007,199,254,740,992 lies between 10^15 and 10^16.

The surprising thing is that one more decimal is enough to guarantee that round-trip conversions are exact.

It’s not obvious a priori that any finite number of extra decimals is always enough, but in fact just one more is enough.

More floating point posts:

The hardest logarithm to compute
Floating point exceptions in C++
Sine of a googol

[1] D. W. Matula. In-and-out conversions. Communications of the ACM, 11(1):47–50, January 1968. Cited in Handbook of Floating-point Arithmetic by Jean-Michel Muller et al.

[2] The number of bits allocated for the fractional part of a floating point number is 1 less than the precision: the leading figure is always 1, so IEEE formats save one bit by not storing the leading bit, leaving it implicit.

So, for example, a C double has 53 bits of precision, but only 52 of the 64 bits in a double are allocated to storing the fraction.
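
For what it’s worth, the constants in C’s float.h report these precisions directly. Assuming an IEEE 754 platform and a C11 compiler (for DBL_DECIMAL_DIG), this sketch prints 24, 53, and 17.

```c
#include <stdio.h>
#include <float.h>

int main(void) {
    /* The *_MANT_DIG constants count the implicit leading bit, so they
       report the full precision p even though one fewer bit is stored. */
    printf("float precision:  %d bits\n", FLT_MANT_DIG);   /* 24 on IEEE 754 */
    printf("double precision: %d bits\n", DBL_MANT_DIG);   /* 53 on IEEE 754 */

    /* C11: decimal digits needed for an exact double round trip. */
    printf("digits for exact double round trip: %d\n", DBL_DECIMAL_DIG);  /* 17 */
    return 0;
}
```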

