Consider normalized floating-point numbers with β = 10, p = 3, and e_min = -98. The numbers x = 6.87 × 10^-97 and y = 6.81 × 10^-97 appear to be perfectly ordinary floating-point numbers, which are more than a factor of 10 larger than the smallest floating-point number 1.00 × 10^-98. They have a strange property, however: x ⊖ y = 0 even though x ≠ y! The reason is that x - y = 0.06 × 10^-97 = 6.0 × 10^-99 is too small to be represented as a normalized number, and so must be flushed to zero. How important is it to preserve the property
(10) x = y ⇔ x - y = 0 ?
It's very easy to imagine writing the code fragment, if (x ≠ y) then z = 1/(x-y), and much later having a program fail due to a spurious division by zero. Tracking down bugs like this is frustrating and time-consuming. On a more philosophical level, computer science textbooks often point out that even though it is currently impractical to prove large programs correct, designing programs with the idea of proving them often results in better code. For example, introducing invariants is quite useful, even if they aren't going to be used as part of a proof. Floating-point code is just like any other code: it helps to have provable facts on which to depend. For example, when analyzing formula (6), it was very helpful to know that x/2 < y < 2x ⇒ x ⊖ y = x - y. Similarly, knowing that (10) is true makes writing reliable floating-point code easier. If it is only true for most numbers, it cannot be used to prove anything.
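The failure described above can be reproduced with a minimal sketch built on Python's `decimal` module. The toy format (β = 10, p = 3, e_min = -98) comes from the text; the value `Emax=99` and the helper name `sub_flush_to_zero` are assumptions made for illustration. Because `decimal` itself keeps subnormal results, the flush-to-zero behavior is simulated by hand.

```python
from decimal import Context, Decimal, DivisionByZero

# Toy format from the text: beta = 10, p = 3, e_min = -98.
# Emax = 99 is an assumption; the text does not specify a maximum exponent.
CTX = Context(prec=3, Emin=-98, Emax=99, traps=[DivisionByZero])


def sub_flush_to_zero(x, y):
    """Subtract in the toy format, replacing any subnormal result by zero.

    Hypothetical helper: it mimics a format WITHOUT denormalized numbers.
    """
    d = CTX.subtract(x, y)
    return Decimal(0) if d.is_subnormal(context=CTX) else d


x = Decimal("6.87E-97")
y = Decimal("6.81E-97")
diff = sub_flush_to_zero(x, y)

# The guarded fragment from the text: if (x != y) then z = 1/(x - y).
spurious_failure = False
if x != y:
    try:
        z = CTX.divide(Decimal(1), diff)
    except ZeroDivisionError:
        # The guard passed, yet the computed difference was flushed to zero.
        spurious_failure = True

print(x != y, diff == 0, spurious_failure)  # True True True
```

The guard `x != y` succeeds, but the division still fails, which is exactly the situation property (10) is meant to rule out.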
The IEEE standard uses denormalized[18] numbers, which guarantee (10), as well as other useful relations. They are the most controversial part of the standard and probably accounted for the long delay in getting 754 approved. Most high performance hardware that claims to be IEEE compatible does not support denormalized numbers directly, but rather traps when consuming or producing denormals, and leaves it to software to simulate the IEEE standard.[19] The idea behind denormalized numbers goes back to Goldberg [1967] and is very simple. When the exponent is e_min, the significand does not have to be normalized, so that when β = 10, p = 3 and e_min = -98, 1.00 × 10^-98 is no longer the smallest floating-point number, because 0.98 × 10^-98 is also a floating-point number.
There is a small snag when β = 2 and a hidden bit is being used, since a number with an exponent of e_min will always have a significand greater than or equal to 1.0 because of the implicit leading bit. The solution is similar to that used to represent 0, and is summarized in TABLE D-2. The exponent field value e_min - 1 is used to represent denormals. More formally, if the bits in the significand field are b_1, b_2, ..., b_(p-1), and the value of the exponent is e, then when e > e_min - 1, the number being represented is 1.b_1b_2...b_(p-1) × 2^e, whereas when e = e_min - 1, the number being represented is 0.b_1b_2...b_(p-1) × 2^(e+1). The +1 in the exponent is needed because denormals have an exponent of e_min, not e_min - 1.
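The decoding rule above can be sketched for IEEE double precision, where p = 53 and e_min = -1022. The function name `decode` and the choice to return a tuple are assumptions for illustration; the sketch ignores the sign bit and rejects infinities and NaNs.

```python
import math
import struct

E_MIN = -1022   # minimum exponent of IEEE double precision
BIAS = 1023     # exponent bias of IEEE double precision
FRAC_BITS = 52  # p - 1 stored significand bits (p = 53)


def decode(value):
    """Return (normalized, exponent, significand) for the magnitude of a
    finite double, so that abs(value) == significand * 2**exponent."""
    bits = struct.unpack(">Q", struct.pack(">d", value))[0]
    frac = bits & ((1 << FRAC_BITS) - 1)   # b_1 b_2 ... b_(p-1)
    biased = (bits >> FRAC_BITS) & 0x7FF   # stored exponent field
    if biased == 0x7FF:
        raise ValueError("infinity or NaN has no significand/exponent form")
    e = biased - BIAS
    if e > E_MIN - 1:
        # Normalized: 1.b_1...b_(p-1) x 2^e (hidden leading bit is 1).
        return True, e, 1 + frac / 2**FRAC_BITS
    # e == E_MIN - 1: zero or denormal, 0.b_1...b_(p-1) x 2^(e+1).
    return False, e + 1, frac / 2**FRAC_BITS


for v in (1.0, 2.2250738585072014e-308, 5e-324):
    normalized, exponent, significand = decode(v)
    # Rebuilding the value from its fields is exact for every finite double.
    assert math.ldexp(significand, exponent) == v
    print(v, normalized, exponent, significand)
```

Note that both branches divide by a power of two, so the significand is computed without rounding, and the denormal branch reports the exponent e_min rather than the stored e_min - 1.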
Recall the example of β = 10, p = 3, e_min = -98, x = 6.87 × 10^-97 and y = 6.81 × 10^-97 presented at the beginning of this section. With denormals, x - y does not flush to zero but is instead represented by the denormalized number 0.6 × 10^-98. This behavior is called gradual underflow. It is easy to verify that (10) always holds when using gradual underflow.
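As a rough check of this claim, Python's `decimal` module can model the same toy format, since its arithmetic keeps subnormal results instead of flushing them. The sketch below assumes `Emax=99` (not given in the text) and only sweeps the values nearest to underflow, so it illustrates property (10) rather than proving it.

```python
from decimal import Context, Decimal, Subnormal
from itertools import combinations

# Toy format from the text: beta = 10, p = 3, e_min = -98.
# Emax = 99 is an assumption; the text does not specify a maximum exponent.
CTX = Context(prec=3, Emin=-98, Emax=99)

x = Decimal("6.87E-97")
y = Decimal("6.81E-97")

# With gradual underflow the difference survives as a denormalized number.
diff = CTX.subtract(x, y)
print(diff)                     # 6E-99, i.e. 0.6 x 10^-98
print(CTX.is_subnormal(diff))   # True
print(CTX.flags[Subnormal])     # True: the operation produced a subnormal

# Sweep every value c x 10^-100 for c = 1..999. Coefficients 1..99 are the
# denormals; 100..999 are the normalized numbers 1.00..9.99 x 10^-98.
values = [Decimal(f"{c}E-100") for c in range(1, 1000)]
property_10_holds = all(
    CTX.subtract(a, b) != 0 for a, b in combinations(values, 2)
)
print(property_10_holds)        # True: distinct values never subtract to 0
```

Every difference in the sweep is a multiple of 10^-100 with fewer than four digits, so it is representable exactly, which is why no pair of distinct values collapses to zero.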