(Last Mod: 22 January 2012 13:55:31 )
The computer-based representation of floating point values involves many issues ranging from the obvious to the very subtle. Historically, computer hardware and software developers devised proprietary representations based on the capabilities of their hardware and the needs of their applications. Most of these representations were reasonably well thought out and met their needs, but some had significant shortcomings that required major software maintenance efforts to overcome after the fact. Out of this environment grew a standardized means of representing floating point values, one that incorporated a great deal of input in order to address many practical issues as well as some issues from the realm of number theory. This representation is embodied in a standard developed by the Institute of Electrical and Electronics Engineers (IEEE) and published as the IEEE-754 Floating Point Standard. Nearly all modern computers that support floating point operations use this standard - but there are exceptions. These exceptions generally arise out of a need to support older representations or hardware that is incompatible with the IEEE-754 representation. While possible, it is unlikely you will ever be called upon to deal with these other representations.
Instead of simply presenting the IEEE-754 representation along with a bunch of rules for how to work with it, we will develop the representation from scratch. The scenario is that we are naive engineers who have been called upon to develop a floating point representation having certain characteristics. We will approach the problem from the standpoint of people who are familiar with positional numbering systems and computer-based representation of integers. As such, we will make several attempts to develop a successful representation and, at each stage, will identify the key shortcomings of the present attempt and devise a way to overcome them in the next. In the end, we will have a representation that embodies nearly all of the features of the IEEE-754 standard.
It should be kept in mind, however, that no claim is being made that the development steps presented here reflect the actual development of the IEEE-754 standard in any way. When the IEEE-754 committee began meeting, most (if not all) of these issues were well known and ways to deal with them, at least separately, were well understood.
The representation of floating point values is not particularly straightforward as it embodies some rather interesting properties. But once those properties are understood, the peculiarities of the representation become quite understandable.
First, let's start with a basic floating point representation and explore its deficiencies. Then we will move toward the accepted representation as a result of addressing these deficiencies.
Recall that a floating point value consists of two parts - the mantissa and the exponent. If our representation is to fit into a fixed number of bits - which it must do - then we must decide how many bits to use for the mantissa and how many bits to use for the exponent.
As with most things in engineering, this decision involves a compromise. To understand this compromise and make it in an informed way, we need to understand the properties that are being traded off against each other. In the case of the mantissa and the exponent, we are trading static range for dynamic range, both of which will be described shortly. The more bits we allocate to the mantissa, the better the dynamic range, but this is achieved at the expense of the static range. Conversely, we can improve static range by increasing the number of bits used for the exponent if we are willing to accept a more restricted dynamic range. The only way to avoid this tradeoff is to increase the total number of bits used in the overall representation.
For the purposes of our discussions below on static and dynamic range, let's use the decimal system and stipulate that the mantissa can consist of at most six digits, while the exponent can consist of at most two digits.
The "static range" of a system is the ratio of the (magnitude of the) largest value that can be represented to the (magnitude of the) smallest value that can be represented.
Using our example system, the largest value we could represent would be
999999 x 10^{99}
while the smallest would be
0.00001 x 10^{-99}
The static range is therefore:
(999999 x 10^{99}) / (0.00001 x 10^{-99}) = 9.99999 x 10^{208} ~= 10^{209}
Note that, at this point, we have not imposed any requirement regarding whether our representation requires the mantissa to be normalized or not.
While the static range of the system is a useful quantity, it doesn't tell us everything. In particular, we certainly can't represent a change in a value to anything approaching the static range of the system.
For instance, in the above system we can represent both the size of the U.S. Public Debt (something on the order of seven trillion dollars), and the value of a penny very nicely. But we can't represent a change in the Public Debt of only a penny. In fact, the smallest change in the Public Debt that we could represent is ten million dollars!
So we need another measure that tells us the resolution of our representation. This is tightly related to the concept of "significant figures" that most people learned about in their early chemistry and physics courses. The basic measure of resolution is to ask how close can two numbers be to each other and still be represented as different values? The dynamic range will therefore be defined as the ratio of a number to the smallest change that can be represented in that number.
Using our example system, we could look at the difference between 1.00000 and 1.00001 or the difference between 9.99999 and 9.99998. In either case, the amount by which they differ would be 0.00001. But the calculated dynamic range would vary by a factor of ten depending on which pair of values we chose to use. If we use the first pair we would claim a dynamic range of essentially 100,000, or five digits (10^{5}). But if we use the latter pair we could claim a dynamic range of 1,000,000 or six digits.
A marketing representative would almost certainly elect to use the larger value because it allows them to claim a higher level of accuracy even though, under practical conditions, the actual accuracy attainable is always less. But engineers generally prefer to use the smaller value because then the resulting claimed accuracy is a minimum level that is always achieved. Using this as a guideline, the commonly accepted way of expressing dynamic range in a floating point value is to express the smallest value, typically called epsilon, such that 1 and 1+epsilon are two distinct values. So in our example system epsilon would be 0.00001 or five digits of dynamic range.
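The definition of epsilon above can be made concrete with a short sketch. The Python below (the function name is my own) computes epsilon both for the six-digit decimal example system and, using the same "smallest eps such that 1 and 1 + eps are distinct" definition, for Python's native binary floats.

```python
# A minimal sketch of the epsilon definition from the text: the smallest
# value eps such that 1 and 1 + eps are two distinct representable values.

def epsilon_for_digits(digits):
    """Epsilon for a decimal mantissa with `digits` significant digits."""
    return 10.0 ** -(digits - 1)

# Six-digit example system from the text: eps = 0.00001
print(epsilon_for_digits(6))   # 1e-05

# The same idea applied to Python's binary double-precision floats:
# halve eps until adding half of it to 1 no longer changes the result.
eps = 1.0
while 1.0 + eps / 2 != 1.0:
    eps /= 2
print(eps)   # 2.220446049250313e-16 (2**-52)
```

The loop result matches the 52 stored mantissa bits of a double, just as the decimal example's five-digit epsilon matches its six-digit mantissa.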
Now that we understand the concepts of static and dynamic range, we can start building our floating point representation. For simplicity's sake, we will work with a one-byte floating point number and allocate one nibble for the mantissa and one nibble for the exponent.
The first thing we must decide is where the radix point in the mantissa is located. We can choose it to be anywhere, but we must then be consistent with that choice. This is important because the radix point is not stored along with the value - the person writing the value is only going to write the eight bits and the person reading the value only has those eight bits available. The reader has to be able to rely on the writer having placed the radix point at the expected location relative to the bits.
For starters, let's place the radix point at the right of the mantissa and see what the implications are. We can always change it later.
Our first inclination is probably to have both the mantissa and the exponent be two's complement signed values, since we have seen the advantages of the two's complement representation when used with integers.
Floating Point Representation #1
b_{7} | b_{6} | b_{5} | b_{4} | (rp) | b_{3} | b_{2} | b_{1} | b_{0} |
M_{3} | M_{2} | M_{1} | M_{0} | . | E_{3} | E_{2} | E_{1} | E_{0} |
Note: Only the bits labeled b_{n} are actually stored.
Mantissa (m): four-bit signed integer (M_{3}:M_{0})
Exponent (e): four-bit signed integer (E_{3}:E_{0})
Value = m x 2^{e}
It is instructive to fill in some of the resulting values in our table:
pattern | value |
0100 1111 | 4 x 2^{-1} = 2 |
0100 1110 | 4 x 2^{-2} = 1 |
0011 1111 | 3 x 2^{-1} = 1.5 |
0011 1110 | 3 x 2^{-2} = 0.75 |
0010 1111 | 2 x 2^{-1} = 1 |
0010 0000 | 2 x 2^{0} = 2 |
0001 0001 | 1 x 2^{1} = 2 |
0001 0000 | 1 x 2^{0} = 1 |
0000 0000 | 0 x 2^{0} = 0 |
1101 1111 | -3 x 2^{-1} = -1.5 |
1010 1110 | -6 x 2^{-2} = -1.5 |
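The table entries above can be reproduced with a short decoder. This Python sketch (function names are my own) treats the high nibble as the two's-complement mantissa and the low nibble as the two's-complement exponent, per Representation #1.

```python
# Sketch of "Representation #1": high nibble = two's-complement mantissa,
# low nibble = two's-complement exponent, value = m * 2**e.

def twos_complement(nibble):
    """Interpret a 4-bit pattern (0..15) as a two's-complement integer."""
    return nibble - 16 if nibble >= 8 else nibble

def decode_rep1(byte):
    m = twos_complement(byte >> 4)      # mantissa nibble
    e = twos_complement(byte & 0x0F)    # exponent nibble
    return m * 2.0 ** e

# Reproduce rows of the table, showing the uniqueness problem:
print(decode_rep1(0b01001110))   # 4 * 2**-2 = 1.0
print(decode_rep1(0b00101111))   # 2 * 2**-1 = 1.0
print(decode_rep1(0b00010000))   # 1 * 2**0  = 1.0
print(decode_rep1(0b10101110))   # -6 * 2**-2 = -1.5
```

The three different patterns that all decode to 1.0 demonstrate the uniqueness problem discussed next.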
Using this convention, we immediately run into a very serious problem - namely one of uniqueness. With signed integers, there was one troublesome value that stemmed from the fact that we couldn't represent the same number of strictly positive values as strictly negative values. We dealt with it by simply avoiding the use of that one troublesome value. Here the problem is much more severe: there are numerous ways to represent the same value - for instance, the table shows three different representations of the value 1 - and the troublesome values are scattered throughout the set of values.
Another serious problem is one of ordering. The entries in the above table are ordered according to the bit pattern, yet the values they represent bounce all around. This behavior, while not as fundamentally critical as the uniqueness problem, means that doing magnitude comparisons on floating point values will be a nightmare.
In short, this choice of representation is wholly inadequate. But by deciding to try this representation and then looking at its ramifications, we have identified two shortcomings, which allow us to start a list of features that we would like our final representation to possess: every value should have exactly one representation (uniqueness), and the ordering of the bit patterns should match the ordering of the values they represent (ordering).
Since our initial representation lacks the two features identified, if we can identify the reasons why it lacks them we are well on the road to finding ways to modify the representation so that it possesses them.
The problem of uniqueness arises because our representation is not normalized. The term "normalized" means different things in different contexts, but generally it means mapping the values in a set to a preferred range or, in our case, representation. By normalizing the mantissa so that there is exactly one non-zero digit to the left of the radix point we eliminate most (if not all) duplicate representations.
The first thing we have to note is that the binary point is no longer to the right of the last bit in the mantissa. Instead, it is going to be located to the right of the second bit in the mantissa - the first bit is the sign bit and the second bit is the first bit in the magnitude. In our case that means that we have two remaining bits to the right of the binary point. Recall from the discussion on fixed-point numbers that we can think of this conveniently as being a four-bit integer divided by four (2^{2}). Don't forget that we need to adjust the exponents accordingly by adding two to all of them.
Floating Point Representation #2
b_{7} | b_{6} | (rp) | b_{5} | b_{4} | b_{3} | b_{2} | b_{1} | b_{0} |
M_{3} | M_{2} | . | M_{1} | M_{0} | E_{3} | E_{2} | E_{1} | E_{0} |
Note: Only the bits labeled b_{n} are actually stored.
Mantissa (m): four-bit signed integer (M_{3}:M_{0}) divided by 2^{2}
Exponent (e): four-bit signed integer (E_{3}:E_{0})
Value = m x 2^{e}
So how do we normalize a two's complement signed integer?
If the number is positive it is very straightforward. We simply multiply the mantissa by the base, and reduce the exponent by one (which is the same as dividing the value by the base), until the next-to-leftmost digit in the mantissa is a one (remember, the leftmost digit is the sign bit and needs to remain zero if the number is positive). Just as when multiplying a decimal value by ten, you multiply a binary value by two by shifting the bits one place to the left.
Example:
Problem: A value is represented, according to our first attempt's format, as 0010 0000. Find the normalized representation for it
Solution: Looking in the table from the last section we see that this is one of the representations for the value '2'. The first thing we need to do is make the adjustment for the location of the binary point. This is done simply by adding two to the exponent giving us 0010 0010.
The mantissa is 0010 and we can normalize this by multiplying by two giving us 0100 and decreasing the exponent by one making it 0001.
The result is 0100 0001
Check: According to our new format, the value represented by this pattern is:
value = (0100b)/4 x 2^{(0001b)} = ^{4}/_{4} x 2^{1} = 2
If the number is negative things are just as straightforward, but not quite as obvious. Consider how we knew when to stop multiplying by two in the case of a positive value. While it was convenient to say that we stopped as soon as the next-to-leftmost bit was a '1', what we really did was stop as soon as any further multiplication by two would result in an overflow - which is signaled by the corruption of the sign bit. If we apply this same approach to negative values we have a rule similar to that for positive values - namely that we multiply by two until the next-to-leftmost bit becomes a '0'.
Example:
Problem: A value is represented, according to our first attempt's format, as 1101 1111. Find the normalized representation for it
Solution: Looking in the table from the last section we see that this is one of the representations for the value '-1.5'. The first thing we need to do is make the adjustment for the location of the binary point. This is done simply by adding two to the exponent giving us 1101 0001.
The mantissa is 1101 and we can normalize this by multiplying by two giving us 1010 and decreasing the exponent by one making it 0000.
The result is 1010 0000
Check: According to our new format, the value represented by this pattern is:
value = (1010b)/4 x 2^{(0000b)} = ^{-6}/_{4} x 2^{0} = -1.5
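Both worked examples follow the same stopping rule: shift left until the next-to-leftmost bit differs from the sign bit. A Python sketch of that rule (function names are my own, operating on the 4-bit nibbles of our toy format):

```python
# Sketch of normalizing a two's-complement mantissa as described in the
# text: multiply by 2 (shift left) until a further shift would corrupt
# the sign bit, decrementing the exponent once per shift.

def normalize(mantissa, exponent):
    """mantissa, exponent: 4-bit two's-complement patterns (0..15)."""
    if mantissa == 0b0000:                 # zero cannot be normalized
        return mantissa, exponent
    sign = (mantissa >> 3) & 1
    while ((mantissa >> 2) & 1) == sign:   # next-to-leftmost bit
        mantissa = (mantissa << 1) & 0x0F  # multiply by 2
        exponent = (exponent - 1) & 0x0F   # e -= 1, kept as 4-bit pattern
    return mantissa, exponent

# The text's positive example: 0010 with exponent 0010 -> 0100, 0001
print(normalize(0b0010, 0b0010))   # (4, 1)
# The text's negative example: 1101 with exponent 0001 -> 1010, 0000
print(normalize(0b1101, 0b0001))   # (10, 0)
```

Note how the single loop condition handles both signs: for positive values it stops when the next-to-leftmost bit becomes '1', for negative values when it becomes '0'.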
So what are the deficiencies that this version of our floating point representation has?
First, one of our eight bits, b_{6}, is forced to always be equal to '1' when the value is positive and '0' when the value is negative because of the normalization constraint. If it is not, we are right back where we started, with many values having multiple representations. So we need to limit our values and, in the process, we cut in half the number of distinct values that we can represent!
Second, while this representation has addressed the uniqueness problem, it has not addressed the ordering problem at all.
Third, we now have no way to represent zero. We will leave the fix for this problem until after we have addressed the ordering problem.
Fourth, we have made it quite difficult to negate a value because we can't simply take the two's complement of the mantissa. The reason is that the result may not be properly normalized requiring us to renormalize it.
The problem of ordering arises because we placed the mantissa to the left of the exponent even though the exponent exerts, by far, a more dominant influence on the magnitude of the value being represented than does the mantissa - with the exception of the sign bit. The sign of a number is the most dominant influence. If the signs of two numbers are different then the positive number is always larger than the negative number (where size means which value lies further to the right on a normal number line such that 0.01 is larger than -100).
To make our representation reflect this, we are going to split the sign bit off from the rest of the mantissa and place it as the leftmost bit. We'll place the exponent in the next bits, and the remainder of the mantissa will be stored in the rightmost bits. Having the bits in our mantissa distributed across our bit pattern could be annoying (at least for a human working with the representation) but it is not a major problem for the hardware designer.
Another problem we have is keeping track of whether our units bit needs to be a '1' or a '0', since this depends on the sign. We can get around this problem by using a signed binary (sign-magnitude) representation - and in doing so we complete the separation of the sign bit from the mantissa, because now the sign bit has no effect on the magnitude of the mantissa. Our resulting three-bit mantissa is always positive and therefore always has to be normalized to have a leading '1'. A further advantage of using signed binary for the mantissa is that the difficulty we identified with negating a floating point value disappears - we simply negate the sign bit and we are done.
This hasn't eliminated the fact that we have cut the number of representable values in half - but there is a rather clever way around this one. Since, except for the case of a value of zero, we know that the leading bit is always a '1', we simply assume that there is a '1' to the left of the binary point and don't even record it as part of the eight bits that get stored. As a result we get to record an additional mantissa bit from the right side of the binary point. Before getting too excited about this apparent "something for nothing", it should be pointed out that we have not increased the total number of values that we can represent - there are still only, barring any remaining duplicates, 2^{N} of them. Be sure not to forget the assumed '1' that exists to the left of the binary point when working directly with the representations of floating point values. We still have to deal with zero, but we will hold off on that for a bit longer.
Coming back to the ordering concern, we see that if the exponent is represented in a two's complement fashion that we must treat both of the first two bits (the two leftmost bits) as special cases in our magnitude comparison since both S (b_{7}) and E_{3} (b_{6}) are sign bits. But what if the exponents were ordered so that all zeroes represented the smallest (most negative) exponent and all ones represented the largest (most positive) exponent? Now the same comparison operation used for signed integers would work for floating point values - at least as far as the sign bit and the exponent are concerned.
This leads us to use an offset binary representation for the exponent. If we use the most common convention for offset binary, then zero is represented by a leading '1' and the rest of the bits are '0's. Hence to find out the actual exponent, we would first read the exponent bits as a four bit unsigned integer and then subtract eight (1000b). Since the range of a four bit unsigned integer is from zero to fifteen, the range of our exponent becomes negative eight to positive seven.
Floating Point Representation #3
b_{7} | b_{6} | b_{5} | b_{4} | b_{3} | M_{3} | (rp) | b_{2} | b_{1} | b_{0} |
S | E_{3} | E_{2} | E_{1} | E_{0} | 1 | . | M_{2} | M_{1} | M_{0} |
Note: Only the bits labeled b_{n} are actually stored.
Sign Bit (s): S
Mantissa (m): 1 + [three-bit unsigned integer (M_{2}:M_{0}) divided by 2^{3}]
Exponent (e): four-bit offset binary integer (E_{3}:E_{0}) - 1000b
Value = (-1)^{s} x m x 2^{e}
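This specification translates directly into a short decoder. The Python sketch below (function name is my own) applies the sign bit, the offset-binary exponent, and the implied leading '1' exactly as defined above.

```python
# Sketch of "Representation #3": sign bit, 4-bit offset-binary exponent
# (pattern minus 1000b = 8), and 3 stored mantissa bits with an implied
# leading '1' to the left of the binary point.

def decode_rep3(byte):
    s = (byte >> 7) & 1
    e = ((byte >> 3) & 0x0F) - 8    # offset binary: pattern - 1000b
    m = 1 + (byte & 0x07) / 8       # implied '1' plus stored fraction
    return (-1) ** s * m * 2.0 ** e

# The worked example: 0000 0000 -> smallest non-negative value
print(decode_rep3(0b00000000))   # 1 * 2**-8 = 0.00390625 (~1/256)
# The largest value: 0 1111 111 -> 1.875 * 2**7
print(decode_rep3(0b01111111))   # 240.0
```

Notice that the all-zeroes pattern no longer decodes to zero, which is exactly the deficiency discussed next.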
Example:
Problem: What value is represented by 0000 0000?
Solution: Grouping the bits into their logical groupings, we have (0)(0000)(000).
The mantissa is m = 1 + [000b / 2^{3}] = 1
The exponent is e = 0000b - 1000b = -8
The result is (-1)^{0} x 1 x 2^{-8} = ^{1}/_{256} = 0.0039
The above example is the smallest (non-negative) value that we can represent. While this is not extremely close to zero, keep in mind that this is for an 8-bit floating point value. In practice, most floating point representations have sufficient bits in the exponent to represent non-zero numbers that are quite small. For instance, the IEEE Single Precision floating point format uses thirty-two bits with eight bits for the exponent. It would therefore be able to represent numbers as small as 2^{-128} (3 x 10^{-39}) using the representation scheme we have developed thus far. A strong argument can be made that this is sufficiently close to zero for any purpose for which it would be reasonable to use a single precision floating point representation.
So what are the remaining deficiencies that this version of our floating point representation has?
First, we still cannot exactly represent zero. While we might think that we could probably live with this, it turns out to be rather useful from a programming standpoint if our floating point representation can exactly represent at least a limited range of integers, most especially zero. In fact, the people that developed the floating point standards that we are working up to stipulated that the representations had to be able to exactly represent a wide range of integers - in the case of single precision, every integer whose magnitude fits in its 24-bit mantissa.
Second, by using a signed-binary representation for the mantissa the magnitude comparison task has gotten a bit trickier. There is no problem for positive values and we can use the exact same hardware that we did for integer representations. But for negative values we have to add some additional logic for cases where the exponents in the two numbers being compared are the same. Remembering that all engineering involves compromises, we choose to accept the added burden - which is actually quite small - in exchange for the much simpler and faster negation ability. So this is no longer a deficiency, but instead an artifact of the representation that we have chosen to accept.
The sole remaining deficiency that we need to correct is the inability to exactly represent zero. So how do we achieve this?
Keeping in mind that the problem was caused by that hard-coded '1' due to the normalization of the mantissa, we look for a solution by seeing if we can "turn off" that normalization under the conditions where it is causing problems. What if we impose a special case where, for the smallest exponent, we assumed instead that the first digit was a zero? By "denormalizing" the mantissa for this one value of the exponent, we have turned the representation into a fixed point representation.
Floating Point Representation #4
b_{7} | b_{6} | b_{5} | b_{4} | b_{3} | M_{3} | (rp) | b_{2} | b_{1} | b_{0} |
S | E_{3} | E_{2} | E_{1} | E_{0} | ? | . | M_{2} | M_{1} | M_{0} |
Note: Only the bits labeled b_{n} are actually stored.
Sign Bit (s): S
Mantissa (m):
if (E3:E0) = 0000:
m = 0 + [three-bit unsigned integer (M_{2}:M_{0}) divided by 2^{3}]
else:
m = 1 + [three-bit unsigned integer (M_{2}:M_{0}) divided by 2^{3}]
Exponent (e): four-bit offset binary integer (E_{3}:E_{0}) - 1000b
Value = (-1)^{s} x m x 2^{e}
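The only change from the previous decoder is the conditional implied bit. A Python sketch of Representation #4 (function name is my own):

```python
# Sketch of "Representation #4": identical to #3 except that the
# all-zeroes exponent pattern switches the implied bit from 1 to 0,
# giving an exact zero and a block of fixed-point values.

def decode_rep4(byte):
    s = (byte >> 7) & 1
    e_bits = (byte >> 3) & 0x0F
    lead = 0 if e_bits == 0 else 1      # implied bit depends on exponent
    m = lead + (byte & 0x07) / 8
    e = e_bits - 8                      # offset binary: pattern - 1000b
    return (-1) ** s * m * 2.0 ** e

print(decode_rep4(0b00000000))   # exactly 0.0
print(decode_rep4(0b00000001))   # 0.125 * 2**-8 = 0.00048828125
print(decode_rep4(0b10000000))   # -0.0 ("negative zero" artifact)
```

The last line shows the positive/negative zero artifact discussed below: the sign bit survives even when every other bit is zero.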
Now if every bit in the mantissa is zero - and if we use the smallest (most negative) value for the exponent since that is the only time we assume a leading '0' on the mantissa - we have an exact representation for zero. Furthermore, because we used an offset binary representation for the exponent, the smallest exponent is represented by all 0's and hence it turns out that a pattern of all zeroes is exactly zero regardless of whether we are dealing with unsigned integers, signed integers, or floating point values.
It should also be pointed out that we still have exactly zero even if the sign bit is a '1'. So we have both a "positive zero" and a "negative zero". This is an artifact of the representation that could potentially cause problems if it is not kept in mind under certain circumstances.
Another nice result is that by having the fixed point representation available to us for the smallest values we increase the static range considerably - namely by a factor of two for every bit actually stored in the mantissa (a factor of eight here). For instance, in our current representation the smallest positive value we can represent is now:
Example:
Problem: What is the smallest strictly positive value that can be represented?
Solution: The pattern for this value would be 0000 0001.
Grouping the bits into their logical groupings, we have (0)(0000)(001).
The mantissa is m = 0 + [001b / 2^{3}] = 0.125
The exponent is e = 0000b - 1000b = -8
The result is (-1)^{0} x 0.125 x 2^{-8} = ^{1}/_{8} x ^{1}/_{256} = 0.00049
Notice that this value is smaller than the smallest representable value in our previous version by a factor of eight.
Life appears good but we have introduced a new problem. We have created a gap in the range of numbers that we can represent because we jump from:
0000 0111 = (0)(0000)(111) = 0.111_{2} x 2^{-8} = 0.003418
to
0000 1000 = (0)(0001)(000) = 1.000_{2} x 2^{-7} = 2^{-7} = 0.007813
as we change from the smallest exponent to the second smallest exponent. That's a change by more than a full factor of two without being able to represent any value in between. Consider the consequences for our dynamic range at this point.
To see this gap more closely, consider the next values on either side of it:
0000 0110 = (0)(0000)(110) = 0.110_{2} x 2^{-8} = 0.002930
0000 0111 = (0)(0000)(111) = 0.111_{2} x 2^{-8} = 0.003418
0000 1000 = (0)(0001)(000) = 1.000_{2} x 2^{-7} = 2^{-7} = 0.007813
0000 1001 = (0)(0001)(001) = 1.001_{2} x 2^{-7} = 2^{-7} = 0.008789
The difference between the first pair - the largest pair where both numbers are below the gap - is 0.000488 (2^{-11}) and the difference between the last pair - the smallest pair where both numbers are above the gap - is 0.000977 (2^{-10}). But the difference between the middle pair - the pair that spans the gap - is 0.004395 (basically 2^{-8}).
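The gap can be seen directly by decoding the four patterns in sequence. A self-contained Python sketch (the decoder mirrors Representation #4 as specified above; the function name is my own):

```python
# Sketch showing the gap at the boundary between the denormalized block
# (exponent pattern 0000) and the smallest normalized block (0001) in
# "Representation #4".

def decode_rep4(byte):
    s = (byte >> 7) & 1
    e_bits = (byte >> 3) & 0x0F
    lead = 0 if e_bits == 0 else 1
    m = lead + (byte & 0x07) / 8
    return (-1) ** s * m * 2.0 ** (e_bits - 8)

for pattern in range(0b00000110, 0b00001010):
    print(f"{pattern:08b} -> {decode_rep4(pattern)}")
# The step below the gap is 2**-11, the step above it is 2**-10, but the
# step across the 0000 0111 -> 0000 1000 boundary is roughly 2**-8.
```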
So we still have a deficiency in our representation. Like the inability to exactly represent zero, it might be tempting to ignore this problem and, in most applications, we would probably never encounter a problem as a result. But, like the inability to exactly represent zero, it turns out that we do not want to leave this deficiency uncorrected because, sooner or later, it will come back to haunt us.
The solution to our remaining deficiency - the loss of dynamic range as we cross the gap from an implied '0' to an implied '1' for the lead bit of the mantissa - is to have the two smallest exponent representations represent the same actual exponent. The difference is that the smallest representation has an implied '0' for the first bit of the mantissa while the second smallest, as well as all others, has an implied '1' for this bit.
Floating Point Representation #5
b_{7} | b_{6} | b_{5} | b_{4} | b_{3} | M_{3} | (rp) | b_{2} | b_{1} | b_{0} |
S | E_{3} | E_{2} | E_{1} | E_{0} | ? | . | M_{2} | M_{1} | M_{0} |
Note: Only the bits labeled b_{n} are actually stored.
Sign Bit (s): S
Mantissa (m):
if (E3:E0) = 0000:
m = 0 + [three-bit unsigned integer (M_{2}:M_{0}) divided by 2^{3}]
else:
m = 1 + [three-bit unsigned integer (M_{2}:M_{0}) divided by 2^{3}]
Exponent (e):
if (E3:E0) = 0000:
e = 0001b - 1000b = -7 (the smallest pattern shares the actual exponent of the second smallest)
else:
e = four-bit offset binary integer (E_{3}:E_{0}) - 1000b
Value = (-1)^{s} x m x 2^{e}
It should be noted that the values having an implied zero in their representation do not have as good a dynamic range as the rest of the floating point values. The reason becomes fairly evident when we consider that a number with an implied one as the leftmost bit in the mantissa always has the full number of bits in the stored mantissa to draw upon as significant bits. But for a number having an implied zero as the leftmost bit, the number of significant bits is reduced by the number of stored zeroes at the left of the mantissa that are there only to act as place keepers in the fixed-point representation used for these values.
This feature - namely the extended static range due to denormalizing the representation - is often referred to as "gradual underflow" or "graceful underflow". The term "underflow" means that as we work with smaller and smaller values we eventually end up with a value of zero regardless of what the actual value should be. If we are near the limits of the normalized representations and divide the number (by a value greater than one) then we would immediately have underflow if a normalized representation were required. In doing so, we would have lost all data associated with that value. By permitting the representation to switch to a non-normalized representation, we still retain most of the value although we do start losing significant figures. This is the gradual underflow.
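Gradual underflow can be demonstrated by repeatedly halving a small value under Representation #5. The Python sketch below (function name mine; the hand-picked patterns follow the halving) decodes the two smallest exponent blocks with their shared exponent of -7.

```python
# Sketch of gradual underflow under "Representation #5": the exponent
# patterns 0000 and 0001 share the actual exponent -7, with implied
# mantissa bits 0 and 1 respectively. Halving a small value now loses
# one significant bit at a time instead of collapsing straight to zero.

def decode_rep5(byte):
    s = (byte >> 7) & 1
    e_bits = (byte >> 3) & 0x0F
    lead = 0 if e_bits == 0 else 1
    e = -7 if e_bits == 0 else e_bits - 8   # 0000 shares -7 with 0001
    m = lead + (byte & 0x07) / 8
    return (-1) ** s * m * 2.0 ** e

# Halving the smallest normalized value, 1.000b * 2**-7, step by step:
for pattern in (0b00001000, 0b00000100, 0b00000010, 0b00000001, 0b00000000):
    print(f"{pattern:08b} -> {decode_rep5(pattern)}")
# Each pattern decodes to exactly half the previous one, reaching 0.0
# only after the last significant bit has been shifted out.
```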
Our representation above is very close to the actual floating point representation specified by the IEEE-754 standard. In fact, there are really only two differences.
First, because of the increase in static range at the small end of the representable values, the IEEE elected to represent an exponent of zero by the pattern having a leading '0' and the rest of the bits '1'. This is only a shift of one in the pattern from what we developed above.
So, for our 8-bit representation above, it means that an exponent of zero is represented as 0111 instead of 1000. Therefore to get the actual exponent we would subtract seven (0111) from all exponents (when initially interpreted as an unsigned integer) except for the smallest one (pattern 0000). Since the next-to-smallest exponent (0001) has an actual value of:
(0001_{2}) - 7_{10} = -6_{10}
the smallest exponent (0000) will have this same value.
The second difference is that the IEEE chose to interpret some of the patterns as special codes instead of as floating point values. This permits error detection and for errors, such as overflow conditions and division by zero, to propagate their way through a series of floating point operations so that they may be detected in the final result.
To do this, they used the largest exponent (all 1's) as a special flag - which is another reason they shifted the offset for the exponent downward. With this exponent, if the mantissa is exactly zero then the interpretation is that the result is "infinity", with "plus infinity" and "minus infinity" being distinguishable by the sign bit. For any non-zero mantissa the interpretation is that the result is "Not a Number" or "NaN" for short. This might result from trying to take the square root of a negative number, for instance.
If the IEEE recognized an 8-bit floating point format, the following would be its specification.
IEEE-compliant Floating Point Representation
b_{7} | b_{6} | b_{5} | b_{4} | b_{3} | M_{3} | (rp) | b_{2} | b_{1} | b_{0} |
S | E_{3} | E_{2} | E_{1} | E_{0} | ? | . | M_{2} | M_{1} | M_{0} |
Note: Only the bits labeled b_{n} are actually stored.
Sign Bit (s): S
if (E3:E0) = 1111:
(0)(1111)(000) = +infinity
(1)(1111)(000) = -infinity
any other pattern is NaN
if (E3:E0) = 0000:
Mantissa (m): m = 0 + [three-bit unsigned integer (M_{2}:M_{0}) divided by 2^{3}]
Exponent (e): e = 0001b - 0111b = -6
otherwise:
Mantissa (m): m = 1 + [three-bit unsigned integer (M_{2}:M_{0}) divided by 2^{3}]
Exponent (e): e = four-bit offset binary integer (E_{3}:E_{0}) - 0111b
Value = (-1)^{s} x m x 2^{e} (value not defined for +/- infinity or NaN)
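A complete decoder for this hypothetical 8-bit format, including the special codes, is short enough to write out in full. A Python sketch (function name is my own):

```python
import math

# Sketch of the hypothetical 8-bit "IEEE-compliant" format: exponent
# offset of 7, denormals sharing exponent -6 with pattern 0001, and the
# all-ones exponent reserved for infinities (mantissa 000) and NaN.

def decode_ieee8(byte):
    s = (byte >> 7) & 1
    e_bits = (byte >> 3) & 0x0F
    m_bits = byte & 0x07
    if e_bits == 0x0F:                      # special codes
        if m_bits == 0:
            return math.inf if s == 0 else -math.inf
        return math.nan                     # any non-zero mantissa
    if e_bits == 0:                         # denormalized
        m, e = m_bits / 8, -6               # shares -6 with pattern 0001
    else:                                   # normalized
        m, e = 1 + m_bits / 8, e_bits - 7   # offset of 7, not 8
    return (-1) ** s * m * 2.0 ** e

print(decode_ieee8(0b01111000))   # inf
print(decode_ieee8(0b11111001))   # nan
print(decode_ieee8(0b00111000))   # 1.0 (exponent pattern 0111 -> 0)
print(decode_ieee8(0b00000000))   # 0.0
```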
So let's see if we can use this knowledge to interpret and create floating point representations for one of the actual IEEE-754 formats.
The single precision format has a total of 32 bits including an 8-bit exponent.
This short description, plus the knowledge gained up to this point, is all that we really need to know.
We know that:
- the leftmost bit is the sign bit, the next eight bits are the exponent, and the remaining twenty-three bits are the stored portion of the mantissa.
Furthermore, we know the following about the exponent:
- it is stored in offset binary with an exponent of zero represented by 0111 1111, so the actual exponent is the stored pattern (read as an unsigned integer) minus 127;
- the all-zeroes pattern shares the actual exponent of the pattern 0000 0001, namely -126, and marks a denormalized mantissa;
- the all-ones pattern is reserved for the infinities and NaN.
And we know the following about the mantissa:
- unless the exponent pattern is all zeroes, there is an implied '1' to the left of the binary point, and the twenty-three stored bits lie to its right.
Finally, we know we can work with the mantissa as a fixed point value:
- the full 24-bit mantissa can be treated as a 24-bit unsigned integer divided by 2^{23}.
Most people have a lot of difficulty determining the floating point representation for a number or what number is represented by a particular floating point pattern. By understanding what the representation means, in the context of how it was built up, combined with an understanding of how to work with fixed point and offset binary representations, this process becomes very straightforward.
Example:
Problem: Convert π to its IEEE-754 Single Precision format:
The number is positive, so the sign bit is zero.
To get the bits in the mantissa, we first multiply the value being converted by 2^{23}:
Using a calculator, 2^{23}π is 26353589 to the nearest integer.
The hexadecimal representation of this integer is 0x01921FB5
The binary pattern for this is: 0000 0001 1001 0010 0001 1111 1011 0101
This pattern has 25 significant bits. To get it down to 24 (a leading '1' plus the 23 bits that will get stored) we have to divide it by 2, rounding to the nearest integer; the exponent then has to increase by 1 to counteract the division. Dividing 26,353,589 by 2 gives 13,176,794.5, which rounds to 13,176,795 = 0x00C90FDB.
An exponent of zero has a pattern of 0111 1111, so an exponent of 1 has a pattern of 1000 0000.
The final binary pattern is therefore
π = 0100 0000 0100 1001 0000 1111 1101 1011 = 0x40490FDB
Example:
Problem: Convert the result of the previous example back into a decimal value:
Breaking the representation into its sign, exponent, and mantissa components:
(0) (1000 0000) (?100) (1001) (0000) (1111) (1101) (1011)
The final quantity is positive.
The question mark represents the implied bit. Since the exponent pattern is not zero, the implied bit is 1.
The full mantissa is therefore:
(1100) (1001) (0000) (1111) (1101) (1011) = 0xC90FDB
Converting this to a decimal integer: 0xC90FDB = 13,176,795
The exponent is: (1000 0000_{2}) - 127 = 0x80 - 127 = 128 - 127 = 1
The final value is v = (13,176,795 / 2^{23}) x 2^{1} = 13,176,795 / 2^{22} = 3.1415927
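Both worked examples can be checked mechanically. The sketch below (our own, using Python's struct module for the final comparison) follows the same steps: scale π so that exactly 24 significant bits remain, round to the nearest integer as the IEEE default rounding mode calls for, and assemble the fields:

```python
import math
import struct

# Scale pi so that exactly 24 significant bits sit to the left of the
# radix point: 2^23 * pi needs 25 bits, so scale by 2^22 instead and
# let the exponent absorb the difference (exponent = 1).
mant = round(math.pi * 2 ** 22)    # 13,176,795: a 24-bit integer
assert mant.bit_length() == 24
exponent = 1

# Assemble sign (0), offset exponent (1 + 127 = 128), and the 23 stored
# bits (the full mantissa minus its implied leading '1').
bits = (exponent + 127) << 23 | (mant - 2 ** 23)
assert hex(bits) == "0x40490fdb"

# Decoding the pattern back recovers the value of the second example.
(value,) = struct.unpack(">f", struct.pack(">I", bits))
assert value == (mant / 2 ** 23) * 2 ** exponent   # 3.1415927410...
```

As a cross-check, `struct.pack(">f", math.pi)` produces the same four bytes, confirming that this is Python's (and the IEEE default rounding mode's) single precision representation of π.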
The double precision format has a total of 64 bits including an 11-bit exponent.
Based upon our understanding of the IEEE-754 floating point representation scheme, answer the following questions (in decimal):
1. What is the smallest positive value that can be represented?
2. What is the largest positive value that can be represented?
3. What is the smallest amount by which the value one can be changed?
4. What is the static range of the representation?
5. What is the dynamic range of the representation?
6. How many significant figures does the representation support?
7. What is the largest integer n such that all integers from 0 through n can be represented exactly?
As we did in the single precision case, we will first summarize the characteristics of this representation. But this time we will be a little more explicit in showing the link between our understanding of the desired characteristics and how we can use that understanding to build up the details of the representation from scratch:
We know that the whole representation is based upon an exponential format where we have a mantissa and an exponent.
Desired Feature: The ordering of values should depend on the bits from left to right, just as in an integer.
From this we know that:
- The sign bit is the left most bit, followed by the exponent bits, followed by the stored mantissa bits.
Because the sign bit and the mantissa are separated from each other, it makes sense that:
- The representation is signed magnitude: the exponent and mantissa encode the magnitude, and the sign bit simply supplies the factor (-1)^{s}.
Since we want the ordering to proceed from the most negative exponent to the most positive exponent, we know that:
- The exponent must be stored in offset binary, so that larger stored patterns mean larger exponents.
We would normally expect an exponent of zero to be stored as 1...000, but the IEEE committee chose to use a value one less than this. We can remember this little fact a number of ways. We can remember that an exponent of "1" is represented as a leading '1' followed by all '0's. Alternatively, we can remember that, for an n-bit exponent, zero is stored as the largest binary pattern that fits in n-1 bits. Yet another way is to recognize that the value we subtract off also happens to equal the largest exponent we can represent (after taking into account the fact that an exponent of all ones is a special flag value and not an actual exponent). Whatever makes the most sense to you is what you should use. The end result is that for this representation:
- The 11-bit exponent is stored with an offset of 1023: stored pattern = e + 1023.
Likewise, the desire for proper ordering requires that we use a normalized exponential notation such that:
- 1 <= m < 2, so the bit to the left of the radix point is always '1' and need not be stored.
Because the stored part of the mantissa is a fixed point value with the binary point all the way to the left:
- The 52 stored bits, interpreted as an unsigned integer, equal the fractional part of the mantissa multiplied by 2^{52}.
Desired Feature: We want to be able to exactly represent zero.
This leads us to de-normalize the mantissa for the smallest exponent, with the following result:
- When the exponent pattern is all zeroes, the implied bit is '0' instead of '1' (and e = -1022); an all-zero mantissa then represents exactly zero.
Desired Feature: We want to be able to trap special error conditions.
Since an exponent pattern of all zeroes is used to represent zero, we use the exponent pattern from the opposite end as our flag value:
- An exponent pattern of all ones flags a special value rather than an actual exponent.
One of the things we want to flag is infinity. We represent this case with a mantissa of all zeroes. We can remember this because, at least compared to NaN, infinity can almost be thought of as a specific value and therefore should have a specific representation. We can also think of it as the result of dividing by zero, and so we use a mantissa of zero. Again, use whatever makes sense to you. The end result is that:
- An all-ones exponent with an all-zero mantissa represents +/- infinity (distinguished by the sign bit); any other mantissa represents NaN.
So now let's answer our questions.
1. This would be an all zero exponent and a single 1 at the right of the mantissa. So
value = 1 x 2^{-52} x 2^{-1022} = 2^{-1074} = 4.94 x 10^{-324}
2. The mantissa is all ones which, combined with an implied leading '1', is essentially 2.0.
The exponent is all ones except the last bit, so it is (2^{11} - 2) - 1023 = 1023.
value = 2.0 x 2^{1023} = 2^{1024} = 1.80 x 10^{308}
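Both answers are easy to confirm by building the extreme bit patterns directly (a sketch using Python's struct module; the names tiny and huge are ours). Note that the largest finite value is actually (2 - 2^{-52}) x 2^{1023}, which is why treating the mantissa as "essentially 2.0" gives the slight overestimate 2^{1024}:

```python
import struct
import sys

# Question 1: exponent bits all 0's, mantissa 00...01 (a denormal).
tiny = struct.unpack(">d", bytes.fromhex("0000000000000001"))[0]
assert tiny == 2.0 ** -1074            # about 4.94e-324

# Question 2: exponent 111...10 (stored 2046, i.e. e = 1023), mantissa all 1's.
huge = struct.unpack(">d", bytes.fromhex("7fefffffffffffff"))[0]
assert huge == sys.float_info.max      # about 1.80e308
```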
3. The value one has just the implied '1' and a stored mantissa of all zeroes. The smallest change we can make is to set the right most mantissa bit to one. This is a change in value of:
x = 1 x 2^{-52} = 2.22 x 10^{-16}
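This quantity is exactly the machine epsilon that Python exposes, which makes for a quick check (a sketch; eps is our own name):

```python
import sys

# The gap between 1.0 and the next representable double is 2^-52,
# which Python reports as sys.float_info.epsilon.
eps = 2.0 ** -52
assert eps == sys.float_info.epsilon
assert 1.0 + eps > 1.0          # the change registers...
assert 1.0 + eps / 2 == 1.0     # ...but half of it rounds away
```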
4. The static range is the largest value divided by the smallest value (in terms of magnitude):
static range = 2^{1024} / 2^{-1074} = 2^{2098} = 3.6 x 10^{631}
5. The dynamic range is 1/x, where x is the answer to question #3.
dynamic range = 1/x = 2^{52} = 4.5 x 10^{15}
6. By simplified example, a value formatted as 1.00 can resolve changes as small as 0.01, giving it a dynamic range of 10^{2}. So the number of sig figs is roughly the base-10 log of the dynamic range.
sig figs = log_{10}(4.5 x 10^{15}) = 15.7 (roughly sixteen)
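A quick check of this estimate (sig_figs is our own name):

```python
import math
import sys

# log10 of the dynamic range estimates the supported decimal sig figs.
sig_figs = math.log10(2.0 ** 52)
assert 15.6 < sig_figs < 15.7
# Python reports the decimal digits a double is guaranteed to preserve.
assert sys.float_info.dig == 15
```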
7. Any integer representable by the bits in the mantissa (including the implied leading '1') can be represented exactly. The upper limit is approached when we have a mantissa that is all ones and an exponent that is just large enough to make this an integer. If we add one to this, we end up with an all zero mantissa (except for the implied '1') and an incremented exponent. This is still exactly represented. However, the right most bit of the mantissa is now the 2's bit (instead of the 1's bit), so we can no longer add just one and represent the result exactly.
Again using a simplified example, if we had three bits stored in the mantissa:
1.111 x 2^{3} + 1 = 1.000 x 2^{4} = 2^{(1 + number of bits stored in mantissa)}
So, for our present representation, the largest integer in a continuous range of representable integers is
n = 2^{53} = 9,007,199,254,740,992.
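This boundary is easy to demonstrate; note that 2^{53} + 1 falls exactly halfway between two representable values and rounds back down to 2^{53} under the default round-to-nearest-even mode:

```python
# Up through 2^53 every integer has an exact double representation.
n = 2.0 ** 53
assert n == 9007199254740992.0
assert (n - 1.0) + 1.0 == n    # 2^53 - 1 and 2^53 are both exact
assert n + 1.0 == n            # 2^53 + 1 is not: it rounds back to 2^53
assert (n + 2.0) - n == 2.0    # 2^53 + 2 is representable again
```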
A lot of information has been presented in this module. As much as possible, the viewpoint used in presenting the material was from a problem solving approach. This is particularly evident in how the IEEE floating point format was introduced. While most texts present the final result - if they present it at all - along with a recipe on how to use it, such an approach does little to foster your understanding of why the format is the way it is. Without this understanding, you would be relegated to using rote memorization - a tactic that quickly becomes counterproductive in the field of engineering. Even more importantly, for the purposes of this course, such a presentation would have conveyed little about the steps used in solving engineering problems.
The author would like to acknowledge the following individual(s) for their contributions to this module: