IEEE floating point

The IEEE 754 standard (ANSI/IEEE Std 754-1985; international version IEC 60559:1989) defines standard representations for binary floating-point numbers in computers and lays down detailed procedures for performing mathematical operations, in particular for rounding. The exact name of the standard is IEEE Standard for Binary Floating-Point Arithmetic for Microprocessor Systems (ANSI/IEEE Std 754-1985).

The current edition was published as ANSI/IEEE Std 754-2008 in August 2008; beyond 754-1985 it adds two further binary formats and decimal data formats. Furthermore, the IEEE 854-1987 standard, titled IEEE Standard for Radix-Independent Floating-Point Arithmetic, has been fully integrated into IEEE 754-2008.

  • 5.1 Zero
  • 5.2 Normalized number
  • 5.3 Denormalized number
  • 5.4 Infinity
  • 7.1 Arithmetic and square root
  • 7.2 Conversions
  • 7.3 Comparisons
  • 7.4 Recommended operations

Overview

The IEEE 754-1985 standard defines two basic data formats for binary floating-point numbers, with 32 bits (single precision) or 64 bits (double precision) of memory, as well as two extended formats. IEEE 754-2008 covers binary number formats with 16 bits (half precision, a minifloat), 32 bits (single), 64 bits (double), and a new 128-bit format. In addition, decimal representations with 32, 64, and 128 bits were added.

Finally, there have been proposals for and implementations of other number formats that are designed according to the principles of the IEEE 754-1985 standard and are therefore often referred to as IEEE numbers, although, strictly speaking, under the old definition they are not. These include the minifloats adopted in the newer editions, which are intended for training purposes; minifloats with 16 bits are sometimes used in graphics programming. They also include several number formats with more than 64 bits not defined by IEEE 754-1985, such as the 80-bit extended-precision format that IA-32 processors use internally in their classical floating-point unit (FPU).

General

The representation of a floating-point number consists of:

  • Sign s (almost always 1 bit)
  • Base b (b = 2 for binary floating-point numbers in IEEE 754)
  • Exponent e (r bits), not to be confused with the "biased exponent" or the characteristic
  • Mantissa m (p bits), sometimes called the significand

The sign is stored in a bit S, such that S = 0 marks positive numbers and S = 1 marks negative numbers.

The exponent e is derived from the non-negative binary number E (E is sometimes referred to as the characteristic or biased exponent) by subtracting a fixed bias value B: e = E - B. The bias value is calculated as B = 2^(r-1) - 1, where r is the number of exponent bits. The bias value B thus makes it possible to store negative exponents via the unsigned characteristic E, avoiding alternative encodings such as two's complement (see also excess code).
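The bias relationship can be sketched in Python (the function name is illustrative, not from the standard):

```python
def bias(r):
    """Bias value for an exponent field of r bits: B = 2**(r - 1) - 1."""
    return 2 ** (r - 1) - 1

# Single precision has r = 8 exponent bits, double precision r = 11
assert bias(8) == 127
assert bias(11) == 1023

# The stored characteristic E encodes the exponent e = E - B.
# For single precision, E = 131 therefore encodes e = 131 - 127 = 4.
assert 131 - bias(8) == 4
```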

Finally, the mantissa m is a value with 1 ≤ m < 2, computed from the p mantissa bits with value M as m = 1 + M/2^p. Put more simply, one imagines a "1." prepended to the mantissa bit pattern M: m = 1.M.

This is possible because the condition 1 ≤ m < 2 can always be maintained for all representable numbers by normalization (see below). Since the mantissa then always starts with a 1 on the left, this bit no longer needs to be stored, which yields one additional bit of precision.

Special bit patterns are reserved for special cases. To encode these special cases, two exponent values are reserved: the maximum (E = 11...1) and zero (E = 0). With the maximum exponent value, the special cases NaN and ±∞ are encoded. With zero in the exponent, the floating-point 0 and all denormalized values are encoded.

Values outside the normal range (numbers too large in magnitude) are represented as ∞ or -∞. This extension of the range of values often allows useful further calculation after an arithmetic overflow. Besides the number 0 there is also the value -0: while 1/0 yields ∞, 1/(-0) yields -∞. In comparisons, 0 and -0 are not distinguished.
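Python's float is an IEEE 754 double, so this behavior can be observed directly (a small illustrative sketch):

```python
import math

# Overflow beyond the largest finite double rounds to +infinity
assert 1e308 * 10 == math.inf
assert -1e308 * 10 == -math.inf

# Infinity allows a calculation to continue meaningfully after overflow
assert math.inf + 1.0 == math.inf
assert 1.0 / math.inf == 0.0
assert math.inf > 1e308
```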

The values NaN (for "not a number") are used as representations for undefined values. They occur, for example, as results of operations such as ∞ - ∞. NaNs are divided into signaling NaNs (NaNs), for exceptional conditions, and quiet NaNs (NaNq).

As a final special case, denormalized numbers (called subnormal numbers in IEEE 754r) fill the gap between the smallest normalized floating-point number and zero. They are stored like fixed-point numbers and do not have the same precision as normalized numbers. By design, the reciprocal of most of these values overflows to ∞.

Number formats and other specifications of the IEEE 754 standard

IEEE 754 distinguishes four number formats: single precision (single), extended single precision (single extended), double precision (double), and extended double precision (double extended). For the extended formats, only a minimum number of bits is prescribed; the exact number of bits and the bias value are left to the implementer. The basic formats are fully defined.

The number of exponent bits defines the range of representable numbers (see below). The number of mantissa bits determines the accuracy of these numbers.

The last two examples demonstrate a minimal extended format.

For the specified formats, the following restrictions on the respective number ranges result. The numbers smallest in magnitude here are not normalized. The relative distance between two adjacent floating-point numbers is greater than 2^(-p-1) and at most 2^(-p); specifically, the distance (and here also the relative distance) from the floating-point number 1 to the next larger floating-point number is 2^(-p). Decimal digits describes the number of digits of a decimal number that can be stored without loss of accuracy. Owing to the implicit bit, the mantissa is one bit larger than what is stored.
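For the double format, Python exposes these parameters via sys.float_info, which can serve as a cross-check (illustrative sketch):

```python
import sys

assert sys.float_info.mant_dig == 53            # 52 stored mantissa bits + 1 hidden bit
assert sys.float_info.epsilon == 2.0 ** -52     # distance from 1.0 to the next double
assert sys.float_info.min == 2.0 ** -1022       # smallest normalized double
assert sys.float_info.max == (2 - 2.0 ** -52) * 2.0 ** 1023   # largest finite double
```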

The arrangement of the bits of a single-precision number is shown below. The concrete arrangement of the bits in a computer system's memory may differ from this picture and depends on the byte order (little/big endian) and other peculiarities of the computer.

The arrangement sign - exponent - mantissa, in that order, places the floating-point values (within one sign range) in the same order as the signed integer values representable by the same bit patterns. The same operations used for comparing signed integers can therefore be used for comparing floating-point numbers. In short: floating-point numbers can be sorted lexicographically.

Note, however, that for increasingly negative signed integer values the equivalent floating-point value goes toward minus infinity, so the sorting is reversed for negative numbers.
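Because of this sign-magnitude layout, a small bit manipulation restores a total order; a sketch for doubles (the function name is illustrative):

```python
import struct

def sort_key(x):
    """Map a double's bit pattern to an unsigned integer whose numeric
    order matches the floating-point order. The negative range must be
    reversed because floats use sign-magnitude, not two's complement."""
    (i,) = struct.unpack(">Q", struct.pack(">d", x))
    if i >> 63:                      # sign bit set: flip all bits
        return (1 << 64) - 1 - i
    return i | (1 << 63)             # positive: set the top bit

vals = [3.5, -1.25, 0.0, -0.0, 1e300, -1e-300, 2.0]
assert sorted(vals, key=sort_key) == sorted(vals)
```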

Although this article mainly discusses the number formats, the importance of the IEEE 754 standard lies in the fact that, for floating-point numbers, precise rules for

  • Rounding
  • Arithmetic operations
  • Square-root calculation
  • Conversions
  • Exception handling

have been established.

Examples

Conversion: decimal floating point → IEEE 754

The decimal number 18.4 is to be converted into a floating-point number, using the IEEE single-precision format.

1. Convert the decimal number into an unsigned binary fixed-point number

   18/2 = 9  remainder 0  (least significant bit)
    9/2 = 4  remainder 1
    4/2 = 2  remainder 0
    2/2 = 1  remainder 0
    1/2 = 0  remainder 1  (most significant bit)
         → 10010

   0.4 * 2 = 0.8 → 0  (most significant bit)
   0.8 * 2 = 1.6 → 1
   0.6 * 2 = 1.2 → 1
   0.2 * 2 = 0.4 → 0
   0.4 * 2 = 0.8 → 0
   0.8 * 2 = 1.6 → 1  (repeating)
         → 0.0110011001100110011...

   18.4 = 10010.011001100110011...(2)

2. Normalize and determine the exponent

   Bias = one '0' followed by (r-1) '1's; for r = 8: 01111111 (= 127)

   10010.011001100... * 2^(01111111-01111111) =
   1001.0011001100...  * 2^(10000000-01111111) =
   100.10011001100...  * 2^(10000001-01111111) =
   10.010011001100...  * 2^(10000010-01111111) =
   1.0010011001100...   * 2^(10000011-01111111)

   Mantissa: 1.0010011001100...
   Exponent with bias: 10000011

3. Determine the sign bit

   Positive → 0

4. Assemble the floating-point number

   1 bit sign   8 bit exponent   23 bit mantissa
   0            10000011         00100110011001100110011

   The leading 1 before the binary point is left out as a hidden bit: since it is always 1, it does not need to be stored.

Conversion: IEEE 754 floating point → decimal
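This conversion can be cross-checked in Python, whose struct module exposes the raw single-precision encoding (an illustrative sketch covering both directions):

```python
import struct

# Encode 18.4 as an IEEE 754 single and inspect the raw bit pattern
(bits,) = struct.unpack(">I", struct.pack(">f", 18.4))
s = f"{bits:032b}"
assert s[0] == "0"                           # sign
assert s[1:9] == "10000011"                  # biased exponent (= 131)
assert s[9:] == "00100110011001100110011"    # stored mantissa, hidden bit omitted

# Decoding the same 32 bits gives back only an approximation of 18.4,
# since 0.4 is periodic in binary
(value,) = struct.unpack(">f", struct.pack(">I", bits))
assert abs(value - 18.4) < 1e-6              # value ≈ 18.3999996...
```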

Now the floating-point number from above is to be converted back to decimal. Given is the following IEEE 754 number:

   0 10000011 00100110011001100110011

1. Calculate the exponent

   Convert the exponent to decimal: 10000011(2) = 131. Since this is the biased exponent, which was shifted by the bias, the bias is subtracted again: 131 - 127 = 4 is the exponent.

2. Calculate the mantissa

   Since this is a normalized number, we know that it has a 1 before the binary point: 1.00100110011001100110011. Now the binary point is shifted 4 places to the right: 10010.0110011001100110011

3. Convert to decimal

   Integer part: 10010(2) = 18(10). Fractional part: 0.0110011001100110011(2) ≈ 0.39999961853(10).

   (To obtain the value after the binary point, the digits are weighted with negative powers of two, read from left to right: 0*2^-1 + 1*2^-2 + 1*2^-3 + 0*2^-4 + 0*2^-5 + 1*2^-6 + 1*2^-7 + 0*2^-8 + 0*2^-9 + 1*2^-10 + 1*2^-11 + 0*2^-12 + 0*2^-13 + 1*2^-14 + 1*2^-15 + 0*2^-16 + 0*2^-17 + 1*2^-18 + 1*2^-19.)

   Because the binary representation of 0.4 is periodic, it cannot be converted exactly.

4. Sign

   The sign bit is 0, so it is a positive number.

5. Assemble the decimal number

   18.39999961853

Interpretation of the number format

The interpretation depends on the exponent value. In the following, S denotes the value of the sign bit (0 or 1), E the value of the biased exponent as a non-negative integer between 0 and Emax = 11...11 = 2^r - 1, M the value of the mantissa bits as a non-negative number, and B the bias value. The numbers r and p denote the number of exponent and mantissa bits.
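For a normalized number, this interpretation amounts to the value (-1)^S * (1 + M/2^p) * 2^(E-B); a sketch in Python (the function name is illustrative):

```python
def decode_normalized(S, E, M, r=8, p=23):
    """Value of a normalized IEEE 754 binary number from its fields.
    S: sign bit, E: biased exponent, M: mantissa bits as an integer,
    r/p: number of exponent/mantissa bits (defaults: single precision)."""
    B = 2 ** (r - 1) - 1                               # bias value
    return (-1) ** S * (1 + M / 2 ** p) * 2.0 ** (E - B)

# The single-precision example 0 | 10000011 | 00100110011001100110011
v = decode_normalized(0, 0b10000011, 0b00100110011001100110011)
assert abs(v - 18.39999961853) < 1e-9
```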

Zero

Zero represents the signed zero. Numbers that are too small in magnitude to be represented (underflow) are rounded to zero, with their sign retained: small negative numbers are rounded to -0.0, small positive numbers to 0.0. In a direct comparison, however, 0.0 and -0.0 are considered equal.
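Underflow to a signed zero can be observed with Python doubles (illustrative sketch):

```python
import math

pos = 1e-300 * 1e-100        # too small for a double: rounds to +0.0
neg = -1e-300 * 1e-100       # rounds to -0.0, the sign is retained
assert pos == 0.0 and neg == 0.0          # both compare equal to zero
assert math.copysign(1.0, pos) == 1.0
assert math.copysign(1.0, neg) == -1.0    # the sign survived the underflow
```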

Normalized number

The mantissa comprises the first n significant digits of the binary representation of the not-yet-normalized number. The first significant digit is the most significant (i.e. left-most) digit that differs from 0. Since a binary digit other than 0 can only be a 1, this first 1 need not be stored explicitly; according to the IEEE 754 standard, only the digits following it are stored. The first digit is an implicit bit ("hidden bit"), in a sense one bit of "saved" storage space.
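The hidden bit gives single precision 24 significant bits even though only 23 are stored. A sketch using Python's struct module to round values through single precision (the helper name is illustrative):

```python
import struct

def to_single(x):
    """Round a Python float (a double) through IEEE 754 single precision."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

# 2^24 fits exactly in 23 stored bits + 1 hidden bit ...
assert to_single(2.0 ** 24) == 2.0 ** 24
# ... but 2^24 + 1 would need a 25th significant bit and is rounded away
assert to_single(2.0 ** 24 + 1) == 2.0 ** 24
```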

Denormalized number

If a number is too small to be stored in normalized form with the smallest non-zero exponent, it is stored as a "denormalized number". Its interpretation is no longer ±1.mantissa * 2^exponent but ±0.mantissa * 2^de, where de is the smallest "normal" exponent. This makes it possible to fill the gap between the smallest normalized number and zero. However, denormalized numbers have a lower (relative) precision than normalized numbers; the number of significant mantissa digits decreases toward zero.

If the result (or an intermediate result) of a calculation is smaller than the smallest representable number of the finite arithmetic used, it is generally rounded to zero; this is called underflow. Since information is lost in the process, one tries to avoid underflow where possible. The denormalized numbers in IEEE 754 provide a gradual underflow: 2^24 (for single) or 2^53 (for double) values are inserted "around 0", all with the same absolute distance from each other, which without denormalization would not be representable and would lead to underflow.
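Gradual underflow can be seen with Python doubles (illustrative sketch):

```python
import sys

min_normal = sys.float_info.min      # smallest normalized double: 2^-1022
subnormal = min_normal / 4.0         # representable only as a denormalized number
assert 0.0 < subnormal < min_normal  # it fills the gap between 0 and 2^-1022

smallest = 2.0 ** -1074              # smallest positive denormalized double
assert smallest > 0.0
assert smallest / 2.0 == 0.0         # below that, the value underflows to zero
```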

On the processor side, denormalized numbers are, owing to their rare occurrence, implemented with low priority and therefore lead to a significant slowdown of execution when they appear as an operand or result of a calculation. As a remedy (e.g. for computer games), Intel's SSE2 offers the non-IEEE-754-compliant option of disabling denormalized numbers completely: floating-point numbers that enter this range are rounded to 0.

Infinity

The floating-point value "infinity" represents numbers whose absolute value is too large to be represented. A distinction is made between +infinity and -infinity. The calculation 1.0/0.0 also yields "infinity" by definition.

NaN

Invalid (or undefined) results are represented by NaN, for example when an attempt is made to calculate the square root of a negative number. Some "indeterminate forms" also have the result "not a number", for example 0.0/0.0 or "infinity" - "infinity". In addition, NaNs are used in various application areas to represent "no value" or "unknown value". In particular, the value with the bit pattern 111...111 is often used for an "uninitialized floating-point number".
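Python doubles show the same behavior, though Python raises exceptions for some operations (e.g. 0.0/0.0 or math.sqrt(-1.0)) instead of returning NaN; an illustrative sketch:

```python
import math

# Indeterminate forms such as infinity - infinity yield NaN
assert math.isnan(math.inf - math.inf)
assert math.isnan(math.inf * 0.0)

# NaN propagates through arithmetic
assert math.isnan(float("nan") + 1.0)
```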

IEEE 754 defines two kinds of NaN: quiet NaN (NaNq) and signaling NaN (NaNs). Both represent "not a number", but a signaling NaN, in contrast to a quiet NaN, triggers an exception (trap) when it occurs as the operand of an arithmetic operation.

IEEE 754 allows the user to disable these traps. In that case, signaling NaNs are treated like quiet NaNs.

Signaling NaNs can be used to fill uninitialized memory, so that every use of an uninitialized variable automatically triggers an exception.

Quiet NaNs allow the handling of calculations that cannot produce a result, for instance because they are not defined for the given operands. Examples are division of zero by zero or the logarithm of a negative number.

Quiet and signaling NaNs differ in the most significant mantissa bit: for a quiet NaN it is 1, for a signaling NaN it is 0. The remaining mantissa bits may contain additional information, such as the cause of the NaN, which can be helpful in exception handling. However, the standard does not specify what information is contained in these bits; their evaluation is therefore platform-dependent.

The sign has no meaning for NaN. It is not specified what value the sign bit of a returned NaN has.

Rounding

IEEE 754 first distinguishes between binary roundings and binary-decimal roundings, for which lower quality requirements apply.

For binary rounding, the result must be rounded to the nearest representable number. If this is not uniquely determined (the exact result lies exactly in the middle between two representable numbers), the result is rounded so that the least significant bit of the mantissa becomes 0. Statistically, this rounds up in 50% of cases and down in the other 50%, so that the statistical drift described by Knuth in longer calculations is avoided.
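Round-to-nearest-even can be observed directly with Python doubles (52 stored mantissa bits); an illustrative sketch:

```python
# 1 + 2^-53 lies exactly halfway between 1.0 and 1 + 2^-52.
# The tie is broken toward the neighbor whose last mantissa bit is 0:
assert 1.0 + 2.0 ** -53 == 1.0                    # 1.0 has the even mantissa

# (1 + 2^-52) + 2^-53 is halfway between 1 + 2^-52 and 1 + 2^-51.
# Here the tie rounds *up*, again toward the even mantissa:
assert (1.0 + 2.0 ** -52) + 2.0 ** -53 == 1.0 + 2.0 ** -51
```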

An implementation conforming to IEEE 754 must provide three further rounding modes selectable by the programmer: rounding toward +infinity (always round up), rounding toward -infinity (always round down), and rounding toward 0 (always truncate, i.e. decrease the result in magnitude).

Operations

IEEE 754-compliant implementations must provide operations for arithmetic, square-root calculation, conversions, and comparisons. A further group of operations is listed in the appendix but is not mandatory.

Arithmetic and square root

IEEE 754 requires of a (hardware or software) implementation exactly rounded results for the addition, subtraction, multiplication, and division of two operands, and for the square root of one operand. That is, the computed result must equal the one that arises from an exact execution of the corresponding operation followed by rounding.

Furthermore, the calculation of the remainder after a division with an integral result is required. This remainder is defined by r = x - y*n, with n an integer such that abs(n - x/y) < 1/2, or abs(n - x/y) = 1/2 and n even. This remainder must be determined exactly, without rounding.
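Python's math.remainder implements exactly this IEEE remainder (available since Python 3.7); an illustrative check:

```python
import math

# r = x - y*n with n the integer nearest x/y; halfway cases pick even n
assert math.remainder(5.0, 2.0) == 1.0    # 5/2 = 2.5, tie -> n = 2, r = 5 - 4
assert math.remainder(7.0, 2.0) == -1.0   # 7/2 = 3.5, tie -> n = 4, r = 7 - 8
assert math.remainder(6.0, 4.0) == -2.0   # 6/4 = 1.5, tie -> n = 2, r = 6 - 8
assert math.remainder(6.5, 2.0) == 0.5    # 6.5/2 = 3.25 -> n = 3, r = 6.5 - 6
```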

Conversions

Conversions are required between all supported floating-point formats. On conversion to a floating-point format with lower precision, the result must be exactly rounded, as already described for arithmetic.

IEEE 754-compliant implementations must provide conversions between all supported floating-point formats and all supported integer formats. The integer formats are not defined in detail by IEEE 754.

For each supported floating-point format, an operation must exist that converts a floating-point number to the exactly rounded integral value in the same floating-point format.

Finally, conversions between the binary floating-point formats and a decimal format must exist that satisfy the minimum quality requirements described above.

Comparisons

IEEE 754 floating-point numbers must be comparable. The standard defines the required comparison operations and the required results for all special cases (mainly NaN, infinity, and 0). Compared with "school mathematics" comparisons (less than, equal, greater than), IEEE 754 adds the possible result unordered ("not classifiable"), which occurs when one of the operands is NaN. Two NaNs are always unequal, even if their bit patterns match.

Recommended operations

The appendix of the standard recommends ten additional operations. Since they are basically needed in an implementation anyway, this recommendation amounts to exposing these operations to the programmer. They are (in C notation): copysign(x, y), invertsign(x), scalb(y, n), logb(x), nextafter(x, y), finite(x), isnan(x), x ≠ y, unordered(x, y), class(x). The details of their implementation, especially again for the special cases such as NaN, are also proposed.

Exceptions, Flags and Traps

If exceptions occur during a calculation, status flags are set. The standard stipulates that the user can read and write these flags. The flags are "sticky": once set, they remain set until they are explicitly reset. Checking the flags is, for example, the only way to distinguish 1/0 (= infinity) from an overflow.

The standard also recommends making trap handlers available: if an exception occurs, the trap handler is called instead of the status flag being set. It is the responsibility of such a trap handler to set or clear the corresponding status flag.

The standard divides exceptions into 5 classes: overflow, underflow, division by zero, invalid operation, and inexact. For each class, a status flag is available.

History

In the 1960s and early 1970s, each processor had its own format for floating-point numbers and its own FPU or floating-point software to process that format. The same program could produce different results on different computers. The quality of the various floating-point arithmetics also differed greatly.

In 1976, Intel planned its own FPU for its microprocessors and wanted the best possible solution for the arithmetic to be implemented. Under the auspices of the IEEE, meetings to standardize FPUs for floating-point arithmetic on microprocessors began in 1977. The second meeting was held in November 1977 in San Francisco, chaired by Richard Delp. One of the pioneering participants was William Kahan.

By 1980, the number of proposals for the standard had been reduced to two: the KCS proposal (named after its authors Kahan, Coonen, and Stone, 1977) finally prevailed against the alternative from DEC (the F, D, and G formats). An important milestone on the way to the standard was the discussion of the treatment of underflow, which had hitherto been neglected by most programmers.

In parallel with the development of the standard, Intel largely implemented the standard proposals in the Intel 8087 FPU, which was used as the floating-point coprocessor of the 8088. The first version of the standard was adopted in 1985 and extended in 2008.
