Floating point

A floating-point number (also written floating point number; German Gleitkommazahl) is an approximate representation of a real number in exponential notation.

The set of floating-point numbers is a subset of the rational numbers. Together with the operations defined on them (floating-point arithmetic), the floating-point numbers form a finite arithmetic that was developed primarily with regard to numerical calculations on computers.


Basics

Problem

All (mechanical or electronic) computing aids, from the abacus to the computer, use fixed-point numbers as the simplest form of numerical representation. A mostly limited sequence of digits is stored, and the radix point is assumed at a fixed position.

In larger calculations, overflows inevitably occur, which make it necessary to scale the values and to rescale afterwards in order to bring the final result and all intermediate results into the permitted range of values. This scaling is time-consuming and had to be automated.

Exponential notation

An obvious idea, which leads directly to floating-point numbers, is to additionally store the exact position of the radix point for each value. That means nothing other than the mathematical representation of the number by two values, the mantissa and the exponent. The freedom in the choice of the exponent can be used, for example, to bring the mantissa into a predetermined range of values. This step is called normalization of the mantissa.

Example: The speed of light in vacuum is c = 299,792,458 m/s = 299,792.458 · 10³ m/s = 0.299792458 · 10⁹ m/s = 2.99792458 · 10⁸ m/s. Only the mantissa of the last representation is normalized to the range [1, 10).

This notation has long been used by physicists and mathematicians to state very large and very small numbers. Even today, the floating-point notation on calculators is therefore specifically referred to as the scientific format (SCI).

Floating-Point Arithmetic

When calculating with floating-point numbers, each number and each intermediate result is scaled individually (as opposed to a global scaling). The scaling (calculation of the exponent) of each intermediate result requires additional computational effort and was therefore avoided where possible until well into the 1980s. The PCs of that time thus had no floating-point unit as standard. A further factor was the higher memory requirement of floating-point numbers, which could only be limited by sacrificing accuracy. Accordingly, at first only high-performance computers (number crunchers) had floating-point arithmetic, or at least hardware support for a software floating-point arithmetic.

The choice of base 10 is arbitrary and owes itself only to the customary decimal system. Floating-point numbers can be represented with arbitrary bases; in general, an arbitrarily chosen base b applies. Computer systems use (predominantly) b = 2, (now rarely) b = 16, or b = 10 (e.g. in financial mathematics, see below). In every base, the condition 1 ≤ |m| < b applies for normalized numbers.

Historical Development

The first documented use of floating-point representation dates back about 2700 years: in Mesopotamia, scientific calculations were performed with base b = 60, and the exponent (usually a small integer) was simply carried along in the head. Until recently, the same procedure was common for calculations with a slide rule.

In calculating machines, Konrad Zuse was the first to use a floating-point representation of his own for his computers Z1 and Z3.

Representation

In the previous sections the basic parameters of a floating-point number were already presented: the base, the number of mantissa digits and the number of exponent digits. In addition, there are further parameters that facilitate the arithmetic operations. This section briefly describes the parameters and bit fields of a general floating-point number.

Base

One parameter is the chosen base b. Numbers processed directly by humans use either b = 10 or b = 1000. For the latter, the exponents are expressed with the prefixes of the International System of Units: kilo = 1000¹, mega = 1000², giga = 1000³, tera = 1000⁴ and milli = 1000⁻¹, micro = 1000⁻², nano = 1000⁻³, pico = 1000⁻⁴. The distance from Regensburg to Würzburg of 220 km can thus be represented in the form 220.0 · 1000¹ m.

In computers, the binary system and its relatives have prevailed, and the bases b = 2, b = 8 and b = 16 are common. Since the IEEE 754 standard for floating-point numbers, base 2 has been used almost exclusively in modern computers.

For the external representation of floating-point numbers in decimal notation, a form has prevailed in which the base 10 is denoted by the letter e (in double formats sometimes also d).

Example: The speed of light in vacuum is c = 2.99792458 · 10⁸ m/s = 2.99792458e8 m/s.

Mantissa

The mantissa contains the digits of the floating-point number. Storing more digits increases the accuracy; the number of mantissa digits expresses how precisely the number is approximated. This floating-point parameter is either stated directly or described in the form of the smallest number ε that, added to 1, yields a result different from 1 (ε minimal!) (see Properties below).

Example: For IEEE 754 numbers of type single with base b = 2, the mantissa is 24 digits long (including the hidden bit); here ε = 2⁻²³ ≈ 1.19209289551e−7.
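In C these values are available directly via the standard header float.h; a minimal sketch, assuming an IEEE 754 environment:

#include <float.h>
#include <stdio.h>

int main(void)
{
    /* float.h exposes the epsilon of the underlying arithmetic */
    printf("FLT_EPSILON = %g\n", FLT_EPSILON); /* 1.19209e-07 for IEEE single */
    printf("DBL_EPSILON = %g\n", DBL_EPSILON); /* 2.22045e-16 for IEEE double */
    return 0;
}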

Exponent

Via the normalization, the exponent stores the exact position of the radix point and thus the order of magnitude of the number. The number of exponent digits limits the possible positions of the radix point and thus describes the range of values of the representable floating-point numbers. To describe a system of floating-point numbers, one states the smallest and largest possible exponent, or alternatively the number of exponents and the offset relative to 0 (the bias).

Example: For IEEE 754 numbers of type single with base 2, the smallest possible exponent is −126 and the largest is 127. Hence the largest floating-point number representable in this system is MaxReal ≈ 3.403 · 10³⁸, and the smallest normalized positive one is MinPos ≈ 1.175 · 10⁻³⁸. These values describe the permitted range of values.

Normalization

The representation of a floating-point number is at first not uniquely determined: the number 2 can, for example, be written as 2 · 10⁰ or as 0.2 · 10¹.

To force a uniquely determined representation, normalized floating-point numbers are therefore often used, in which the mantissa is brought into a defined range. Two natural normalization conditions are 1 ≤ |m| < b and 1/b ≤ |m| < 1. By the first condition, the number 2 would be written as 2 · 10⁰; the representation 0.2 · 10¹ would not be allowed. Calculating with normalized numbers is simpler, which is why in the past some implementors of floating-point arithmetics allowed only normalized numbers. However, the number 0 cannot be represented in normalized form.

A distinction is made, relative to the usual base-10 number system, between:

  • Scientific notation with consistent normalization to 1 ≤ |m| < 10. Example: 10000 = 1 · 10⁴
  • Technical notation with normalization to 1 ≤ |m| < 1000; the number of remaining significant digits of the mantissa expresses the measurement uncertainty or calculation accuracy (denormalized digits). Only multiples of 3 appear in the exponent; this representation can easily be converted into units with SI prefixes as well as into digit grouping with thousands separators. Example: 10,000 m = 10.00 · 10³ m = 10.00 km. Significance: 10.00 · 10³ m = 10,000 m ± 5 m (4 significant digits, i.e. measurement in kilometres with rounding), but 0.01 · 10⁶ m = 10,000 m ± 5,000 m (2 significant digits, i.e. measurement in megametres). For precise details, the rules on standard and expanded uncertainty of DIN 1319-3 or the ISO/BIPM Guide (GUM, ENV 13005) apply.
  • IEEE 754 (floating-point numbers for microprocessors) uses for normalized numbers the normalization condition 1 ≤ |m| < 2 and additionally allows denormalized (subnormal) numbers between 0 and MinPos.

Representation of the exponent sign, with or without bias

In floating-point systems, the exponent is a signed integer. This would require the implementor to provide an additional signed-integer arithmetic for exponent calculations. This extra expense can be avoided by adding a fixed number, the bias value or excess, to the exponent and storing the sum instead of the exponent. This sum is then an unsigned positive number. Mostly, the use of a bias is combined with the representation of 0.

An alternative rarely used today is to represent the exponent in two's complement, in one's complement, or as a sign-magnitude number.

The advantage of the biased representation is that it makes a magnitude comparison between two positive floating-point numbers easier: it suffices to compare the digit sequences em, i.e. each exponent e followed by the mantissa m, lexicographically. A floating-point subtraction followed by a comparison with zero would be far more complex. The disadvantage of the biased representation compared with the two's-complement representation is that after an addition of two biased exponents, the bias must be subtracted to obtain the correct result.

IEEE 754 uses the biased representation with B = 127 for single and B = 1023 for double.
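For example, in the single format the exponent e = 3 (as in 11.25 = 1.40625 · 2³) is stored as the characteristic 3 + 127 = 130 = 10000010₂, and e = −1 (as in 0.5 = 1.0 · 2⁻¹) as 126 = 01111110₂; the lexicographic order of the stored bit patterns then matches the numerical order of the exponents.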

Sign of the number

The sign v of a floating-point number (+ or −, i.e. +1 or −1) can always be encoded in one bit. Usually the bit value 0 is used for positive numbers (+) and the bit value 1 for negative numbers (−). Mathematically, one can write v = (−1)^s.

Summary of the parameters

With these parameters a floating-point number system can be described, for example as 1.8.23.127.2 for IEEE single (sign bits.exponent bits.mantissa bits.bias.base). A second common notation omits the sign bit and gives only the lengths of mantissa and exponent: s23e8.

IEEE 754 double-precision numbers are accordingly 1.11.52.1023.2 or, for short, 1.11.52 or s52e11 floating-point numbers.

Hidden bit

In the representation of normalized mantissas in the binary system, one bit can be saved. Since the first digit of a normalized number is always nonzero, in the binary system this digit is always equal to 1. A digit with the fixed value 1 need not be stored explicitly, since it is implicitly known; one speaks of a hidden bit. The IEEE format for floating-point numbers mentioned above makes use of this saving option, but not the internal 80-bit format of Intel CPUs.

However, the use of a hidden bit forces a separate representation of zero, since because of the hidden bit every mantissa represents a value greater than 0.

Properties of a floating-point arithmetic

Floating-point numbers hold some surprises, especially for the mathematical layman, which often affect the results of calculator and computer calculations. Most importantly, familiar mathematical calculation rules are put out of force. Anyone who works intensively with a computational tool must know these characteristics. They are due to the limited precision with which mantissa and exponent are stored. The consequence of this limitation becomes clear when one considers that the infinitely many real numbers are represented by a very large, but in any case finite, number of values. One can imagine the floating-point numbers as a long table of fixed values. A floating-point function then assigns to each value of this table in its domain another value of the table. The same holds for two-operand and multi-operand operations. In the article Minifloat these tables are shown graphically.

From this follow the slight to absolute inaccuracy of the calculations and the suspended validity of familiar mathematical calculation rules.

Cancellation

Cancellation (German Auslöschung) refers to the effect that, when numbers of nearly equal magnitude are subtracted, the result can be wrong.

Example:

If, in a four-digit decimal floating-point arithmetic (b = 10, p = 4), one subtracts the exact number π from 3.141, the unprejudiced layman expects the correctly rounded result −0.0005927.

In fact, the result obtained is −0.001000: the four-digit rounded value of π is 3.142, so 3.141 − 3.142 = −0.001000 is the result of the calculation. This result arises because already the input values, in particular π, are represented in floating point and are not exact.
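The same effect can be reproduced in binary arithmetic; a minimal C sketch with illustrative values:

#include <stdio.h>

int main(void)
{
    /* 1e-15 is not exactly representable, so a already carries
       a small rounding error ... */
    double a = 1.0 + 1e-15;
    double b = 1.0;
    /* ... and subtracting the nearly equal b promotes that error
       into the leading digits of the result */
    printf("%.17g\n", a - b); /* about 1.1102230246251565e-15, not 1e-15 */
    return 0;
}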

Numbers of differing magnitude (absorption)

Adding or subtracting a number that is much smaller in magnitude does not change the larger number.

In four-digit decimal arithmetic (b = 10, p = 4), adding 1e−3 to 1e2 changes nothing about the larger operand: 100.0 + 0.001 = 100.0|01 = 100.0. The same holds for subtraction: 100.0 − 0.001 = 99.99|9 = 100.0.
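The same absorption occurs in binary single precision; a minimal C sketch:

#include <stdio.h>

int main(void)
{
    float big   = 16777216.0f; /* 2^24; adjacent floats here are 2 apart */
    float small = 0.5f;        /* less than half of that spacing */
    printf("%.1f\n", big + small); /* prints 16777216.0: small is absorbed */
    return 0;
}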

Underflow

Since the floating-point representation has a smallest positive number below which no value can be represented, a result in this range is usually represented by 0. One speaks in this case of an underflow. If the result is an intermediate result, a value different from 0 has thereby been replaced by 0, and in principle all knowledge about the result is lost. In some cases the accuracy of the final result is not affected, but in other cases the final result can be completely wrong.

Invalidity of the associative and distributive laws

The addition and the multiplication of floating-point numbers are not associative; that is, in general: (a + b) + c ≠ a + (b + c).

The addition and multiplication of floating-point numbers are not distributive; that is, in general: a · (b + c) ≠ a · b + a · c.
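A minimal C sketch of the broken associativity:

#include <stdio.h>

int main(void)
{
    double a = 1e16, b = -1e16, c = 1.0;
    printf("%g\n", (a + b) + c); /* 1: a and b cancel exactly first */
    printf("%g\n", a + (b + c)); /* 0: c is absorbed by b before a is added */
    return 0;
}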

Solvability of equations

In floating-point arithmetic, some equations that are normally unsolvable have a solution. This effect is even used to characterize such a floating-point system.

Example:

In the field of real numbers, the equation 1 + x = 1 has no solution for x ≠ 0.

In floating-point arithmetic, this equation has many solutions, namely all numbers x that are too small to have any effect in the sum. Again with the example of four-digit decimal numbers (b = 10, p = 4), the following applies (the bar | marks the digits dropped in the addition): 1.000 + 0.00004 = 1.000|04 = 1.000.

The smallest number already mentioned above that, added to 1, yields a result different from 1 (ε minimal!) is called the machine epsilon.
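The machine epsilon can be determined experimentally with exactly this property; a minimal C sketch, assuming expressions are evaluated in plain double precision (extended-precision registers, as on x87, can distort the loop):

#include <stdio.h>

int main(void)
{
    /* halve eps until adding it to 1.0 no longer changes the sum */
    volatile double eps = 1.0;
    while (1.0 + eps / 2.0 != 1.0)
        eps /= 2.0;
    printf("%g\n", eps); /* 2.22045e-16 on IEEE 754 systems */
    return 0;
}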

Conversions

When the base is different from 10, numbers must be converted between the floating-point system and the decimal system in order to obtain a human-readable representation. This is usually programmed to be fast (and often inaccurately). An already old and important requirement for this conversion is its bit-exact reversibility: a result displayed in decimal should be readable back in and reproduce the same floating-point number bit for bit.

This requirement is often ignored. An exception is Java, which observes the following theorem:

Theorem: It can be shown that it is not sufficient to round the calculated result to the number of decimal places corresponding to the mantissa length and to output these decimal places. A single additional digit suffices, however (Theorem 15). That is the reason why an extra, seemingly superfluous digit always appears in the representation of real numbers produced by Java programs.
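In C the analogous guarantee is reached with 17 significant decimal digits for IEEE double, available since C11 as the macro DBL_DECIMAL_DIG; a minimal sketch:

#include <float.h>
#include <stdio.h>

int main(void)
{
    double x = 0.1;
    printf("%.*g\n", DBL_DIG, x);         /* 15 digits: 0.1 */
    printf("%.*g\n", DBL_DECIMAL_DIG, x); /* 17 digits: 0.10000000000000001 */
    /* only the 17-digit form is guaranteed to read back bit-exactly */
    return 0;
}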

Decimals

Even simple decimal fractions, e.g. 0.1, cannot be represented exactly as binary floating-point numbers, since many terminating decimal fractions are non-terminating, periodic numbers in the binary system; of these, only the first digits are stored, which produces an inaccuracy. Decimal 0.1 is binary 0.0001100110011... One can show, however, that in a binary floating-point arithmetic with proper rounding 10 × 0.1 = 1 must still hold, and in general (m/10) × 10 = m (see Theorem 7). In disciplines such as financial mathematics, results are often required that exactly match a calculation by hand. That is only possible with decimal floating-point arithmetic or (with some problems) with fixed-point arithmetic.
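A minimal C sketch of both effects:

#include <stdio.h>

int main(void)
{
    double sum = 0.0;
    for (int i = 0; i < 10; ++i)
        sum += 0.1;                    /* ten inexact additions */
    printf("%d\n", sum == 1.0);        /* 0: the rounding errors accumulate */
    printf("%d\n", 10.0 * 0.1 == 1.0); /* 1: one correctly rounded operation */
    return 0;
}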

Test for equality

The restriction mentioned in the section Decimals, that many decimal numbers cannot be represented exactly in the binary system of a computer, has consequences for programmed comparisons between floating-point numbers. An example in the C language illustrates this:

#include <stdio.h>

int main(void)
{
    if (.362 * 100 != 36.2)
        printf("different\n");

    if (.362 * 100 / 100 != .362)
        printf("also different\n");

    return 0;
}

Although the equations 0.362 · 100 = 36.2 and 0.362 · 100 / 100 = 0.362 are mathematically correct, they fail because of the inexact translation into the computer's binary system: in the example program, both inequalities are considered true and both lines are printed.

Comparisons for exact equality must therefore be replaced by a query whether the values to be compared can be regarded as equal within an achievable accuracy (usually called the tolerance).

If an absolute error is tolerated in the comparison, |a − b| ≤ ε is a possible formulation.

If a relative error is tolerated in the comparison, |a − b| ≤ ε · |a| is a possible formulation. The second case usually has to be combined with the special-case query a = 0.
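Both formulations as C helper functions (a minimal sketch; the names are illustrative):

#include <math.h>
#include <stdbool.h>

/* comparison tolerating an absolute error eps */
bool nearly_equal_abs(double a, double b, double eps)
{
    return fabs(a - b) <= eps;
}

/* comparison tolerating a relative error eps; the special case
   a == 0 must be handled separately, as noted above */
bool nearly_equal_rel(double a, double b, double eps)
{
    if (a == 0.0)
        return fabs(b) <= eps;
    return fabs(a - b) <= eps * fabs(a);
}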

Even numbers with exactly the same bit pattern, and thus actually exactly identical values, are sometimes not considered equal by the computer. The cause is the sometimes non-identical formats in memory (e.g. Intel: 64 bits) and in the floating-point unit during a calculation (e.g. Intel: 80 bits). If, of the same bit patterns to be compared, one comes from memory, and is thus rounded, and one comes from the FPU, and thus has full precision, the comparison yields the wrong result. The remedy is the same as already described. Such a problem can also occur with greater/less comparisons; depending on the language and architecture used, special instructions and/or a detour through main memory must be taken to resolve it.

Hidden use of other representations

Some computing systems use several different formats when calculating. With Intel processors and related ones such as AMD, the FPU calculates with an 80-bit format. The numbers are stored in an IEEE-754-compatible 64-bit or 32-bit format. When MMX/SSE extensions are used, other calculation formats are employed again. This leads to further characteristics that are at first very opaque to laymen. A simple comparison of exactly identical bit patterns for equality can lead to the conclusion that the seemingly identical bit patterns are nevertheless different. The following program sometimes yields paradoxical results when called with the same value for x and y:

#include <stdio.h>
#include <math.h>

void compare(double x, double y)
{
    if (cos(x) != cos(y))
        printf("paradoxical\n");
    else
        printf("as one would expect\n");
}

The explanation in such cases is that the compiler must compute the two cos values in two steps, and of course it does not know at this point in the program that both calls receive identical arguments. If the internal registers do not suffice, one of the cos values must be cached in memory. With Intel, it is cached as a 64-bit value, and thus with an additional rounding, while the second cosine calculation, like the first, is performed with 80 bits. If the two results are then compared, the value rounded to 64 bits and the differently rounded 80-bit value no longer match.

Binary floating-point numbers in digital technology

The examples mentioned above are consistently given in the decimal system, i.e. with base b = 10. Computers, in contrast, use the binary system with base b = 2.

Single and double precision

In computers, floating-point numbers are usually represented as sequences of 32 bits (single precision) or 64 bits (double precision).

Some processors also allow longer floating-point numbers; for instance, processors derived from the Intel x86 series (among others the Intel Pentium and AMD Athlon) know a floating-point representation with 80 bits for intermediate results. Some systems also allow floating-point numbers with 128 bits (quadruple precision). Some older systems also used other lengths such as 40 bits.

There are also systems called minifloats with very few bits (about 8 or 16), which are used in memory-poor systems (controllers) or limited data streams (e.g. graphics cards).

IEEE 754 and other standards

The most widespread and best-known floating-point format today was designed by the IEEE in 1985, is set forth in the standard IEEE 754, and is available in most computers as hardware or software arithmetic. IEEE 854 is a standard for decimal floating-point numbers. Both standards were merged and expanded in the revision IEEE 754r.

The IEEE has regulated the representation of floating-point numbers in its standard IEEE 754 since 1985; almost all modern processors follow this standard. Counter-examples that do not meet the specifications of IEEE 754 are some IBM mainframe systems (hexfloat format), the VAX architecture, and some supercomputers such as those from Cray. The Java language is closely based on IEEE 754 but does not meet the standard completely.

The definition of IBM's hexfloat format can be found in the book "Principles of Operation" of the z/Architecture.

IBM's POWER6 is one of the first processors to implement decimal floating-point arithmetic in hardware; its base is thus 10. In the following, however, only base 2 is treated.

Strictly speaking, only the normalized numbers of IEEE 754 are genuine floating-point numbers. The denormalized numbers are actually fixed-point numbers; these special cases were created for specific numerical purposes.

Internal Representation

The actual representation in the computer thus consists of a sign bit, some exponent bits and some mantissa bits. The mantissa is usually normalized and represents numbers in the interval [1; 2). (Since the first bit, with value one, is always set in this interval, it is usually assumed implicitly and not stored; see hidden bit.) The exponent is usually given in biased format or in two's complement. Furthermore, some exponent values, for example the largest and the smallest possible exponent, are usually reserved for representing the special values (zero, infinity, NaN).

A floating-point number is thus represented as x = s · m · 2^e, where s is +1 or −1.
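This bit layout can be made visible in C; a minimal sketch for the single format, assuming IEEE 754:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    float f = 18.4f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits); /* reinterpret the 32 bits portably */

    uint32_t sign     = bits >> 31;          /* 1 sign bit */
    uint32_t expfield = (bits >> 23) & 0xFF; /* 8 exponent bits, bias 127 */
    uint32_t mantissa = bits & 0x7FFFFF;     /* 23 mantissa bits, hidden bit omitted */

    printf("sign=%u characteristic=%u (e=%d) mantissa=0x%06X\n",
           sign, expfield, (int)expfield - 127, mantissa);
    /* for 18.4f: sign=0 characteristic=131 (e=4) mantissa=0x133333 */
    return 0;
}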

Limitations and their mathematical foundations

Because numbers are represented differently in the two systems, artifacts can occur. (Rational) numbers that appear "round" in the decimal system, for example 12.45, cannot be represented exactly in the binary system. Instead, their binary representation is rounded within the relevant calculation accuracy, so that when converting back to the decimal system one obtains, for example, the value 12.44999999900468785. In subsequent calculations this can lead to unexpected rounding errors up or down.

The artifacts mentioned above are unavoidable in the binary system, since many numbers that can be represented exactly in the decimal system are periodic numbers with infinitely many digits in the binary system. They could be avoided only by using encodings to base 10 (or other suitable bases); see also BCD code. Binary floating-point numbers are nevertheless used for various reasons.

However, for every base d there are infinitely many rational numbers that have a finite representation (period 0) in other bases but an infinite representation (a true period) in base d. The decimal system is distinguished here only by the fact that people are used to it, and therefore the decimal format is often preferred for the input and output of calculations.

In mathematics, a floating-point number system is a tuple (b, t, L, U), where b represents the base, t the length of the mantissa, and [L, U] the range of the exponent.

A real number x ≠ 0 is then represented by a mantissa m and an exponent e such that x ≈ ±m · b^e, with m normalized (1 ≤ m < b) and L ≤ e ≤ U.

This makes a mathematical treatment of the rounding error possible. The above representation implements a projection fl: ℝ → F, x ↦ fl(x), which maps each real number to the nearest floating-point number.

The rounding error is then defined as the relative deviation ε(x) = (fl(x) − x) / x.

For double values, the maximum relative rounding error corresponds precisely to 2⁻⁵³ (approximately 1.1 · 10⁻¹⁶).

Example: Calculation of a floating-point number

The number 18.4₁₀ is to be converted into a floating-point number; we use the IEEE single standard here (IEEE 754, binary32).

1. Calculation of the excess

(The excess or bias is a constant belonging to the number format. For this, count the bits that are reserved for the exponent in the number representation; in the IEEE 754 single standard these are 8 digits.)

Excess = 2^(n−1) − 1        (n = number of exponent bits in the representation)
       = 2^(8−1) − 1
       = 2^7 − 1
       = 128 − 1
       = 127

2. Conversion of the decimal number into an unsigned binary fixed-point number

Floating-point number = 18.4

Integer part = 18
  18 / 2 = 9 remainder 0 (least significant bit)
   9 / 2 = 4 remainder 1
   4 / 2 = 2 remainder 0
   2 / 2 = 1 remainder 0
   1 / 2 = 0 remainder 1 (most significant bit)
    → 10010

Fractional part = 0.4
  0.4 · 2 = 0.8 → 0 (most significant bit)
  0.8 · 2 = 1.6 → 1
  0.6 · 2 = 1.2 → 1
  0.2 · 2 = 0.4 → 0
  0.4 · 2 = 0.8 → 0
  0.8 · 2 = 1.6 → 1 (least significant bit)
  ...
    → 0.0110011001100110011...

18.4 = 10010.011001100110011...

3. Normalization

10010.01100110011... · 2^0 = 1.001001100110011... · 2^4

4. Calculation of the binary exponent

Because of the factor 2^4, the exponent = 4.

Exponent + excess = 4 + 127 = 131
  131 / 2 = 65 remainder 1 (least significant bit)
   65 / 2 = 32 remainder 1
   32 / 2 = 16 remainder 0
   16 / 2 =  8 remainder 0
    8 / 2 =  4 remainder 0
    4 / 2 =  2 remainder 0
    2 / 2 =  1 remainder 0
    1 / 2 =  0 remainder 1 (most significant bit)
    → 10000011

5. Determining the sign bit

The sign enters via the factor (−1)^s:
  positive → s = 0
  negative → s = 1
Here: s = 0

6. Forming the floating-point number

1 sign bit | 8 exponent bits | 23 mantissa bits
0 | 10000011 | 00100110011001100110011

→ the leading 1 before the binary point is left out as the hidden bit; since it is always a 1, it does not need to be stored
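The result can be checked by assembling the three bit fields and reinterpreting them as a float; a minimal C sketch (the mantissa 00100110011001100110011₂ is 0x133333, the characteristic 10000011₂ is 0x83):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    uint32_t sign = 0, characteristic = 0x83, mantissa = 0x133333;
    uint32_t bits = (sign << 31) | (characteristic << 23) | mantissa;

    float f;
    memcpy(&f, &bits, sizeof f); /* reinterpret the assembled bits as float */
    printf("%.8g\n", f);         /* 18.4, up to single-precision rounding */
    return 0;
}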

Calculation of an IEEE single-precision floating-point number (32-bit floating-point number)

Here the exact calculation steps are presented that convert a decimal number into a binary floating-point number of type single per IEEE 754. For this, the three values (sign (1 bit), mantissa and exponent) that make up the number are calculated one after the other:

Sign: depending on whether the number is positive or negative, the sign bit s is 0 or 1: s = 0 for x ≥ 0, and s = 1 for x < 0.

All further calculations are carried out with the absolute value |x| of the number.

Exponent: next, the exponent e is stored. The IEEE single data type provides 8 bits for it. The exponent must be chosen so that the mantissa takes a value between 1 and 2, i.e. 2^e ≤ |x| < 2^(e+1):

If a value for the exponent smaller than −126 or greater than 127 results here, the number cannot be stored with this data type. Instead, the number is stored as 0 (zero) or as "infinity".

The value of the exponent is not stored directly, however, but increased by a bias value in order to avoid negative values. In IEEE single the bias value is 127. Thus the exponent values −126 ... 127 are stored as the so-called "characteristic" between 1 ... 254. The values 0 and 255 as characteristic are reserved for the special numerical values "zero", "infinity" and "NaN".

Mantissa: the mantissa m = |x| / 2^e is now stored in the remaining 23 bits; the leading 1 before the binary point is omitted as the hidden bit:

Number = 11.25

Sign: + → s = 0

Exponent: 11.25 = 1.40625 · 2³ → e = 3; characteristic = 3 + 127 = 130 → 10000010₂

Mantissa: 1.40625 = 1.01101₂ → mantissa bits 01101000000000000000000₂

This yields the following single-precision floating-point number:

0 10000010 01101000000000000000000

If one wants to calculate a decimal number from a floating-point number stored in a machine word (32 bits), this can be done quite quickly with the following formula: x = (−1)^s · (1 + M/2²³) · 2^(E−127), where M is the mantissa field read as an integer and E the characteristic.

Calculation of an IEEE double-precision floating-point number (64-bit floating-point number)

To calculate a decimal number from a floating-point number stored in a machine word (64 bits), the following formula can be used: x = (−1)^s · (1 + M/2⁵²) · 2^(E−1023).

Example:

The following 64-bit binary number is to be interpreted as a floating-point number:

0 10001110100 0000101001000111101011101111111011000101001101001001

(the leftmost bit is bit 63, the rightmost bit is bit 0)

Bit 63 represents the sign (1 bit), that is:

s = 0₂ = 0

Bits 62 to 52 represent the exponent (11 bits), so:

E = 10001110100₂ = 1140

Bits 51 to 0 represent the mantissa (52 bits), so:

M = 0000101001000111101011101111111011000101001101001001₂

Substituted into the formula, this gives as result (values rounded): x = (−1)⁰ · (1 + M/2⁵²) · 2^(1140−1023) ≈ 1.04016 · 2¹¹⁷ ≈ 1.728 · 10³⁵.
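The interpretation can likewise be checked in C; a minimal sketch (0x4740A47AEFEC5349 is the hexadecimal form of the bit pattern above):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* sign | 11 exponent bits | 52 mantissa bits from the example */
    uint64_t bits = 0x4740A47AEFEC5349ULL;
    double d;
    memcpy(&d, &bits, sizeof d); /* reinterpret the bit pattern as double */
    printf("%.17g\n", d);        /* about 1.728e35 */
    return 0;
}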

See also

  • Integer ( data type)
  • Floating-point numbers in digital audio application