Minifloat

The term minifloat refers to numbers in a floating-point format with only a few bits. Minifloats are not well suited to numerical calculations, but they are occasionally used for special purposes or in teaching.

Minifloats with 16 bits are referred to as half-precision numbers (as opposed to single- and double-precision numbers). There are also minifloats with 8 bits or fewer. Many minifloats are defined according to the principles of the IEEE 754 standard and contain special values for NaN and infinity. Normalized numbers are then stored with a biased (excess) exponent. The revised standard, IEEE 754-2008, includes binary minifloats with 16 bits.

The ITU-T G.711 standard for encoding audio data, which is used among other things for au audio files and for telephone connections, uses 1.3.4 minifloats in its so-called A-law encoding to represent a signed 13-bit integer as an 8-bit value.

Minifloats are also used in computer graphics to represent integers. If the IEEE 754 principles are to apply at the same time, the smallest denormalized number must be equal to one. This determines the excess value (bias) to use. The following example demonstrates the derivation and the underlying principles.


Example

A minifloat in one byte (8 bits) with 1 sign bit, 4 exponent bits and 3 mantissa bits is to be constructed so that it represents integers according to IEEE 754 principles. In short, this is a (1.4.3.-2) number; the parentheses contain all IEEE parameters: sign bits, exponent bits, mantissa bits and bias. The essential step is to choose the bias value appropriately. The (as yet) unknown exponent (stored value e minus bias value b) is provisionally called x. Numbers in other number systems are marked with the base in parentheses: 5 = 101 (2) = 10 (5). The bit patterns are split by spaces into their constituents.

Representation of zero

0 0000 000 = 0

Denormalized numbers

The mantissa is extended with a leading 0:

0 0000 001 = 0.001 (2) * 2^x = 0.125 * 2^x = 1 (smallest denormalized number)
...
0 0000 111 = 0.111 (2) * 2^x = 0.875 * 2^x = 7 (largest denormalized number)

Normalized numbers

The mantissa is extended with a leading 1:

0 0001 000 = 1.000 (2) * 2^x = 1 * 2^x = 8 (smallest normalized number)
0 0001 001 = 1.001 (2) * 2^x = 1.125 * 2^x = 9
...
0 0010 000 = 1.000 (2) * 2^(x+1) = 1 * 2^(x+1) = 16 = 1.6e1
0 0010 001 = 1.001 (2) * 2^(x+1) = 1.125 * 2^(x+1) = 18 = 1.8e1
...
0 1110 000 = 1.000 (2) * 2^(x+13) = 1.000 * 2^(x+13) = 65536 = 6.5e4
0 1110 001 = 1.001 (2) * 2^(x+13) = 1.125 * 2^(x+13) = 73728 = 7.4e4
...
0 1110 110 = 1.110 (2) * 2^(x+13) = 1.750 * 2^(x+13) = 114688 = 1.1e5
0 1110 111 = 1.111 (2) * 2^(x+13) = 1.875 * 2^(x+13) = 122880 = 1.2e5 (largest normalized number)

(The rounded values on the right reflect the actual precision: three mantissa bits cannot, of course, store five or six decimal digits.)

Infinity

0 1111 000 = infinity

Without the IEEE 754 interpretation, the numerical value of infinity would be:

0 1111 000 = 1.000 (2) * 2^(x+14) = 2^17 = 131072 = 1.3e5

Not a number

0 1111 xxx = NaN (for any mantissa xxx other than 000)

Without the IEEE 754 interpretation, the numerical value of the largest NaN would be:

0 1111 111 = 1.111 (2) * 2^(x+14) = 1.875 * 2^17 = 245760 = 2.5e5

If the smallest denormalized number is to be equal to one, the line for that number (0.125 * 2^x = 1) requires x = 3. Since the smallest normalized number has the stored exponent 1, this gives an exponent bias (excess value) of -2. To obtain the actual exponent x, -2 must be subtracted from each stored exponent (that is, 2 is added).
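The derivation above can be checked with a small decoder. The following is a minimal sketch in Python, not a reference implementation; the function name decode_minifloat and its parameter names are hypothetical and chosen only for this illustration. It assumes the usual IEEE 754 conventions described in the example: denormalized numbers use the exponent of stored value 1, and the all-ones exponent encodes infinity and NaN.

```python
import math

def decode_minifloat(byte, exp_bits=4, man_bits=3, bias=-2):
    """Decode a (1.4.3.-2) minifloat held in one byte (hypothetical helper)."""
    sign = -1.0 if (byte >> (exp_bits + man_bits)) & 1 else 1.0
    e = (byte >> man_bits) & ((1 << exp_bits) - 1)   # stored exponent
    m = byte & ((1 << man_bits) - 1)                 # stored mantissa bits
    fraction = m / (1 << man_bits)                   # 0.mmm as a binary fraction
    if e == (1 << exp_bits) - 1:                     # all-ones exponent
        return sign * math.inf if m == 0 else math.nan
    if e == 0:                                       # denormalized: leading 0
        return sign * fraction * 2.0 ** (1 - bias)
    return sign * (1 + fraction) * 2.0 ** (e - bias) # normalized: leading 1

# A few rows of the example table:
for pattern in (0b00000001, 0b00000111, 0b00001000, 0b00010001, 0b01110111):
    print(f"{pattern:08b} -> {decode_minifloat(pattern)}")
```

With bias = -2, the stored exponent 1 gives 2^(1 - (-2)) = 2^3, which matches x = 3 from the derivation.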

Discussion of this example

The advantage of such integer minifloats in one byte is the much larger range of values, from -122880 to 122880, compared with the range -128 to 127 of the two's complement representation. In return, the precision drops off rapidly, since there are only ever 4 significant bits. The gaps between the largest normalized numbers are correspondingly large.

This minifloat representation can represent only 242 different numbers (if +0 and -0 are counted as different), because 14 of the bit patterns are not a number (NaN).
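These counts can be reproduced by enumerating all 256 bit patterns. The sketch below is only an illustration; the helper decode is a hypothetical name and hard-codes the (1.4.3.-2) format of the example.

```python
import math

def decode(byte):
    """Decode the (1.4.3.-2) example format (hypothetical helper)."""
    sign = -1.0 if byte & 0x80 else 1.0
    e, m = (byte >> 3) & 0xF, byte & 0x7
    if e == 15:                               # all-ones exponent: Inf or NaN
        return sign * math.inf if m == 0 else math.nan
    if e == 0:                                # denormalized: (m/8) * 2^3 = m
        return sign * m
    return sign * (8 + m) * 2.0 ** (e - 1)    # (1 + m/8) * 2^(e + 2)

values = [decode(b) for b in range(256)]
nan_count = sum(1 for v in values if math.isnan(v))
number_count = 256 - nan_count                # includes +0, -0, +Inf and -Inf
finite = [v for v in values if math.isfinite(v)]

print(nan_count, number_count, min(finite), max(finite))
# prints: 14 242 -122880.0 122880.0
```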

Interestingly, the bit patterns of the minifloats and of the integers coincide for the numbers from 0 to 16. The first difference occurs at the bit pattern 00010001, which is interpreted as 18 as a minifloat but as 17 as an integer.

For negative numbers, however, this correspondence no longer holds, since negative integers are usually represented in two's complement.
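The correspondence between bit patterns can be checked directly. This is a minimal sketch under the same assumptions as the example; decode and twos_complement are hypothetical helper names.

```python
import math

def decode(byte):
    """Decode the (1.4.3.-2) example format (hypothetical helper)."""
    sign = -1.0 if byte & 0x80 else 1.0
    e, m = (byte >> 3) & 0xF, byte & 0x7
    if e == 15:
        return sign * math.inf if m == 0 else math.nan
    if e == 0:
        return sign * m
    return sign * (8 + m) * 2.0 ** (e - 1)

def twos_complement(byte):
    """Interpret a byte as a signed 8-bit integer."""
    return byte - 256 if byte & 0x80 else byte

# Non-negative patterns whose minifloat value equals their integer value.
matching = [b for b in range(128) if decode(b) == b]
first_mismatch = next(b for b in range(128) if decode(b) != b)

print(matching == list(range(17)))                      # True: 0 to 16 coincide
print(f"{first_mismatch:08b}", decode(first_mismatch))  # 00010001 18.0
print(decode(0x81), twos_complement(0x81))              # -1.0 -127
```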

On the right side of the graph one can clearly see the varying density of the floating-point numbers on the (vertical) axis of real numbers, a characteristic feature of all floating-point systems. This varying density results from the exponential-like shape of the graph.

Although many computer scientists intuitively believe that the curve is continuously differentiable, one can clearly see the "kinks" at the points where the exponent value changes. As long as the exponent remains constant, the floating-point numbers, which then differ only in their mantissas, are distributed evenly (linearly); between two "kinks" the curve is a straight line. Strictly speaking there is no curve at all, because floating-point numbers form only a discrete, finite set of points. The statement therefore refers to a curve sensibly interpolated through this finite set of points. In practice there are usually so many points that they appear to the viewer as a continuous curve (for double there are 2^64, that is about 10^19 points).
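The piecewise-linear shape can be made visible by listing the gaps between neighbouring values. The following sketch assumes the (1.4.3.-2) format of the example; decode is a hypothetical helper name. Within one exponent the gap is constant, and it doubles at each "kink".

```python
import math

def decode(byte):
    """Decode the (1.4.3.-2) example format (hypothetical helper)."""
    sign = -1.0 if byte & 0x80 else 1.0
    e, m = (byte >> 3) & 0xF, byte & 0x7
    if e == 15:
        return sign * math.inf if m == 0 else math.nan
    if e == 0:
        return sign * m
    return sign * (8 + m) * 2.0 ** (e - 1)

# Non-negative finite patterns 0 0000 000 .. 0 1110 111, in ascending order.
values = [decode(b) for b in range(0x78)]
gaps = [hi - lo for lo, hi in zip(values, values[1:])]

# Gap between neighbours, grouped by the stored exponent of the lower value.
for e in range(15):
    print(e, sorted(set(gaps[e * 8:(e + 1) * 8])))

distinct_gaps = sorted(set(gaps))   # powers of two from 1 to 8192
```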

Arithmetic with minifloats

Addition

The graph demonstrates the addition of two even smaller (1.3.2.3) minifloats with 6 bits. This model of the real numbers illustrates all IEEE 754 principles. A NaN operand or the operation "Inf - Inf" yields a NaN result. Inf cannot be increased or decreased by finite values. Finite summands can also have an infinite result (14.0 + 3.0). The finite range (operands and result are finite) is represented by the lines x + y = c, where c is always one of the representable minifloat values.
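The behaviour described above can be imitated in a few lines. The sketch below is an illustration only: the names decode6, GRID, round_to_minifloat and add are hypothetical, and the rounding mode (round to nearest, ties to the even mantissa, with overflow to infinity) is an assumption based on the IEEE 754 default, since the text does not name one.

```python
import math

def decode6(bits):
    """Decode a 6-bit (1.3.2.3) minifloat: 1 sign, 3 exponent, 2 mantissa bits, bias 3."""
    sign = -1.0 if bits & 0x20 else 1.0
    e, m = (bits >> 2) & 0x7, bits & 0x3
    if e == 7:                                     # all-ones exponent: Inf or NaN
        return sign * math.inf if m == 0 else math.nan
    if e == 0:                                     # denormalized
        return sign * (m / 4) * 2.0 ** (1 - 3)
    return sign * (1 + m / 4) * 2.0 ** (e - 3)     # normalized

# Non-negative finite values in bit-pattern order; the last entry, 16.0, is the
# value the pattern for infinity would have and serves as the overflow marker.
GRID = [decode6(b) for b in range(0x1C)] + [16.0]

def round_to_minifloat(x):
    """Round x to the nearest representable value (ties to the even bit pattern)."""
    if math.isnan(x):
        return math.nan
    sign = -1.0 if math.copysign(1.0, x) < 0 else 1.0
    a = abs(x)
    if a >= 16.0:
        return sign * math.inf
    best = min(range(len(GRID)), key=lambda i: (abs(GRID[i] - a), i % 2))
    return sign * (math.inf if GRID[best] == 16.0 else GRID[best])

def add(x, y):
    """Add two minifloat values; Inf - Inf is already NaN in Python floats."""
    return round_to_minifloat(x + y)

print(add(14.0, 3.0))             # inf: finite summands, infinite result
print(add(math.inf, -14.0))       # inf: Inf is not changed by finite values
print(add(math.inf, -math.inf))   # nan
print(add(3.0, 0.5))              # 3.5
```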

Subtraction, multiplication and division

The remaining arithmetic operations can be represented similarly:

Multiplication

Division

As with all floating-point multiplications, about 25 percent of the results cannot be represented in the numerical format of the operands.

