Multiply–accumulate operation#Fused multiply–add

The fused multiply–add (FMA) operation is a variant of the multiply–accumulate (MAC) operation for floating-point numbers. It is used on some microprocessors with a floating-point unit to optimize calculations. In contrast to the ordinary operation, known as unfused multiply–add, the fused multiply–add operation performs the calculation at full precision and rounds the result only once, at the end of the calculation.

The technology was developed in the late 1980s by IBM Research but initially saw only limited adoption. As integration density increased, it became possible to implement FMA simply in GPUs, DSPs and CPUs. The FMA operation is defined in IEEE 754-2008.

Application

Numerical algorithms often contain operations of the form

a ← a + (b · c)

This is the case, among others, in the evaluation of dot products, in matrix operations and in numerical integration.
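
A dot product is a typical case: each loop iteration performs one multiply–accumulate step, adding the product of two elements to a running sum. The following is a minimal Python sketch of that pattern. The function name dot and the sample vectors are illustrative assumptions, and ordinary Python arithmetic is unfused, so it rounds after the multiplication and again after the addition.

```python
def dot(xs, ys):
    """Dot product written as a chain of multiply-accumulate steps.

    Hypothetical helper for illustration only.
    """
    acc = 0.0
    for x, y in zip(xs, ys):
        # One step of the form a <- a + (b * c), with a = acc, b = x, c = y.
        acc = acc + x * y
    return acc


print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # prints 32.0
```

On hardware with FMA support, each loop iteration of such a routine is a candidate for a single fused instruction.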

In the conventional unfused multiply–add operation with N digits, the product b · c is calculated first and rounded to N digits; then a is added and the final result is rounded to N digits again. In the fused multiply–add operation, the rounding after the multiplication is omitted: the expression a + b · c is calculated at full precision and rounded to N digits only once, at the end. This requires more hardware than the unfused operation. The advantage is a smaller rounding error.
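
The effect of the omitted intermediate rounding can be made visible with a small Python sketch. It does not use a hardware FMA instruction; instead it emulates the fused operation in software by computing a + b · c exactly with rational numbers and rounding once to a double. The function names and the chosen operands are assumptions made for this illustration, and the emulation only handles finite inputs.

```python
from fractions import Fraction


def unfused_multiply_add(a, b, c):
    """a + b*c with two roundings: after the product and after the sum."""
    product = b * c          # rounded to the nearest double
    return a + product       # rounded again


def fused_multiply_add(a, b, c):
    """Software emulation of a + b*c with a single final rounding.

    Assumption: finite float inputs. Fraction holds the exact value of each
    double, and float() rounds the exact result to the nearest double.
    """
    exact = Fraction(a) + Fraction(b) * Fraction(c)
    return float(exact)


# b * c is exactly 1 - 2**-60, which is not representable as a double
# and rounds to 1.0 when the product is rounded on its own.
a = -1.0
b = 1.0 + 2.0 ** -30
c = 1.0 - 2.0 ** -30

print(unfused_multiply_add(a, b, c) == 0.0)           # prints True
print(fused_multiply_add(a, b, c) == -(2.0 ** -60))   # prints True
```

In this example the unfused variant loses the result entirely, because the small term is rounded away after the multiplication, while the single final rounding preserves it.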

Without FMA, evaluating this expression requires at least three different instructions:

  • Loading 'b' and 'c' into registers
  • Multiplying 'b' and 'c'
  • Possibly storing this intermediate result in a register
  • Optionally loading 'a' into the accumulator
  • Adding 'a' to the previously stored value '(b * c)'

If special opcodes are defined for operations of this form, the evaluation is carried out by an optimized arithmetic unit, the multiplier–accumulator (MAC), which executes the operation in one step. Of the scheme above, only two instructions remain: loading the operands and the subsequent FMA instruction.

Benefits

  • Enhanced floating-point performance by using the MAC
  • Improved utilization of registers, compact machine code

Disadvantages

  • FMA must be supported by the compiler; the generated machine code requires opcodes that deviate from the usual two- or three-address scheme. Making optimal use of FMA sometimes requires some dexterity and explicit intervention from the programmer.

Implementations

  • AMD FX Bulldozer (2011), see also FMA4
  • AMD Radeon HD 5000 (and subsequent architectures)
  • ARM VFPv4
  • Fujitsu SPARC64 VI (2007) and later
  • HP PA-8000 (1996) and later
  • IBM RISC System/6000 (1990)
  • Intel Itanium (2001)
  • SCE Toshiba Emotion Engine (1999)
  • STI Cell (2006)