MMX (instruction set)

The Multi Media Extension (MMX for short) is an extension to the IA-32 processor architecture that Intel brought to market in early 1997. It allows larger amounts of data to be processed in parallel and therefore faster. The parallelism comes from the SIMD architecture of MMX, in which each instruction is applied to several data items at once; this gives a performance advantage especially when processing audio and video data. Originally the abbreviation MMX stood for Matrix Math Extensions, but Intel's marketing renamed it Multi Media Extension.

MMX does not introduce new physical processor registers but instead uses the registers of the math coprocessor (FPU). Intel designed 57 new processor instructions for MMX and introduced four new data formats. A major advantage of MMX over conventional arithmetic is its support for saturation arithmetic.

After its launch, MMX was supported only hesitantly by the software industry, and after just three years it had been superseded by Intel's own successor SSE and by AMD's 3DNow!. Benchmarks of its performance showed a wide spread of results.

Requirements of multimedia applications

Multimedia and communication applications place different, and partly new, demands on a computer system and hence on the processor. The data to be processed can usually be parallelized to a high degree. In video editing, for example, the operations are identical for the many individual pixels. In theory, it would be optimal to execute them with a single instruction that applies to all pixels. The operations required are often not simple single instructions but longer instruction sequences. Overlaying an image on a background, for example, is a complex process: a mask is formed with XOR, the background is prepared with AND and NOT, and the partial images are combined with OR. These requirements are addressed by providing new, more complex instructions. The MMX instruction PANDN, for example, combines an inversion and an AND operation in the form x = y AND (NOT x).

Realization

With MMX, Intel created a new approach to using existing registers, new data formats, an extended instruction set, and a choice between different arithmetic modes (saturation mode and wrap-around mode). There are small internal differences, which do not affect the instruction set, between the (not officially so designated) MMX versions 1.0 and 2.0 of the various Pentium processors. The MMX approach is found in a far more developed form in ASICs (where it originally came from), in the AltiVec units of modern PowerPC CPUs, and in graphics cards.

New data formats

Four new data formats were created for MMX: PackedByte, PackedWord, PackedDoubleWord, and QuadWord. They make it possible to process integer data packets up to 64 bits wide at once. In principle these formats are only new names for formats that already existed. The new nomenclature indicates that MMX operates not on individual values but on arrays of data. In principle a QuadWord is just a 64-bit field that could equally have been called a DoubleLongInt; a ShortPackedWord is really a ShortPackedInteger.

Register usage

For data manipulation, eight additional 64-bit registers MM0 to MM7 were created, which are, however, physically identical to the 80-bit registers R0 to R7 of the FPU. MMX uses only eight of the ten bytes of each FPU register (that is, only the mantissa of the FPU value). Under MMX the two remaining bytes are set to the hexadecimal value FFFF. The other FPU registers, namely the 16-bit control, status, and tag registers, the 11-bit opcode register, and the two 48-bit pointer registers, have no meaning in MMX applications, or in very rare cases only a restricted or differently interpreted one.
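
The mapping of an MMX register onto an FPU register can be pictured with a small model. The following Python sketch is only an illustration under stated assumptions: the function names are invented here, and the 80-bit register is modeled as a plain integer whose two upper bytes are forced to FFFF on every MMX write, as described above.

```python
MASK64 = (1 << 64) - 1

def fpu_image_after_mmx_write(mm_value):
    """Return the modeled 80-bit FPU register image after an MMX write."""
    mantissa = mm_value & MASK64        # the eight bytes that MMX actually uses
    return (0xFFFF << 64) | mantissa    # the two remaining bytes become FFFF

def mmx_view(fpu_image):
    """Return the 64 bits of an 80-bit register image that MMX sees."""
    return fpu_image & MASK64

image = fpu_image_after_mmx_write(0x0123456789ABCDEF)
print(hex(image), hex(mmx_view(image)))
```

The model shows why the FPU cannot interpret the register as an ordinary number afterwards: the bytes outside the mantissa no longer hold a meaningful exponent.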

Switching between FPU and MMX

Before switching to an MMX application, one should first check whether the system supports SIMD extensions in general and MMX in particular. This can be done with the CPUID instruction, available since the Pentium, by passing the value 1 in the EAX register.

    mov  eax, 1           ; request the feature flags
    cpuid                 ; execute the CPUID instruction
    test edx, 00800000h   ; is bit 23 set in register EDX?
    jnz  MMX_kompatibel   ; if so, the processor is MMX-compatible

If MMX is to be used after a positive test for MMX capability, the FPU data should first be saved to a 512-byte memory area with the FXSAVE instruction. The two bytes of each register that MMX does not use indicate the MMX state to every application, including an FPU application. An explicit instruction for switching to an MMX application does not exist, however. If an FPU instruction occurs during an MMX application, it finds the status NaN (Not a Number) in the register. Stray FPU instructions therefore usually remain without consequence.

After the application finishes, the FPU data previously saved with FXSAVE should be restored with FXRSTOR. To signal the transition from MMX back to pending FPU applications, there is also the MMX instruction EMMS, which is, however, neither mandatory nor always necessary. It may also be needed within an MMX application, for example when the MMX application calls an API that in turn uses FPU instructions.
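
The feature test in the assembly fragment above comes down to a single bit mask. The Python sketch below shows only that mask logic; Python has no direct access to CPUID, so the parameter `edx` is assumed to hold the EDX value returned by CPUID with EAX = 1, obtained by some other means. The function name is hypothetical.

```python
MMX_FEATURE_BIT = 23  # bit position of the MMX flag in EDX, as tested above

def has_mmx(edx):
    """Return True if the MMX feature bit is set in the given EDX value."""
    return bool(edx & (1 << MMX_FEATURE_BIT))

# The mask used in the assembly code is exactly bit 23.
assert (1 << MMX_FEATURE_BIT) == 0x00800000
```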

Use in operating systems

In multitasking operating systems, all register contents must be saved to a dedicated memory area on a task switch. Because a change to this memory area would have had to be supported by every operating system, MMX used a "trick" that also worked without operating system support: to the outside, the MMX registers were mapped onto the eight floating-point registers of the FPU. As a result, the actual FPU registers are not available while a program uses MMX. Newer instruction set extensions such as SSE use entirely separate registers and therefore necessarily require operating system support. Even on newer processors, the overlap of the floating-point registers with the MMX registers cannot be switched off.

Saturation mode and wrap-around mode

The MMX instruction set contains instructions that use saturation mode and instructions that work in wrap-around mode. The MMX instruction PADDB, for example, adds two packed bytes in wrap-around mode, while PADDSB does the same in saturation mode.

Saturation mode means that a number does not "overflow" when its largest or smallest value is exceeded but instead takes on that largest or smallest possible value.

An application example: for a fade-out effect on images, two pixels with 32-bit color depth could be darkened by a given value at the same time. Thanks to saturation, there is no need to check whether the pixels are already black. This, together with the parallel processing of multiple values, considerably increases the speed of the calculation.

In wrap-around mode, the carry is ignored on overflow or underflow. If the maximum value of a byte (decimal 255) is exceeded in an addition, for example in 11111111 + 00000010 with the nine-bit binary result (1)00000001, the most significant bit (here in brackets) is not taken into account, which leads to the result 00000001 (decimal 1).
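
The difference between the two modes can be made concrete with a lane-by-lane model. The following Python sketch emulates the byte-wise additions and an unsigned saturating subtraction on 64-bit integers; it is an illustration of the semantics described above, not of how the hardware works, and the helper names (`unpack_bytes`, `pack_bytes`, `lanewise`) are assumptions made for this example. The functions named after mnemonics model the instructions of the same name.

```python
def unpack_bytes(q):
    """Split a 64-bit integer into eight unsigned byte lanes, lowest first."""
    return [(q >> (8 * i)) & 0xFF for i in range(8)]

def pack_bytes(lanes):
    """Combine eight byte lanes (lowest first) into one 64-bit integer."""
    q = 0
    for i, lane in enumerate(lanes):
        q |= (lane & 0xFF) << (8 * i)
    return q

def signed8(b):
    """Interpret an unsigned byte as a two's-complement signed value."""
    return b - 0x100 if b & 0x80 else b

def lanewise(op, a, b):
    """Apply op to each pair of byte lanes of a and b."""
    return pack_bytes([op(x, y) for x, y in zip(unpack_bytes(a), unpack_bytes(b))])

def paddb(a, b):
    """Wrap-around addition: the carry out of each byte is dropped."""
    return lanewise(lambda x, y: (x + y) & 0xFF, a, b)

def paddusb(a, b):
    """Unsigned saturating addition: results are clamped to 255."""
    return lanewise(lambda x, y: min(x + y, 0xFF), a, b)

def paddsb(a, b):
    """Signed saturating addition: results are clamped to -128..127."""
    return lanewise(lambda x, y: max(-128, min(127, signed8(x) + signed8(y))), a, b)

def psubusb(a, b):
    """Unsigned saturating subtraction: results are clamped to 0."""
    return lanewise(lambda x, y: max(x - y, 0), a, b)

# Fade-out step on two 32-bit pixels at once: dark channels stop at zero.
pixels = 0x0010203000F0E0D0
step = 0x0020202000202020
print(hex(psubusb(pixels, step)))
```

In the fade-out call, the channels that are already darker than the step simply become zero, so no per-channel test for black is needed.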

Specifying the operands

A major difference between FPU and MMX applications is the way instructions receive their operands. FPU instructions are not followed by explicit operands. They fetch their operands via a stack pointer (top of stack) held in bits 11 to 13 of the status register. MMX instructions, by contrast, work like CPU instructions with operands specified explicitly in the instruction.
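
As a small illustration of the stack pointer mentioned above, the following Python sketch extracts bits 11 to 13 from a 16-bit FPU status word. The function name `fpu_top` is hypothetical, and the status word is assumed to be supplied by the caller as an integer.

```python
def fpu_top(status_word):
    """Return the top-of-stack index held in bits 11 to 13 of the status word."""
    return (status_word >> 11) & 0b111

print(fpu_top(0x3800))  # all three bits set
```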

An MMX instruction can have no, one, or two source and destination operands. These can be MMX registers (MMX), general-purpose registers (Reg), memory locations (Mem), or constants (Const) of various sizes (8, 16, 32, or 64 bits). Which operands are permitted for a particular instruction varies and is noted in the reference literature. A reference entry such as

    Command Mem32, MMX
    Command MMX, Reg32
    Command Reg32, MMX

would state, for example, that the instruction can operate from an MMX register to a 32-bit memory location, from a 32-bit general-purpose register to an MMX register, and vice versa.

Timing behavior

Most MMX instructions execute in a single processor cycle. The multiply instructions need three cycles until the result is available, but a new multiplication can be pushed into the pipeline after every cycle (Pentium MMX to Pentium III).
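
The effect of pipelining on the multiply instructions can be estimated with simple arithmetic. The sketch below is a simplified model, assuming that all multiplications are independent of each other and that nothing else stalls the pipeline; the function name and parameters are invented for this illustration.

```python
def multiply_cycles(count, latency=3, pipelined=True):
    """Estimate the cycles until the last of `count` multiplications finishes."""
    if count == 0:
        return 0
    if pipelined:
        # One multiplication enters per cycle; the last one needs the full latency.
        return latency + (count - 1)
    # Without pipelining, every multiplication waits for the previous one.
    return latency * count

print(multiply_cycles(4), multiply_cycles(4, pipelined=False))
```

Under these assumptions, four multiplications finish after six cycles instead of twelve.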

Instruction set

A total of 24 new instructions apply to the different data types, which results in the 57 instructions stated by Intel. Some of these 24 instructions differ only in how they treat the sign and the kind of overflow, so that in principle only 15 basic operations remain.

Because MMX works with packed data, most instructions begin with a P, which distinguishes them from the FPU instructions beginning with F. Apart from the leading P, MMX mnemonics consist of a CPU-like operation name (such as ADD or CMP), optionally S or US for signed or unsigned saturation mode, and one of the letters B, W, D, or Q for the data format. The instruction PADDSW, for example, reads: P for Packed, ADD for addition, S for signed saturation mode, applied to the data type Word. The MMX instruction set includes instructions for:

  • Arithmetic manipulation of data
  • Logical manipulation of data
  • Data exchange
  • Data Comparison
  • Data conversion
  • MMX status

For detailed questions about the instruction set, refer to the Intel Architecture Software Developer's Manual, Volume 2: Instruction Set Reference; see the Literature section.

Arithmetic instructions

For addition in wrap-around mode there are three instructions (PADDB, PADDW, PADDD) for the data types PackedByte, PackedWord, and PackedDoubleWord. In saturation mode there are instructions for signed (PADDSB, PADDSW) and unsigned (PADDUSB, PADDUSW) addition of PackedBytes and PackedWords. A saturating instruction for the addition of Double Words does not exist. In neither mode is an overflow or underflow of the value range indicated, for example by setting flags.

The subtraction instructions are structured analogously to those for addition.

Multiplication poses the problem that the results can exceed the register size of 64 bits. This was solved by storing the high-order and low-order parts of the result in two different registers. PMULLW (Multiply Packed Word and Store Low) is used to multiply and keep the low-order part, and PMULHW (Multiply Packed Word and Store High) to keep the high-order part.

The PMADDWD instruction multiplies four pairs of 16-bit words and adds the results in pairs.
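
The multiplication instructions described above can be modeled on 64-bit integers as follows. This Python sketch treats the four 16-bit lanes as signed values; the helper names (`words`, `from_words`, `signed16`, `products`) are assumptions for this illustration, while `pmullw`, `pmulhw`, and `pmaddwd` model the instructions of the same name.

```python
def words(q):
    """Split a 64-bit integer into four unsigned 16-bit lanes, lowest first."""
    return [(q >> (16 * i)) & 0xFFFF for i in range(4)]

def from_words(lanes):
    """Combine four 16-bit lanes (lowest first) into one 64-bit integer."""
    q = 0
    for i, lane in enumerate(lanes):
        q |= (lane & 0xFFFF) << (16 * i)
    return q

def signed16(w):
    """Interpret an unsigned 16-bit value as two's-complement signed."""
    return w - 0x10000 if w & 0x8000 else w

def products(a, b):
    """Full signed products of the four word pairs (each fits in 32 bits)."""
    return [signed16(x) * signed16(y) for x, y in zip(words(a), words(b))]

def pmullw(a, b):
    """Keep the low-order 16 bits of each product."""
    return from_words([p & 0xFFFF for p in products(a, b)])

def pmulhw(a, b):
    """Keep the high-order 16 bits of each product."""
    return from_words([(p >> 16) & 0xFFFF for p in products(a, b)])

def pmaddwd(a, b):
    """Multiply four word pairs and add neighboring products into two dwords."""
    p = products(a, b)
    low = (p[0] + p[1]) & 0xFFFFFFFF
    high = (p[2] + p[3]) & 0xFFFFFFFF
    return (high << 32) | low

a = from_words([1, 2, 3, 4])
b = from_words([5, 6, 7, 8])
print(hex(pmaddwd(a, b)))  # low dword holds 1*5 + 2*6, high dword 3*7 + 4*8
```

A full 32-bit product of each lane can be rebuilt by combining the matching lanes of the `pmulhw` and `pmullw` results.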

The shift instructions work analogously to the shift instructions of the CPU such as SHL, SHR, and SAR, except that MMX does not set the carry flag. They apply only to Words, Double Words, and Quad Words, not to bytes. PSLLW and PSLLD are used for the logical left shift and PSRLW and PSRLD for the opposite direction. PSRAW and PSRAD are available for arithmetic shifts, and PSLLQ and PSRLQ for logical shifts of Quad Words.
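
The word-wise shifts can be sketched in the same lane model. The Python code below is an illustration only; the helper names are assumptions, and the handling of shift counts above 15 (all zeros for logical shifts, sign fill for the arithmetic shift) follows the behavior documented for these instructions.

```python
def words(q):
    """Split a 64-bit integer into four unsigned 16-bit lanes, lowest first."""
    return [(q >> (16 * i)) & 0xFFFF for i in range(4)]

def from_words(lanes):
    """Combine four 16-bit lanes (lowest first) into one 64-bit integer."""
    q = 0
    for i, lane in enumerate(lanes):
        q |= (lane & 0xFFFF) << (16 * i)
    return q

def signed16(w):
    """Interpret an unsigned 16-bit value as two's-complement signed."""
    return w - 0x10000 if w & 0x8000 else w

def psllw(a, count):
    """Logical left shift of each word; bits shifted out are lost."""
    if count > 15:
        return 0
    return from_words([(w << count) & 0xFFFF for w in words(a)])

def psrlw(a, count):
    """Logical right shift of each word; zeros are shifted in."""
    if count > 15:
        return 0
    return from_words([w >> count for w in words(a)])

def psraw(a, count):
    """Arithmetic right shift of each word; the sign bit is shifted in."""
    count = min(count, 15)
    return from_words([signed16(w) >> count for w in words(a)])

print(hex(psrlw(0x8000, 4)), hex(psraw(0x8000, 4)))
```

The two calls differ only in what is shifted in from the left: zeros for the logical shift, copies of the sign bit for the arithmetic one.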

Logical operations

The bit manipulation instructions PAND, POR, and PXOR are identical to the CPU instructions AND, OR, and XOR, except that they process 64 bits, that is, a Quad Word, at a time.

An MMX equivalent of the CPU instruction NOT does not exist. The only logical MMX instruction without a counterpart in the CPU instruction set is PANDN, which negates the first operand and then ANDs it with the second operand, in the following form: x = y AND (NOT x)
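
A short model shows how PANDN is used in practice. In the following Python sketch, `pandn` models the instruction on 64-bit integers; `pnot` and `overlay` are hypothetical helpers that illustrate how a missing NOT can be emulated with an all-ones operand and how a masked image overlay of the kind described under the multimedia requirements might be composed. The mask convention (set bits select the foreground) is an assumption of this example.

```python
MASK64 = (1 << 64) - 1

def pandn(x, y):
    """Model of PANDN: return y AND (NOT x) on 64 bits."""
    return (~x & MASK64) & y

def pnot(x):
    """Emulate NOT by applying PANDN with an all-ones second operand."""
    return pandn(x, MASK64)

def overlay(sprite, background, mask):
    """Take sprite bits where mask is set and background bits elsewhere."""
    return (sprite & mask) | pandn(mask, background)

print(hex(overlay(0x1122, 0xAABB, 0xFF00)))
```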

Data exchange

Analogous to the CPU instruction MOV, there are the two instructions MOVD and MOVQ for Double Words and Quad Words. Because of the machine architecture, that is, the differing sizes of the 64-bit MMX registers, the 32-bit general-purpose registers, and the 32-bit address bus, both instructions are subject to certain restrictions on the permitted operands.

MOVD cannot be used to exchange data between two MMX registers, since MMX registers hold only 64-bit data. It therefore permits only the exchange between an MMX register and 32-bit general-purpose registers or memory locations, in both directions. The possible forms are:

    MOVD MMX, Mem32
    MOVD Mem32, MMX
    MOVD MMX, Reg32
    MOVD Reg32, MMX

In each case only the lower bits 0 to 31 of the MMX register are affected. When data is moved out of an MMX register, only these bits are used. When data is moved into an MMX register, the upper part (bits 32 to 63) is cleared, that is, set to zero.
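
The two directions of MOVD can be sketched as follows. The Python functions below are a minimal model on plain integers; their names are invented for this illustration.

```python
MASK32 = 0xFFFFFFFF

def movd_to_mmx(value32):
    """Move 32 bits into an MMX register: the upper 32 bits become zero."""
    return value32 & MASK32

def movd_from_mmx(mm):
    """Move out of an MMX register: only bits 0 to 31 are used."""
    return mm & MASK32

print(hex(movd_to_mmx(0xDEADBEEF)), hex(movd_from_mmx(0x1122334455667788)))
```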

MOVQ allows bidirectional data exchange between all 64-bit MMX registers and memory locations. Data exchange with the 32-bit general-purpose registers is not provided. The possible forms are therefore:

    MOVQ MMX, MMX
    MOVQ MMX, Mem64
    MOVQ Mem64, MMX

Data comparison

The MMX instructions for data comparison are less flexible and less powerful than the corresponding CPU and FPU instructions. They only allow testing the two operands for equality or checking whether the value in the first operand is greater than that in the second. Both comparisons are available for the three formats Byte, Word, and Double Word. This yields the instructions PCMPEQB, PCMPEQW, PCMPEQD, PCMPGTB, PCMPGTW, and PCMPGTD (EQ stands for equal, GT for greater than). Only an MMX register is allowed as the first operand; the second may be an MMX register or a 64-bit memory location.

A major difference from the CPU and FPU is the way the result of the comparison is delivered. It is not indicated by setting flags or individual bits (for example in the status register of the FPU) but is stored in the first operand, that is, in an MMX register. If the comparison yields a true result, the hexadecimal value FF, FFFF, or FFFFFFFF is written there, depending on the data format. Otherwise zeros are written. A comparison of two Double Words for equality by PCMPEQD MMx, MMy could therefore be expressed as follows:

    IF MMx[31..00] = MMy[31..00] THEN MMx[31..00] := $FFFFFFFF
                                 ELSE MMx[31..00] := $00000000;
    IF MMx[63..32] = MMy[63..32] THEN MMx[63..32] := $FFFFFFFF
                                 ELSE MMx[63..32] := $00000000;

Data conversion

MMX instructions make it possible to convert a datum into a smaller or a larger format; a conversion to a smaller data format naturally always entails a loss of data.

  • To convert to a smaller format there are the instructions PACKSSWB, PACKSSDW, and PACKUSWB for converting Word to Byte and Double Word to Word. To preserve the sign, the most significant bit of the target datum is not used for the magnitude. Only half of the value range is therefore available. For this reason the instructions saturate values that fall above or below this range. PACKSSWB, for example, sets all values below -128 to -128 and all values above 127 to 127. PACKUSWB (Pack with Unsigned Saturation Word to Byte) does not keep the sign but still saturates, in its case to the unsigned byte range 0 to 255.
  • Conversion to a larger format is possible from Byte to Word, Word to Double Word, and Double Word to Quad Word. In each case there is one instruction for converting the high-order part and one for the low-order part of the data: the former is covered by the three instructions PUNPCKHBW, PUNPCKHWD, and PUNPCKHDQ, the latter by PUNPCKLBW, PUNPCKLWD, and PUNPCKLDQ.
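
The conversions in both directions can be illustrated with a lane model. The following Python sketch models PACKSSWB, PACKUSWB, and PUNPCKLBW on 64-bit integers; the helper names are assumptions for this example. Note that the unpack instruction interleaves the bytes of its two operands, so a widening of unsigned bytes to words results when the second operand is zero.

```python
def bytes_of(q):
    """Split a 64-bit integer into eight unsigned byte lanes, lowest first."""
    return [(q >> (8 * i)) & 0xFF for i in range(8)]

def from_bytes(lanes):
    """Combine eight byte lanes (lowest first) into one 64-bit integer."""
    q = 0
    for i, lane in enumerate(lanes):
        q |= (lane & 0xFF) << (8 * i)
    return q

def signed_words(q):
    """Split a 64-bit integer into four signed 16-bit lanes, lowest first."""
    lanes = [(q >> (16 * i)) & 0xFFFF for i in range(4)]
    return [w - 0x10000 if w & 0x8000 else w for w in lanes]

def packsswb(dst, src):
    """Pack eight signed words into bytes, saturating to -128..127."""
    ws = signed_words(dst) + signed_words(src)
    return from_bytes([max(-128, min(127, w)) for w in ws])

def packuswb(dst, src):
    """Pack eight signed words into bytes, saturating to 0..255."""
    ws = signed_words(dst) + signed_words(src)
    return from_bytes([max(0, min(255, w)) for w in ws])

def punpcklbw(dst, src):
    """Interleave the four low-order bytes of dst and src."""
    d, s = bytes_of(dst), bytes_of(src)
    out = []
    for i in range(4):
        out += [d[i], s[i]]
    return from_bytes(out)

values = 0xFFFB0005FED4012C  # words, lowest first: 300, -300, 5, -5
print(hex(packsswb(values, 0)), hex(packuswb(values, 0)))
```

In the example, 300 and -300 are clamped to 127 and -128 by the signed pack, and to 255 and 0 by the unsigned pack, while the small values pass through unless they are negative in the unsigned case.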

MMX status

The three instructions for MMX status are EMMS, FXSAVE, and FXRSTOR. EMMS takes no operands; FXSAVE and FXRSTOR take only the address of the 512-byte memory area. EMMS is a kind of clean-up instruction after an MMX application has finished. FXSAVE and FXRSTOR save and restore FPU-specific data, flags, and registers; see also the section Switching between FPU and MMX.

Exceptions

Since MMX instructions do not differ fundamentally from CPU instructions, they can in general trigger the same exceptions. FPU-specific floating-point exceptions, such as those raised by denormalized operands, cannot occur when the registers are used by MMX.

CPUs with MMX

Since MMX was the first extension of the x86 architecture, practically all CPUs of recent years have MMX. A complete list of all CPUs with MMX would therefore go beyond the scope of this article. See instead the list of microprocessors.

The following overview shows from which CPU family onward the respective manufacturers integrated MMX:

Programming languages

To turn the extended potential of a new processor concept such as MMX into optimized application software, the extended capabilities of the machine language must also be supported by new versions of the various high-level programming languages, at their diverse levels of abstraction, and by their compilers.

On the one hand, a language may simply exploit the capabilities of MMX in the compilation process without extending the instruction set of the language itself. For the programmer this changes very little; for reasons of backward compatibility, he merely has to specify before compilation whether or not MMX is to be used in the target code.

On the other hand, a language can also extend its instruction set and offer new concepts and statements for writing source code that specifically support the strengths of MMX. Free Pascal, for example, provides predefined array types specifically for MMX and 3DNow!. Vector Pascal allows parallel operations on data.

Among the low-level languages, the Microsoft Macro Assembler supported the new capabilities of MMX in version 6.12, nine months after the launch of MMX. The Flat Assembler and NASM later supported MMX as well. Intel supported MMX relatively quickly in its own C compilers and later in C++. The VectorC compiler from Codeplay also supports vectorization and optimizes C source code for MMX during translation. Other programming languages followed later with implementations of the MMX capabilities.

Use in software

Like AMD's 3DNow!, MMX was not used by the software industry to the extent Intel had hoped. Only a few products carry an explicit note such as "Optimized for MMX". It was most likely to be used in games and in video applications such as Ulead VideoStudio. One of the applications that implemented the MMX capabilities relatively quickly was Adobe Photoshop (see also the Performance section).

Performance

Statements about performance are strongly influenced by the overall system, the applications tested, the algorithms used, the test method, the testing company, and many other boundary conditions. Intel itself promised 10 to 20% more performance for MMX processors with conventional software and up to 60% more with MMX-optimized software. Particularly in 3D graphics with many floating-point calculations, however, MMX brings hardly any performance improvement (see diagram), since switching between MMX and FPU arithmetic (a "context switch") can take a relatively long time, up to 50 clock cycles.

In 2000, Sreraman and Govindarajan determined performance increases by factors of 2 to 6.5 for MMX in connection with vectorization in the C language. When Intel's own program libraries for signal and image processing are used, MMX brings performance improvements by a factor of 1.5 to 2, and in graphics applications by a factor of 4 to 6. According to other studies, the use of MMX yields performance advantages by factors of 1.2 to 1.75. In MPEG decoding, according to Intel, the performance gain from MMX is limited to 40 percent. MMX can thus bring significant performance advantages over non-optimized software only for particular tasks.

Test results can vary greatly even when different versions of the same software are compared. In one test, the MMX-optimized version 4.0 of Adobe Photoshop showed performance gains of 5 to 20 percent for most filters. In version 4.0.1, however, some actions surprisingly ran slower with MMX than without MMX support.

After MMX

In this form, MMX was soon no longer sufficient to meet the growing demands of high-resolution, rapidly changing graphics such as those in games. Intel therefore introduced the SSE technology together with the Pentium III processor in early 1999. For it, eight new 128-bit registers were created that are physically independent of the CPU and FPU. The MMX instruction set was extended, and entirely new instructions were created as well. SSE also extended the work of MMX, which was restricted to integers, to floating-point numbers. Subsequent versions steadily expanded the capabilities of SSE.

3DNow!, introduced by AMD in 1998 with the AMD K6-2, used the FPU registers as MMX does, but in a manner suited to the FPU, for processing floating-point numbers. Later versions of 3DNow! eliminated incompatibilities with Intel's SSE concept.

Extension of the MMX instruction set under SSE

With SSE, twelve new instructions for MMX mode were introduced that do not work with the new XMM registers of SSE but only with the old MMX or FPU registers.

  • PAVGB and PAVGW compute the rounded average of the two operands.
  • PEXTRW and PINSRW extract and insert Words.
  • PMAXSW, PMAXUB, PMINSW, and PMINUB compute the maxima and minima of two signed Words or unsigned Bytes.
  • PMOVMSKB creates a mask from the most significant bits of the bytes of a Short Packed Byte.
  • PMULHUW works like the older instruction PMULHW but, in contrast, on two unsigned Words.
  • PSADBW computes, for two values, the absolute values of the differences between their individual bytes and then adds these differences up.
  • PSHUFW rearranges the individual words of a 64-bit value according to a rule that is passed in a third instruction operand.
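
Some of the instructions in the list above can be modeled in a few lines. The following Python sketch covers PAVGB, PSADBW, PMOVMSKB, and PSHUFW on 64-bit integers; the helper names are assumptions for this illustration, and the shuffle rule is modeled as an 8-bit value in which each pair of bits selects the source word for one destination word.

```python
def bytes_of(q):
    """Split a 64-bit integer into eight unsigned byte lanes, lowest first."""
    return [(q >> (8 * i)) & 0xFF for i in range(8)]

def from_bytes(lanes):
    """Combine eight byte lanes (lowest first) into one 64-bit integer."""
    q = 0
    for i, lane in enumerate(lanes):
        q |= (lane & 0xFF) << (8 * i)
    return q

def words(q):
    """Split a 64-bit integer into four unsigned 16-bit lanes, lowest first."""
    return [(q >> (16 * i)) & 0xFFFF for i in range(4)]

def from_words(lanes):
    """Combine four 16-bit lanes (lowest first) into one 64-bit integer."""
    q = 0
    for i, lane in enumerate(lanes):
        q |= (lane & 0xFFFF) << (16 * i)
    return q

def pavgb(a, b):
    """Rounded average of each pair of unsigned bytes."""
    return from_bytes([(x + y + 1) >> 1 for x, y in zip(bytes_of(a), bytes_of(b))])

def psadbw(a, b):
    """Sum of the absolute differences of the eight byte pairs."""
    return sum(abs(x - y) for x, y in zip(bytes_of(a), bytes_of(b)))

def pmovmskb(a):
    """Collect the most significant bit of each byte into an 8-bit mask."""
    mask = 0
    for i, byte in enumerate(bytes_of(a)):
        mask |= (byte >> 7) << i
    return mask

def pshufw(src, order):
    """Select each destination word from src using two bits of `order`."""
    w = words(src)
    return from_words([w[(order >> (2 * i)) & 3] for i in range(4)])

# Order 0x1B selects words 3, 2, 1, 0 and thus reverses the four words.
print(hex(pshufw(0x0004000300020001, 0x1B)))
```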

SSE2 and SSE4

With SSE2, a unified instruction set was implemented that can be used on the 128-bit XMM registers as well as on the 64-bit MMX registers. Some instructions even allow both groups of registers to be used at once, such as the conversion instruction CVTPD2PI MMX, XMM. With SSE4, however, support for MMX was ended.
