Byte Order Mark

A byte order mark (BOM) is the Unicode character U+FEFF (zero width no-break space) placed at the beginning of a data stream, where it serves as a signature that identifies the byte order and encoding form of UCS/Unicode strings, in particular in text files.

In UTF-16 and UTF-32 the byte order must be specified, because characters are encoded as 16-bit or 32-bit values, each of which occupies several bytes. The byte order mark indicates the order in which these bytes must be interpreted. This marker matters most when data is exchanged between different systems.

In UTF-16, the BOM is the two-byte sequence FE FF in big-endian notation and the reversed sequence FF FE in little-endian notation. Since U+FFFE is defined as a noncharacter, the byte order can be determined unambiguously from these two bytes. In UTF-32, two zero bytes precede the sequence (big-endian: 00 00 FE FF) or follow it (little-endian: FF FE 00 00); they are likewise used to detect the byte order.
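The byte sequences described above can be reproduced with a short Python sketch. It relies on the standard codecs with explicit endianness (`utf-16-be`, `utf-16-le`, `utf-32-be`, `utf-32-le`), which do not add a BOM of their own; the helper name `bom_bytes` is chosen for illustration only.

```python
BOM = "\ufeff"  # U+FEFF, the byte order mark character


def bom_bytes(codec: str) -> str:
    """Return the bytes of U+FEFF in the given codec as spaced hex."""
    return BOM.encode(codec).hex(" ").upper()


for codec in ("utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"):
    print(f"{codec:<10} {bom_bytes(codec)}")

# Reading a little-endian BOM with the wrong (big-endian) byte order
# yields the noncharacter U+FFFE, which reveals the mismatch.
swapped = b"\xff\xfe".decode("utf-16-be")
print(f"U+{ord(swapped):04X}")
```

The last two lines illustrate why the scheme works: a reader that assumes the wrong byte order sees U+FFFE rather than U+FEFF and can switch to the other order.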

The UTF-8 encoding of the BOM is the byte sequence EF BB BF, which text editors and browsers that do not support UTF-8 usually display as the ISO 8859-1 characters ï»¿. The problem of byte order does not arise in UTF-8, but a BOM at the beginning of a file or string is permitted in order to mark the encoding as UTF-8. This does not guarantee a reliable distinction between UTF-8 and the ISO 8859 character sets, because in the 8-bit character sets every byte sequence is valid, including the UTF-8 encoding of the BOM. When the only alternatives are UTF-8 and ISO 8859-1, however, the pragmatic assumption that the string ï»¿ is not intended as text is quite common.
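A minimal sketch of BOM sniffing in Python follows. It uses the BOM constants from the standard `codecs` module; the function name `detect_bom` and the table `_BOMS` are hypothetical names introduced for this example. As noted above, the result is a heuristic: FF FE 00 00 could also be a UTF-16LE BOM followed by U+0000, and EF BB BF is valid ISO 8859-1 text.

```python
import codecs

# Longer signatures come first, because the UTF-32LE BOM (FF FE 00 00)
# begins with the UTF-16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def detect_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


# The UTF-8 BOM interpreted as ISO 8859-1 gives the three visible characters.
as_latin1 = codecs.BOM_UTF8.decode("latin-1")
print(as_latin1)
print(detect_bom(b"\xef\xbb\xbfDie"))
```

A caller would typically decode `data[length:]` with the returned encoding and fall back to a default when no BOM is present.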

Using a BOM can, however, cause problems with programs that do not expect or recognize a byte order mark. In Unix-like environments, for example, script files often rely on the shebang mechanism, which requires the string "#!" to stand at the very beginning of the file. If an unexpected BOM stands there instead, the mechanism fails. Some compilers, such as gcc, also report a stray character at the beginning of the file when a BOM is present, and PHP prior to version 6 with default settings sends the BOM to the browser as output, so that without output buffering no further HTTP headers can be sent.
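The shebang problem can be checked at the byte level. The following Python sketch assumes the script is available as a byte string; `shebang_status` and `strip_utf8_bom` are hypothetical helper names, not part of any existing tool.

```python
import codecs


def shebang_status(data: bytes) -> str:
    """Classify the start of a script file with respect to its shebang."""
    if data.startswith(b"#!"):
        return "ok"
    if data.startswith(codecs.BOM_UTF8 + b"#!"):
        # The kernel looks for "#!" in the first two bytes and finds EF BB.
        return "blocked by BOM"
    return "no shebang"


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM if there is one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


script = codecs.BOM_UTF8 + b"#!/bin/sh\necho hello\n"
print(shebang_status(script))
print(shebang_status(strip_utf8_bom(script)))
```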

In Java, the byte order mark is not detected automatically when reading UTF-8 text. It is up to the application software to remove the resulting stray character 0xFEFF when needed.
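The same kind of cleanup can be illustrated in Python, used here only for consistency with the other sketches and not as a statement about Java's API. Python's plain `utf-8` codec also leaves the BOM in the decoded text as U+FEFF, while the `utf-8-sig` codec removes it; `strip_bom` is a hypothetical helper showing the manual removal an application would otherwise perform.

```python
import codecs

raw = codecs.BOM_UTF8 + "Die".encode("utf-8")

plain = raw.decode("utf-8")        # the BOM survives as a leading U+FEFF
cleaned = raw.decode("utf-8-sig")  # the BOM is removed if present


def strip_bom(text: str) -> str:
    """Drop a single leading U+FEFF from already decoded text."""
    return text[1:] if text.startswith("\ufeff") else text


print(repr(plain))
print(repr(cleaned))
print(repr(strip_bom(plain)))
```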

Byte sequences of the BOM in different character encodings

Encoding of text in UTF-16 as shown in a hex editor:

4400 6900 6500 | D i e | = UTF-16LE / UCS-2LE
0044 0069 0065 | D i e | = UTF-16BE / UCS-2BE

Web links

The Unicode Standard, chapter 2.6 Encoding Schemes (English, PDF, 1.10 MiB)
The Unicode Standard, chapter 2.13 Special Characters and Noncharacters, section Byte Order Mark (BOM) (English, PDF, 1.10 MiB)
The Unicode Standard, chapter 16.8 Specials, section Byte Order Mark (BOM): U+FEFF (English, PDF, 415 KiB)
Unicode FAQ: UTF-8, UTF-16, UTF-32 & BOM (English)
