UTF-16

UTF-16 (Universal Multiple-Octet Coded Character Set (UCS) Transformation Format for 16 Planes of Group 00) is a variable-length encoding for Unicode characters. UTF-16 is optimized for the most frequently used characters, those of the Basic Multilingual Plane (BMP). It is the oldest of the Unicode encoding formats.

General

In UTF-16, each Unicode character is assigned a specially encoded byte sequence two or four bytes long, so that, as in the other UTF formats, all Unicode characters can be represented.

While UTF-8 plays a central role in Internet protocols, UTF-16 is widely used for internal string representation, for example in current versions of Java.

Properties

Because every BMP character is encoded in two bytes, UTF-16 needs twice as much space as UTF-8 or the ISO 8859 encodings for texts that consist mainly of Latin letters. However, if a text contains many BMP characters beyond code point U+007F, such as Chinese characters, UTF-16 has a comparable or smaller footprint.
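The space trade-off described above can be checked with a short Python sketch. The two sample strings and the helper name `sizes` are illustrative assumptions; the encodings are produced without a byte order mark so that only the character data is counted.

```python
# Compare the encoded size of a Latin sample and a Chinese sample.
# Both strings are arbitrary examples chosen for illustration.
latin = "Hello, world"
chinese = "\u4f60\u597d\uff0c\u4e16\u754c"  # five BMP characters beyond U+007F


def sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))


print("latin  :", sizes(latin))
print("chinese:", sizes(chinese))
```

For the Latin sample, UTF-16 takes twice as many bytes as UTF-8. For the Chinese sample, UTF-8 needs three bytes per character, so UTF-16 is the smaller of the two.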

Unlike UTF-8, UTF-16 has no encoding reserve. If UTF-16 encoded text is interpreted as ASCII, the Latin letters are still recognizable, but they are separated by null bytes.
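The null bytes mentioned above become visible when a short Latin string is encoded and the bytes are then read as ASCII. This is a minimal sketch; the sample string and the choice of little-endian order are assumptions made for illustration.

```python
# Encode a Latin sample as UTF-16 (little-endian, no BOM) and look at the raw bytes.
data = "Abc".encode("utf-16-le")
print(data)

# Reading the same bytes as ASCII succeeds, because 0x00 is a valid ASCII code,
# but every letter is followed by a NUL character.
as_ascii = data.decode("ascii")
print(repr(as_ascii))
```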

Standardization

UTF-16 is defined both by the Unicode Consortium and by ISO/IEC 10646. Unicode defines additional semantics. A more detailed comparison can be found in Appendix C of the Unicode 4.0 standard. The ISO standard also defines a further encoding, UCS-2, in which only 16-bit representations of the BMP are permitted.

Coding

The characters of the BMP are mapped directly to the 16 bits of a UTF-16 code unit. The BMP contains the Unicode characters whose code points lie in the range U+0000 to U+FFFF.

Unicode characters outside the BMP (i.e., U+10000 to U+10FFFF) are represented by two 16-bit words (code units), which are formed as follows:

First, 65536 (10000hex, the size of the BMP) is subtracted from the character's code point, which yields a 20-bit number in the range 00000hex to FFFFFhex. This number is then split into two blocks of 10 bits each; the first block is prefixed with the bit sequence 110110 and the second with the bit sequence 110111. The first of the two resulting 16-bit words is called the high surrogate, the second the low surrogate. As the names suggest, the high surrogate contains the 10 high-order bits and the low surrogate the 10 low-order bits of the original code point reduced by 65536. The code ranges U+D800 to U+DBFF (high surrogates) and U+DC00 to U+DFFF (low surrogates) are reserved specifically for these UTF-16 surrogates and therefore contain no characters of their own.
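The construction described above can be sketched in a few lines of Python. The function names `to_surrogates` and `from_surrogates` are hypothetical helpers introduced for illustration, and U+1D11E (musical symbol G clef) is an arbitrary sample code point.

```python
def to_surrogates(code_point):
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000      # 20-bit value, 0x00000..0xFFFFF
    high = 0xD800 | (offset >> 10)     # bits 110110 followed by the upper 10 bits
    low = 0xDC00 | (offset & 0x3FF)    # bits 110111 followed by the lower 10 bits
    return high, low


def from_surrogates(high, low):
    """Recombine a (high, low) surrogate pair into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x1D11E)
print(f"U+1D11E -> {high:04X} {low:04X}")
```

The result can be compared with Python's built-in codec: `chr(0x1D11E).encode("utf-16-be")` yields the same two code units as four bytes.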

When converting UTF-16 strings to UTF-8 byte sequences, note that a high and low surrogate pair must first be combined into a single Unicode code point before that code point is converted to a UTF-8 byte sequence (see the example in the description of UTF-8). Because this is often overlooked, a different, incompatible encoding of the surrogates became established, which was later standardized as CESU-8.
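The difference between the two conversions can be illustrated with a small Python sketch. Both function names are hypothetical, and the input is assumed to be a list of 16-bit code units given as integers. The second function deliberately skips the recombination step, using Python's `surrogatepass` error handler to encode each surrogate on its own, which produces the CESU-8 style byte sequence.

```python
def utf16_units_to_utf8(units):
    """Convert 16-bit code units to UTF-8, combining surrogate pairs first."""
    out = bytearray()
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Combine the pair into one code point before encoding.
            code_point = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        elif 0xD800 <= unit <= 0xDFFF:
            raise ValueError("unpaired surrogate")
        else:
            code_point = unit
            i += 1
        out += chr(code_point).encode("utf-8")
    return bytes(out)


def cesu8_style(units):
    """Encode every code unit separately; each surrogate becomes three bytes."""
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)


pair = [0xD834, 0xDD1E]  # surrogate pair for U+1D11E
print(utf16_units_to_utf8(pair).hex(" "))
print(cesu8_style(pair).hex(" "))
```

The first call yields a single four-byte UTF-8 sequence, while the second yields two three-byte sequences that a strict UTF-8 decoder rejects.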

Byte Order

Depending on which of the two bytes of a code unit is transmitted or stored first, the encoding is called big-endian (UTF-16BE) or little-endian (UTF-16LE).

For protocols that do not adequately specify the byte order, it is recommended to place the Unicode character U+FEFF (BOM, byte order mark), which represents a zero-width no-break space, at the beginning of the data stream. If it is read as the invalid Unicode character U+FFFE (a noncharacter), the byte order differs between sender and receiver, and the receiver must swap the bytes of each 16-bit word in order to interpret the rest of the data stream correctly.
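A receiver could apply the rule above roughly as follows. This is a minimal sketch: the function names and the big-endian fallback for data without a BOM are assumptions, and the check does not distinguish a UTF-16LE BOM from the start of a UTF-32LE BOM.

```python
def detect_utf16_byte_order(data):
    """Return "big" or "little" based on a leading BOM, or None if there is none."""
    if data[:2] == b"\xfe\xff":
        return "big"
    if data[:2] == b"\xff\xfe":
        return "little"
    return None


def decode_utf16(data, default="big"):
    """Decode UTF-16 bytes, honoring a leading BOM and stripping it if present."""
    order = detect_utf16_byte_order(data)
    if order is None:
        order = default          # assumed fallback when no BOM is present
    else:
        data = data[2:]          # drop the BOM itself
    codec = "utf-16-be" if order == "big" else "utf-16-le"
    return data.decode(codec)


# U+FEFF stored big-endian but read little-endian appears as U+FFFE.
print(hex(int.from_bytes(b"\xfe\xff", "little")))
print(decode_utf16(b"\xff\xfeA\x00"))
```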

Examples

Some coding examples for UTF-16 are given in the following table:

The last two examples lie outside the BMP. Because many fonts do not yet cover these newer Unicode ranges, the characters in them cannot be displayed correctly on many platforms. Instead, a substitute glyph is shown that serves as a placeholder. In the examples, subtracting 10000hex changes only one or two bits (magenta), and the surrogates are then formed from the resulting bits.
