UTF-8

UTF-8 (short for 8-bit UCS Transformation Format, where UCS in turn stands for Universal Character Set) is the most widely used encoding for Unicode characters (Unicode and UCS are virtually identical). The encoding was designed in September 1992 by Ken Thompson and Rob Pike while working on the Plan 9 operating system. Within X/Open it was initially called FSS-UTF (file system safe UTF, in contrast to UTF-1, which does not have this property); in the following years it was renamed to the now common name UTF-8 in the course of standardization.

UTF-8 coincides with ASCII in the first 128 characters (indices 0-127), so it usually needs only one byte per character for many Western languages, and for English text in particular. Such text can generally be edited without modification even in text editors that are not UTF-8 capable. This is one of the reasons for its status as the de facto standard character encoding of the Internet and of related document types.

In other languages, the storage requirement in bytes per character grows as soon as the text departs from the ASCII character set: the German umlauts already require two bytes, and characters of Cyrillic, Far Eastern, and African scripts take up to four bytes per character. Because UTF-8 is a multibyte encoding, processing it also takes more computational effort than encodings with a fixed number of bytes per character, since every byte has to be analyzed, and for some languages it takes more storage space as well. Depending on the application scenario, other UTF encodings are therefore used to represent Unicode: Microsoft Windows, the most widely used desktop operating system, internally uses UTF-16 Little Endian as a compromise between UTF-8 and UTF-32.
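
The storage differences described above can be checked with a short Python sketch. The sample characters and the helper name encoded_sizes are chosen here purely for illustration; only the standard str.encode method is used.

```python
# Sample characters (illustrative choice, not taken from the article).
samples = {
    "A": "Latin letter (ASCII)",
    "ä": "German umlaut",
    "Я": "Cyrillic letter",
    "中": "Chinese character",
    "𝄞": "musical symbol outside the BMP",
}


def encoded_sizes(ch):
    """Return the byte lengths of ch in UTF-8, UTF-16 LE and UTF-32 LE."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return tuple(len(ch.encode(enc)) for enc in encodings)


for ch, description in samples.items():
    utf8, utf16, utf32 = encoded_sizes(ch)
    print(f"{description}: UTF-8 {utf8}, UTF-16 {utf16}, UTF-32 {utf32} bytes")
```

For these samples UTF-8 ranges from one to four bytes, UTF-16 needs two or four, and UTF-32 always needs four.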

General

In UTF-8, each Unicode character is assigned a specially encoded byte sequence of variable length. UTF-8 supports byte sequences up to four bytes long, onto which, as with all UTF formats, every Unicode character can be mapped.

UTF-8 plays a key role as a global character encoding on the Internet. The Internet Engineering Task Force requires of all new Internet communication protocols that the character encoding be declared and that UTF-8 be one of the supported encodings. The Internet Mail Consortium (IMC) recommends that all e-mail programs be able to display and send UTF-8. As of 2008, however, this recommendation was still not followed globally.

In HTML, the markup language used in web browsers, UTF-8 is likewise increasingly used to represent language-specific characters, replacing the HTML entities used previously.

Properties

  • Multi-byte character encoding (MBCS) similar to CP950/CP936/CP932 (Chinese/Japanese), but without the (then important and useful) property that characters displayed at double width are two bytes long
  • 7-bit ASCII text is unchanged in UTF-8, and UTF-8 is highly compatible with the earlier 8-bit character sets
  • Trailing bytes are never 7-bit ASCII characters (this allows processing and parsing with the usual 7-bit character constants)
  • Relatively compact, especially for European characters; somewhat less so for, for example, Chinese characters at higher code positions; often much more compact than UTF-16 (Windows)
  • Sort order is preserved: two UTF-8 strings sort in the same order as the two corresponding unencoded Unicode strings
  • Searchable in both directions (not the case with earlier MBCS)
  • Simple transcoding function (also easy to implement in hardware)
  • Ample encoding reserve (in case the Unicode standard still changes)
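
The sort-order property in the list above can be illustrated with a minimal Python sketch. The word list is an arbitrary assumption; the sketch relies on the fact that Python compares str values by code point and bytes values byte by byte. UTF-16 is added only as a contrast.

```python
# Arbitrary sample strings mixing ASCII, Latin-1, CJK and a non-BMP character.
words = ["zebra", "Äpfel", "apple", "中文", "\uff5e", "𝄞", "Я"]

# Python orders str values by Unicode code point.
by_code_point = sorted(words)

# Ordering the UTF-8 byte strings and decoding them again gives the same order.
by_utf8_bytes = [b.decode("utf-8") for b in sorted(w.encode("utf-8") for w in words)]

# UTF-16 does not share this property: surrogate code units (D800..DFFF)
# sort below BMP characters such as U+FF5E.
by_utf16_bytes = [
    b.decode("utf-16-be") for b in sorted(w.encode("utf-16-be") for w in words)
]

print(by_code_point == by_utf8_bytes)
print(by_code_point == by_utf16_bytes)
```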

Standardization

UTF-8 is currently defined identically by the IETF, the Unicode Consortium, and ISO in the following standards documents:

  • RFC 3629 / STD 63 (2003)
  • The Unicode Standard, Version 4.0, § 3.9 - § 3.10 (2003)
  • ISO/IEC 10646-1:2000 Annex D (2000)

These supersede older, partly differing definitions, some of which are still used by older software:

  • ISO/IEC 10646-1:1993 Amendment 2 / Annex R (1996)
  • The Unicode Standard, Version 2.0, Appendix A (1996)
  • RFC 2044 (1996)
  • RFC 2279 (1998)
  • The Unicode Standard, Version 3.0, § 2.3 (2000) and Corrigendum #1: UTF-8 Shortest Form (2000)
  • Unicode Standard Annex #27: Unicode 3.1 (2001)

Encoding

Unicode characters with values in the range 0 to 127 (0 to 7F hexadecimal) are represented in UTF-8 as a single byte with the same value. All data that uses only genuine ASCII characters is therefore identical in both representations.
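
A minimal Python check of this identity, using an arbitrary ASCII-only sample string:

```python
# ASCII-only sample text (arbitrary choice).
text = "Plan 9 from Bell Labs"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# Both encodings produce the same bytes, and every byte is below 0x80.
print(ascii_bytes == utf8_bytes)
print(all(b < 0x80 for b in utf8_bytes))
```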

Unicode characters with values greater than 127 are encoded in UTF-8 as byte sequences two to four bytes long.

In theory the algorithm allows byte sequences up to eight bytes long and thus about four trillion characters. The final stage would consist of the first byte 11111111 followed by seven continuation bytes with six payload bits each, so the entire code space would comprise 2^(7 × 6) = 2^42 = 4,398,046,511,104 characters. What was actually defined originally was a sequence of a first byte of up to 1111110x followed by five continuation bytes of the form 10xxxxxx, i.e. six bytes with a total of 31 bits for the Unicode value they contain. In its use as a UTF encoding, however, it is restricted to the code space common to all Unicode encodings, i.e. 0 to 10FFFF hexadecimal (1,114,112 possibilities), and has byte sequences at most four bytes long. The value range available in the encoding is thus not fully used. Correspondingly long byte sequences and large values are now considered invalid and must be treated as such.
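
The encoding rule for the current four-byte limit can be sketched in Python as follows. The function name utf8_encode is a hypothetical helper introduced for illustration; the rejection of the range D800 to DFFF follows RFC 3629 and is not derived from the paragraph above. The result is compared against Python's built-in encoder.

```python
def utf8_encode(cp):
    """Encode one Unicode code point as UTF-8 (illustrative sketch)."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("outside the Unicode code space 0..10FFFF")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points are not valid in UTF-8")
    if cp < 0x80:  # 7 payload bits: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:  # 11 payload bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:  # 16 payload bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 21 payload bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


for cp in (0x41, 0xE4, 0x20AC, 0x10400, 0x10FFFF):
    print(f"U+{cp:04X}", utf8_encode(cp).hex(" "), utf8_encode(cp) == chr(cp).encode("utf-8"))
```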

The first byte of a UTF-8 encoded character is called the start byte; the further bytes are called continuation bytes. Start bytes always begin with 0 or 11, continuation bytes always with 10.

  • If the most significant bit of the first byte is 0, the byte is an ASCII character, since ASCII is a 7-bit encoding and the first 128 Unicode characters correspond to the ASCII characters. All ASCII strings are thus automatically upward compatible with UTF-8.
  • If the most significant bit of the first byte is 1, the byte belongs to a multi-byte character, i.e. a Unicode character with a character number greater than 127.
  • If the two highest bits of a byte are 11, it is the start byte of a multi-byte character; if they are 10, it is a continuation byte.
  • The lexical order by byte values corresponds to the lexical order by character numbers, since higher character numbers are encoded with correspondingly more 1 bits in the start byte.
  • In the start byte of a multi-byte character, the number of leading 1 bits gives the total number of bytes of the encoded Unicode character. Put differently, the number of 1 bits to the left of the highest 0 bit is the number of continuation bytes plus one. For example, in 1110xxxx 10xxxxxx 10xxxxxx, three 1 bits before the highest 0 bit mean three bytes in total, and the two 1 bits between the highest 1 bit and the highest 0 bit mean two continuation bytes.
  • Start bytes (0... or 11...) and continuation bytes (10...) can be clearly distinguished from one another. A byte stream can therefore be read starting in the middle without decoding problems, which is particularly important when recovering damaged data: bytes beginning with 10 are simply skipped until a byte beginning with 0 or 11 is found. This is an advantage of UTF-8; with encodings that lack this property, reading a data stream whose beginning is unknown may not be possible.
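
The bit patterns listed above can be sketched in Python as follows. The helper names sequence_length, is_continuation, and resynchronize are hypothetical and introduced only for illustration; sequence_length inspects the bit pattern alone and does not check whether the byte value is otherwise permitted.

```python
def sequence_length(start_byte):
    """Total number of bytes announced by a start byte, else None."""
    if start_byte & 0x80 == 0x00:  # 0xxxxxxx: ASCII, one byte
        return 1
    if start_byte & 0xE0 == 0xC0:  # 110xxxxx: two bytes
        return 2
    if start_byte & 0xF0 == 0xE0:  # 1110xxxx: three bytes
        return 3
    if start_byte & 0xF8 == 0xF0:  # 11110xxx: four bytes
        return 4
    return None  # continuation byte (10xxxxxx) or a longer, now invalid form


def is_continuation(byte):
    """True if the byte has the form 10xxxxxx."""
    return byte & 0xC0 == 0x80


def resynchronize(data, offset):
    """Skip continuation bytes from offset up to the next start byte."""
    while offset < len(data) and is_continuation(data[offset]):
        offset += 1
    return offset


data = "Höhe €".encode("utf-8")  # 48 C3 B6 68 65 20 E2 82 AC
start = resynchronize(data, 2)   # offset 2 is in the middle of "ö"
print(start, data[start:].decode("utf-8"))
```

Starting at offset 2, which falls inside the two-byte sequence for "ö", the sketch skips one continuation byte and resumes decoding at "he €".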

Notes:

  • In theory the same character can be encoded in different ways (for example "a" as 01100001 or, incorrectly, as 11000001 10100001). However, only the shortest possible encoding is allowed. This has repeatedly caused problems, with programs crashing on invalid encodings, interpreting them as valid, or simply ignoring them. The combination of the last two behaviors led, for example, to firewalls that failed to recognize dangerous content because of its invalid encoding, while the client to be protected interpreted that encoding as valid and was thereby put at risk.
  • When a character takes several bytes, the bits are right-aligned: the least significant bit of the Unicode character is always in the lowest bit of the last UTF-8 byte.
  • Originally there were also encodings with more than four octets (up to six). They have been excluded, however, since Unicode contains no corresponding characters and ISO 10646 was aligned with Unicode in its possible character range.
  • For all scripts based on the Latin alphabet, UTF-8 is a particularly space-saving method of representing Unicode characters.
  • The Unicode ranges U+D800 to U+DBFF and U+DC00 to U+DFFF are expressly not characters; they serve only in UTF-16 to encode characters outside the Basic Multilingual Plane and were formerly referred to as high and low surrogates, respectively. Consequently, byte sequences corresponding to these ranges are not valid UTF-8. For example, U+10400, which is represented in UTF-16 as D801 DC00, must be expressed in UTF-8 as F0 90 90 80 and not as ED A0 81 ED B0 80. Java supports this since version 1.5. Because the incorrect encoding is widespread, especially in databases, it was subsequently standardized as CESU-8.
  • UTF-8, UTF-16, and UTF-32 each encode the full range of Unicode.
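
The byte sequences named in the notes above can be checked against a strict decoder. The sketch below uses Python's built-in "utf-8" codec, which rejects overlong forms and encoded surrogates; the helper name is_valid_utf8 is hypothetical.

```python
def is_valid_utf8(data):
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


shortest_a = bytes([0b01100001])              # "a" in its shortest form
overlong_a = bytes([0b11000001, 0b10100001])  # "a" padded to two bytes
cesu8_10400 = bytes.fromhex("EDA081EDB080")   # surrogate pair D801 DC00, byte by byte
utf8_10400 = bytes.fromhex("F0909080")        # U+10400 as a four-byte sequence

for label, raw in [
    ("shortest a", shortest_a),
    ("overlong a", overlong_a),
    ("CESU-8 style U+10400", cesu8_10400),
    ("UTF-8 U+10400", utf8_10400),
]:
    print(label, raw.hex(" "), is_valid_utf8(raw))
```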

Because of the UTF-8 encoding rule, certain byte values are not allowed. The following table summarizes all 256 possibilities together with their use and validity. Byte values in red rows are inadmissible; green marks permitted bytes that represent a character directly. Blue rows contain the values that start a sequence of two or more bytes, which is then continued with bytes from the rows shaded orange.
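
The classification of the 256 byte values can be sketched in Python. The value ranges follow the four-byte limit and shortest-form rule of RFC 3629; the function name classify_byte and the category labels are assumptions made for this illustration.

```python
from collections import Counter


def classify_byte(b):
    """Classify a single byte value (0..255) by its role in UTF-8."""
    if b <= 0x7F:
        return "ascii"          # represents a character directly
    if b <= 0xBF:
        return "continuation"   # 10xxxxxx
    if b <= 0xC1:
        return "invalid"        # could only start an overlong two-byte form
    if b <= 0xDF:
        return "start-2"        # starts a two-byte sequence
    if b <= 0xEF:
        return "start-3"        # starts a three-byte sequence
    if b <= 0xF4:
        return "start-4"        # starts a four-byte sequence
    return "invalid"            # would encode values above 10FFFF


counts = Counter(classify_byte(b) for b in range(256))
for category, count in sorted(counts.items()):
    print(category, count)
```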

Some encoding examples for UTF-8 are given in the following table:
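
Comparable examples can be produced with a few lines of Python. The characters below are chosen here for illustration and are not necessarily those of the original table; the last one lies outside the 16-bit range. The helper name show is hypothetical, and bytes.hex with a separator argument assumes Python 3.8 or later.

```python
def show(ch):
    """Return code point, binary UTF-8 bytes and hexadecimal UTF-8 bytes."""
    encoded = ch.encode("utf-8")
    code_point = f"U+{ord(ch):04X}"
    binary = " ".join(f"{b:08b}" for b in encoded)
    return code_point, binary, encoded.hex(" ").upper()


for ch in ("y", "ä", "€", "𝄞"):
    code_point, binary, hexadecimal = show(ch)
    print(f"{ch}  {code_point:<8} {binary:<36} {hexadecimal}")
```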

The last example lies outside the code range originally contained in Unicode (before version 2.0), which was 16 bits wide and is contained in the current Unicode version as the BMP (plane 0). Since many fonts do not yet cover these newer Unicode ranges, the characters they contain cannot be displayed correctly on many platforms. A substitute character is shown instead, serving as a placeholder.

Representation in editors

Byte Order Mark

Although the problem of differing byte orders cannot, in principle, occur in UTF-8 because of the way the encoding works, some programs insert a byte order mark (BOM) at the beginning of UTF-8 files. The BOM consists of the byte sequence EF BB BF, which text editors and browsers that are not UTF-8 capable usually display as the ISO-8859-1 string ï»¿ and which can cause compatibility problems.
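
A short Python sketch of this behavior follows. It uses the standard codecs.BOM_UTF8 constant and the "utf-8-sig" codec, which removes a leading BOM when decoding; the sample payload is arbitrary.

```python
import codecs

# A file body that starts with the UTF-8 byte order mark.
data = codecs.BOM_UTF8 + "UTF-8".encode("utf-8")

as_latin1 = data.decode("iso-8859-1")   # what a non-UTF-8 tool displays
as_utf8 = data.decode("utf-8")          # BOM kept as the character U+FEFF
without_bom = data.decode("utf-8-sig")  # BOM removed while decoding

print(codecs.BOM_UTF8.hex(" ").upper())
print(as_latin1)
print(repr(as_utf8))
print(without_bom)
```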

Characters not included in the Unicode Basic Latin block

Letters of the English alphabet are displayed identically in UTF-8 and ISO-8859. Problems arise with other characters, such as umlauts.

An example using the German word Höhe ("height"):
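
Assuming the example word is the German Höhe ("height"), the effect of reading it with the wrong encoding can be reproduced in Python. The variable names are illustrative only.

```python
word = "Höhe"

# UTF-8 bytes interpreted as ISO-8859-1: the umlaut becomes two characters.
utf8_bytes = word.encode("utf-8")
utf8_read_as_latin1 = utf8_bytes.decode("iso-8859-1")

# ISO-8859-1 bytes interpreted as UTF-8: the byte F6 is not valid UTF-8,
# so a lenient decoder substitutes the replacement character U+FFFD.
latin1_bytes = word.encode("iso-8859-1")
latin1_read_as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

print(utf8_bytes.hex(" ").upper(), utf8_read_as_latin1)
print(latin1_bytes.hex(" ").upper(), repr(latin1_read_as_utf8))
```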

