Unicode Transformation Format

A Unicode Transformation Format (UTF), also called a UCS Transformation Format, is a method of mapping Unicode characters to sequences of bytes.

Several transformation formats exist for representing Unicode characters in electronic data processing. Each of them can represent all 1,114,112 code points of the Unicode standard, and text in any of these formats can be converted losslessly into any other UTF.
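
The lossless conversion can be illustrated with a short Python sketch. The sample string and variable names are chosen here for illustration only; the codec names are those of Python's standard library.

```python
import sys

# Sample text mixing Latin, Cyrillic, Devanagari and Chinese characters.
text = "de:Veränderung mk:Промена ne:चांजे zh:变化"

# Convert UTF-8 -> UTF-16LE -> UTF-32BE and back to a string.
as_utf8 = text.encode("utf-8")
as_utf16 = as_utf8.decode("utf-8").encode("utf-16-le")
as_utf32 = as_utf16.decode("utf-16-le").encode("utf-32-be")
round_tripped = as_utf32.decode("utf-32-be")

# Code points run from 0 to sys.maxunicode (0x10FFFF) inclusive.
total_code_points = sys.maxunicode + 1

print(round_tripped == text)
print(total_code_points)
```

Each conversion step goes through the decoded string, so no information is lost regardless of the order in which the formats are chained.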

The formats differ in the space they occupy on storage media (storage efficiency), in the encoding and decoding effort they require (run-time behavior), and in their compatibility with other, older encodings such as ASCII. For example, some formats allow very efficient random access to individual characters within a string, while others are more economical with memory. When choosing a Unicode Transformation Format, it is therefore necessary to determine which one best suits the intended field of application.
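
As a rough illustration of this trade-off, the following Python sketch compares encoded sizes and shows offset-based access in a fixed-width format. The helper names `encoded_sizes` and `char_at_utf32` are hypothetical and the sample words are arbitrary.

```python
samples = {
    "German": "Veränderung",
    "Macedonian": "Промена",
    "Chinese": "变化",
}

def encoded_sizes(text):
    """Return the byte length of text in each encoding (without BOM)."""
    return {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}

def char_at_utf32(data, index):
    """Fetch character number `index` from UTF-32LE bytes by offset arithmetic."""
    return data[4 * index:4 * index + 4].decode("utf-32-le")

for name, text in samples.items():
    print(name, encoded_sizes(text))
```

With a fixed width of four bytes, character number i always starts at byte offset 4·i. A variable-width format would instead have to be scanned from the beginning to find the same character.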

UTF-8, UTF-16 and UTF-32

  • UTF-32 always encodes a character in exactly 32 bits. This makes it the simplest format, since characters have a fixed length and no elaborate algorithm is needed, but it comes at the cost of memory: if only characters from the ASCII character set are used, four times as much space is required as with ASCII. Depending on the byte order, that is, whether the least significant or the most significant byte is transmitted first, the format is called little-endian (UTF-32LE) or big-endian (UTF-32BE).
  • UTF-16 is the oldest of these encodings. It uses one or two 16-bit units (2 or 4 bytes) to encode a character. Here too, the byte order distinguishes two variants, UTF-16LE and UTF-16BE. For many languages written in non-Latin scripts this is the most space-saving option, since 2 bytes per character usually suffice.
  • UTF-8 encodes characters with a variable number of bytes, from 1 to 4 per Unicode character. The code points 0 to 127, which correspond to the ASCII character set, are encoded in one byte whose most significant bit is always 0. A set eighth bit marks a byte as part of a longer sequence of 2, 3 or 4 bytes. This makes UTF-8 the most space-efficient format for scripts based on the Latin alphabet. Many non-Latin scripts, such as Devanagari or Chinese, need 3 bytes per character, more than in UTF-16; others, such as Cyrillic, need 2.
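
The variable-length schemes described above can be sketched in Python as follows. The function names `utf8_encode_code_point` and `utf16_code_units` are hypothetical, and the code is a minimal illustration rather than a replacement for the built-in codecs.

```python
def utf8_encode_code_point(cp):
    """Encode one code point as UTF-8 bytes (sketch; rejects surrogates)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp < 0x80:                      # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),   # 4 bytes: 11110xxx and three 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

def utf16_code_units(cp):
    """Return the one or two 16-bit units that encode a code point in UTF-16."""
    if cp < 0x10000:
        return [cp]
    cp -= 0x10000
    return [0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)]

print(utf8_encode_code_point(0xE4).hex(" "))        # ä
print([hex(u) for u in utf16_code_units(0x1F600)])  # a character outside the 16-bit range
```

UTF-32 needs no such function, since the code point itself is stored as a 32-bit number in the chosen byte order.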

All three formats can be transmitted or stored with or without an unambiguous signature at the beginning, the byte order mark (BOM). The BOM helps with correct identification, especially when files are edited with different programs on different computer systems. If everything has been clearly agreed in advance, or the encoding is communicated by other means (such as "charset" in HTML), the BOM can be omitted.
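
A BOM check might look like the following Python sketch. The BOM constants come from the standard `codecs` module; the function name `detect_bom` and its return convention are assumptions made for this example.

```python
import codecs

# Longer BOMs are tested first, because the UTF-32LE BOM (FF FE 00 00)
# begins with the UTF-16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def detect_bom(data):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

sample = codecs.BOM_UTF16_BE + "de:Veränderung".encode("utf-16-be")
encoding, skip = detect_bom(sample)
print(encoding, sample[skip:].decode(encoding))
```

The check is only a heuristic: UTF-16LE data that starts with a BOM followed by the character U+0000 is indistinguishable from a UTF-32LE BOM, and data without any BOM still needs out-of-band information.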

Examples
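
The byte sequences in this section can be regenerated with a few lines of Python. The helper name `byte_rows` is hypothetical; it relies on `bytes.hex` with a separator argument, which requires Python 3.8 or later.

```python
ENCODINGS = ("utf-32-le", "utf-32-be", "utf-16-le", "utf-16-be", "utf-8")

def byte_rows(text):
    """Map each encoding name to a hex dump with one ' | '-separated group per character."""
    return {
        enc: " | ".join(ch.encode(enc).hex(" ").upper() for ch in text)
        for enc in ENCODINGS
    }

for enc, row in byte_rows("zh:变化").items():
    print(enc, row)
```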

de:Veränderung - German

Character | d | e | : | V | e | r | ä | n | d | e | r | u | n | g
UTF-32LE | 64 00 00 00 | 65 00 00 00 | 3A 00 00 00 | 56 00 00 00 | 65 00 00 00 | 72 00 00 00 | E4 00 00 00 | 6E 00 00 00 | 64 00 00 00 | 65 00 00 00 | 72 00 00 00 | 75 00 00 00 | 6E 00 00 00 | 67 00 00 00
UTF-32BE | 00 00 00 64 | 00 00 00 65 | 00 00 00 3A | 00 00 00 56 | 00 00 00 65 | 00 00 00 72 | 00 00 00 E4 | 00 00 00 6E | 00 00 00 64 | 00 00 00 65 | 00 00 00 72 | 00 00 00 75 | 00 00 00 6E | 00 00 00 67
UTF-16LE | 64 00 | 65 00 | 3A 00 | 56 00 | 65 00 | 72 00 | E4 00 | 6E 00 | 64 00 | 65 00 | 72 00 | 75 00 | 6E 00 | 67 00
UTF-16BE | 00 64 | 00 65 | 00 3A | 00 56 | 00 65 | 00 72 | 00 E4 | 00 6E | 00 64 | 00 65 | 00 72 | 00 75 | 00 6E | 00 67
UTF-8 | 64 | 65 | 3A | 56 | 65 | 72 | C3 A4 | 6E | 64 | 65 | 72 | 75 | 6E | 67

mk:Промена - Macedonian, written in the Cyrillic alphabet

Character | m | k | : | П | р | о | м | е | н | а
UTF-32LE | 6D 00 00 00 | 6B 00 00 00 | 3A 00 00 00 | 1F 04 00 00 | 40 04 00 00 | 3E 04 00 00 | 3C 04 00 00 | 35 04 00 00 | 3D 04 00 00 | 30 04 00 00
UTF-32BE | 00 00 00 6D | 00 00 00 6B | 00 00 00 3A | 00 00 04 1F | 00 00 04 40 | 00 00 04 3E | 00 00 04 3C | 00 00 04 35 | 00 00 04 3D | 00 00 04 30
UTF-16LE | 6D 00 | 6B 00 | 3A 00 | 1F 04 | 40 04 | 3E 04 | 3C 04 | 35 04 | 3D 04 | 30 04
UTF-16BE | 00 6D | 00 6B | 00 3A | 04 1F | 04 40 | 04 3E | 04 3C | 04 35 | 04 3D | 04 30
UTF-8 | 6D | 6B | 3A | D0 9F | D1 80 | D0 BE | D0 BC | D0 B5 | D0 BD | D0 B0

Nepali uses the alphasyllabic Devanagari script. Each syllable corresponds to one written unit: a small number of base characters are modified by vowel signs, which yields different syllables. (In a similar way, an e with an acute accent can be typed on a computer, except that the computer turns it into é, which is a character of its own in Unicode. The Nepali signs, by contrast, remain composed in Unicode as well. The circle is a placeholder for the base character to which the sign attaches.) The example word thus consists of two base characters, one modified twice and one modified once. This is in contrast to Chinese, which has many different syllable characters. Unicode also has modifying characters in other scripts, for example Hebrew.

ne:चांजे - Nepali

Character | n | e | : | च | ◌ा | ◌ं | ज | ◌े
UTF-32LE | 6E 00 00 00 | 65 00 00 00 | 3A 00 00 00 | 1A 09 00 00 | 3E 09 00 00 | 02 09 00 00 | 1C 09 00 00 | 47 09 00 00
UTF-32BE | 00 00 00 6E | 00 00 00 65 | 00 00 00 3A | 00 00 09 1A | 00 00 09 3E | 00 00 09 02 | 00 00 09 1C | 00 00 09 47
UTF-16LE | 6E 00 | 65 00 | 3A 00 | 1A 09 | 3E 09 | 02 09 | 1C 09 | 47 09
UTF-16BE | 00 6E | 00 65 | 00 3A | 09 1A | 09 3E | 09 02 | 09 1C | 09 47
UTF-8 | 6E | 65 | 3A | E0 A4 9A | E0 A4 BE | E0 A4 82 | E0 A4 9C | E0 A5 87

zh:变化 - Chinese

Character | z | h | : | 变 | 化
UTF-32LE | 7A 00 00 00 | 68 00 00 00 | 3A 00 00 00 | D8 53 00 00 | 16 53 00 00
UTF-32BE | 00 00 00 7A | 00 00 00 68 | 00 00 00 3A | 00 00 53 D8 | 00 00 53 16
UTF-16LE | 7A 00 | 68 00 | 3A 00 | D8 53 | 16 53
UTF-16BE | 00 7A | 00 68 | 00 3A | 53 D8 | 53 16
UTF-8 | 7A | 68 | 3A | E5 8F 98 | E5 8C 96

Other Unicode encodings

The Unicode standard itself defines only UTF-32, UTF-16 and UTF-8. In addition, there are other encodings that can likewise encode all Unicode characters.

UTF-1

UTF-1 was the first 8-bit encoding for Unicode, but it failed to gain acceptance because of several weaknesses.

UTF-7

UTF-7 is an obsolete format that encodes Unicode characters as printable ASCII characters, each of which needs only the lower 7 bits of a byte, hence the name. It was intended for transmitting Unicode text over 7-bit channels (for example e-mail or Usenet), but failed to gain acceptance. Instead, this purpose is mostly served by UTF-8 combined with a MIME transfer encoding such as Base64 or quoted-printable, or by UTF-8 sent directly over an 8-bit channel.
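
Python's standard library still ships a `utf-7` codec, which makes the 7-bit property easy to observe. The following is a small sketch with an arbitrary sample string, not a recommendation to use the format.

```python
text = "de:Veränderung"

# Encode with the standard-library UTF-7 codec and decode again.
encoded = text.encode("utf-7")
decoded = encoded.decode("utf-7")

# Every byte of the encoded form fits into 7 bits.
seven_bit_clean = all(byte < 0x80 for byte in encoded)

print(encoded)
print(seven_bit_clean, decoded == text)
```

The ASCII letters pass through unchanged, while the non-ASCII character ä is replaced by a short run of ASCII characters.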

UTF-EBCDIC

UTF-EBCDIC is a Unicode encoding based on the proprietary 8-bit EBCDIC format of IBM mainframes, comparable to the way UTF-8 is based on ASCII.

It encodes the first 160 characters (65 control characters and 95 graphic characters) in one byte at their usual EBCDIC positions, where such positions exist. The remaining Unicode repertoire is encoded, analogously to UTF-8, in two to five bytes (or up to seven for code positions that already cannot be represented in UTF-16 and will therefore probably never be assigned characters), using byte values that the various EBCDIC code pages assign to differing graphic characters. The BOM is the four-byte sequence DD 73 66 73 (hexadecimal). Depending on its position, the same character may be encoded in fewer or more bytes than in UTF-8.

It was developed to make it easier to process Unicode data in existing mainframe applications. In practice, UTF-EBCDIC is rarely used, even on mainframes.

EBCDIC-based mainframe operating systems such as z/OS generally use UTF-16 instead. UTF-16 is supported, for example, by components such as DB2, COBOL, PL/I, Java, and the IBM XML Toolkit.

UTF-5, UTF-6, UTF-9 and UTF-18

UTF-5 and UTF-6 were proposed for use in Internationalized Domain Names (IDN). Punycode was standardized in their place, however. UTF-9 and UTF-18 were an April Fools' joke, although in principle they could be implemented on computers with 9-bit bytes.
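
For comparison, the Punycode form of a domain label can be inspected with the `punycode` and `idna` codecs in Python's standard library. The label and domain below are illustrative examples only.

```python
label = "bücher"

# Raw Punycode of a single label: the ASCII letters first, then the encoded rest.
puny = label.encode("punycode")

# The idna codec applies Punycode per label and adds the "xn--" prefix.
domain = "bücher.example".encode("idna")

print(puny)
print(domain)
print(puny.decode("punycode") == label)
```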

SCSU

The Standard Compression Scheme for Unicode (SCSU) is an encoding aimed primarily at a small memory footprint. It can represent all Unicode characters, and for most languages one byte per character suffices. Unlike other encodings, it allows the same text to be encoded in many different ways. In practice, however, SCSU failed to gain acceptance.
