Character encoding

A character encoding (encoding for short) is the unambiguous assignment of characters (letters or digits) and symbols within a character set. In information processing, characters are encoded by a numerical value in order to make them suitable for transmission or storage. The German umlaut Ü, for example, is encoded in the ISO 8859-1 character set with the decimal value 220. In the EBCDIC character set, the value 220 encodes the curly bracket }. For the correct representation of a character, the character encoding must therefore be known; the numerical value alone is not sufficient.
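This dependence on the character set can be illustrated with a minimal Python sketch; since the article does not name a specific EBCDIC variant, the cp500 codec is used here merely as a stand-in for one common EBCDIC code page:

  # One and the same byte value is a different character in each character set.
  raw = bytes([220])                # decimal 220 = hexadecimal 0xDC
  print(raw.decode("iso-8859-1"))   # 'Ü' in ISO 8859-1
  print(raw.decode("cp500"))        # a different character in this EBCDIC variant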

The numerical values of character encodings can be transmitted in various ways, for example by optical, acoustic, or electrical signals, usually as sequences of

  • long and short signals (for example in Morse code),
  • high and low tones (for example in the acoustic transmission of fax machines).

Binary systems have always had special significance here, since the risk of confusion increases with the number of basic elements of a code.

History

The origins of this technique lie in antiquity. For example, Agamemnon is said to have signalled to his troops from a ship with the light of a fire that he wanted to begin the invasion of Troy. Smoke signals among the North American Indians and message transmission by drum signals in Africa are also well known.

Techniques for communication between the ships of a convoy were later refined, particularly in seafaring. For communication within his squadron on the voyage to South America in 1617, Sir Walter Raleigh invented a kind of precursor of flag signalling.

In 1648 it was finally England's later King James II who introduced the first signal flag system in the British Navy.

After the invention of telegraphy, a character encoding was needed here as well. From the original ideas of the Englishman Alfred Brain arose the original Morse code in 1837 and the modified Morse code in 1844.

The CCITT (Comité Consultatif International Télégraphique et Téléphonique) was finally the first institution to define a standardized character set. It based this character set on the Baudot code, a 5-bit code alphabet developed in 1870 by Jean-Maurice-Émile Baudot for his synchronous telegraph, whose principle is still used today.

Computers and data exchange

With the development of the computer, the character encoding that had in principle been binary since the Baudot code began to be implemented as bit sequences; internally, different voltage values usually serve as the distinguishing criterion, entirely analogous to the pitch or signal duration previously used for this purpose.

In order to assign displayable characters to these bit sequences, translation tables, known as character sets or charsets, had to be defined. In 1963, a first 7-bit ASCII code was defined by the ASA (American Standards Association) in order to standardize character encoding. Although IBM had taken part in this definition, it introduced its own 8-bit character code, EBCDIC, in 1964. Both are still in use in computer technology today.
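The difference between the two codes can be seen, for example, in the letter A, which ASCII encodes as 65 (0x41) and EBCDIC as 193 (0xC1). A small Python sketch, again using cp500 as one common EBCDIC variant:

  text = "A"
  print(text.encode("ascii"))   # b'A'    = byte value 65 (0x41) in ASCII
  print(text.encode("cp500"))   # b'\xc1' = byte value 193 (0xC1) in EBCDIC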

Since many languages need different diacritical marks, with which letters of the Latin alphabet are modified, there are separate character sets for many groups of languages. With the ISO 8859 series of standards, the ISO has standardized character encodings for all European languages (including Turkish) as well as Arabic, Hebrew, and Thai.

In 1991, the Unicode Consortium finally published a first version of the standard of the same name, whose goal is to define all characters of all languages in code form. Unicode is also the international standard ISO 10646.

Before a text is processed electronically, the character set and the character encoding used must be specified. The following pieces of information serve this purpose, for example:

  • Specification of the character set in an HTML page (see the example below)
  • Specification of the character set in the header of an e-mail or an HTTP packet:

  Content-Type: text/plain; charset="ISO-8859-1"
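In an HTML page, the corresponding declaration is made with a meta element in the document head; a typical form is:

  <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

HTML5 also permits the shorter notation <meta charset="UTF-8">.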

The mere presence of software for encoding and decoding does not guarantee correct display on a computer screen. For this, a font that contains the characters of the character set must also be available.

Differentiation of terms through the introduction of Unicode

With the introduction of Unicode and the associated need to represent characters by more than one byte, more precise terms became necessary. In German, the terms Zeichensatz (character set), Code, Kodierung, and Encoding are currently used sometimes interchangeably and sometimes with distinct meanings. In English, clear differentiations already exist:

A character set (character set or character repertoire) is a set S of distinct characters.

A character code (CCS, coded character set) is a character set S together with an injective mapping from the characters in S to a code set M (a finite subset of the natural numbers). The set M is also called the codespace, and the sets S and M together with the mapping from S to M are also called a code page.
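These definitions can be illustrated with a small Python sketch using a made-up miniature code; the names S, code, and M are chosen purely for illustration and do not correspond to any real standard:

  # A made-up miniature character code: an injective mapping from the
  # character set S into a code set M.
  S = {"A", "B", "C"}
  code = {"A": 1, "B": 2, "C": 3}     # injective: no code point is used twice
  M = set(code.values())              # the codespace
  text = "ABBA"
  encoded = [code[c] for c in text]   # a text as a sequence of numbers from M
  print(encoded)                      # prints [1, 2, 2, 1]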

A code point (codepoint or encoded character) is an element of the code set M that identifies its associated character from S. Texts are represented by the code points of their characters, i.e. as a sequence of numbers from M.

Furthermore, it must be specified how the code points are represented in the computer. An encoding form (character encoding form, CEF) or encoding scheme (character encoding scheme, CES), for neither of which a good German translation is known, denotes a method of expressing each code point as a sequence of one or more bytes. Such a sequence of bytes that represents a code point, and thus a character, is called a code unit. In simple cases there are no more than 256 = 2⁸ code points, so that each code point can be stored in one byte, as frequently occurs, for example, when using one of the character codes defined in ISO 8859.

When Unicode is used, this is no longer possible, since S contains far more than 256 characters. One then uses, for example, UTF-16, in which the code points from 0 to 2¹⁶ − 1 are stored in two bytes and all higher ones in four bytes. A distinction is made between UTF-16BE (big-endian) and UTF-16LE (little-endian), which differ in the order of the bytes within a code unit.

UTF-32 always uses four bytes per code point, while UTF-8 uses one or more bytes depending on the code point: the code points 0 to 127 are stored in a single byte, so this representation saves space in most English and European texts, since characters with these code points (the ASCII characters) are still by far the most common. Other techniques are SCSU, BOCU, and Punycode. Complex schemes can switch between several versions (ISO/IEC 2022).
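The different space requirements can be checked in Python, for example, where the encode method returns the byte sequence of the respective encoding:

  for ch in ("a", "ü", "山", "𝄞"):
      for enc in ("utf-8", "utf-16-be", "utf-32-be"):
          print(ch, enc, len(ch.encode(enc)), "bytes")

For "a", UTF-8 needs one byte, UTF-16 two, and UTF-32 four; for the musical symbol 𝄞 (U+1D11E), UTF-8 and UTF-16 each need four bytes.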

To specify the order of the bytes within a code unit unambiguously, a BOM (byte order mark) is often prepended (0xEF 0xBB 0xBF in UTF-8; 0xFF 0xFE in UTF-16LE; 0xFE 0xFF in UTF-16BE).
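The three byte order marks mentioned can be inspected directly in Python, for example:

  import codecs
  print(codecs.BOM_UTF8.hex())      # 'efbbbf'
  print(codecs.BOM_UTF16_LE.hex())  # 'fffe'
  print(codecs.BOM_UTF16_BE.hex())  # 'feff'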

A glyph is a graphical representation of a single character.

Example: The Chinese character for mountain, shān 山, has the Unicode code point U+5C71 and requires 15 bits for its representation. With UTF-16 as the CEF, it is stored as a single code unit. With a big-endian CES, the bytes 5C 71 are in memory; with little-endian, 71 5C. With UTF-8, the three code units E5 B1 B1 are in memory. The glyph is 山.
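The numbers from this example can be reproduced in Python, for instance:

  ch = "山"
  print(hex(ord(ch)))                  # '0x5c71' — the code point U+5C71
  print(ch.encode("utf-16-be").hex())  # '5c71' (big-endian)
  print(ch.encode("utf-16-le").hex())  # '715c' (little-endian)
  print(ch.encode("utf-8").hex())      # 'e5b1b1'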

To reassure the confused reader, it should be noted that the vast majority of texts are stored in one of the three Unicode encodings UTF-8, UTF-16BE, or UTF-16LE, which considerably simplifies working with texts.
