Chinese character encoding

Chinese character encodings (Chinese 汉字编码方法, Pinyin Hànzì biānmǎ fāngfǎ) map Chinese characters to sequences of bytes for processing and storage in a computer. All Chinese character encodings also include an encoding of the ASCII characters.

There is probably no other language or script for which as many encoding and input methods exist as for Chinese. According to statistics, the number of coding schemes proposed for the input of Chinese characters exceeds five hundred. Only about 40 to 50 of these have been implemented in software and formally tested on a computer, and no more than ten have been commercialized and are in common use.

This is obviously related to the large number of Chinese characters and their complicated shapes. It is also directly connected with the facts that there are many dialects in China, that spoken language and writing do not match in the various regions, and that the standard language is not yet sufficiently widespread.

Coding and input

Most coding methods for Chinese characters entered with the keyboard can be broadly divided into four categories:

  • "Flowing coding" (流水码, Liúshuǐmǎ)
  • Coding according to the shape of the character (字形码, Zìxíngmǎ)
  • Encoding according to the sound of the character (字音码, Zìyīnmǎ)
  • Coding according to sound and shape of the character (形音码, Xíngyīnmǎ, or 音形码, Yīnxíngmǎ).

Liushui coding

Also called 无理码, wúlǐmǎ ("unreasoned coding").

Normally, Arabic numerals or Roman letters are used to encode the Chinese characters. A typical Liushui coding was the Sima dianbao, a telegraph code used by the Ministry of Posts and Telecommunications. In principle, the numbers from 0001 to 9999 can encode nearly ten thousand characters. The code can be used to write telegrams, but the Ministry of Posts and Telecommunications used it as a coding method for Chinese characters.

The Guojia Biaozhun (English: national standard) "Basic set of coded Chinese characters for information interchange" (GB 2312-80) encodes 6763 Chinese characters at positions 1601 to 8794, in the manner of a Liushui code. This code is known as 区位码, Qūwèimǎ (zone code). For the two characters 中国 (Zhōngguó, English: China), the telegraph code is 0022 and 0948, and the zone code is 5448 and 2590.
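The zone codes quoted above can be checked with a few lines of Python. This is a minimal sketch: it assumes the common byte form of GB 2312 (as produced by Python's `gb2312` codec), in which each of the two bytes is the zone or position number plus 0xA0. The function name `quwei` and the small telegraph table are made up for illustration.

```python
def quwei(char: str) -> str:
    """Return the four-digit zone code (zone + position) of a GB 2312 character."""
    raw = char.encode("gb2312")  # raises UnicodeEncodeError if not in GB 2312
    if len(raw) != 2:
        raise ValueError("not a two-byte GB 2312 character")
    zone = raw[0] - 0xA0      # 区, first two digits
    position = raw[1] - 0xA0  # 位, last two digits
    return f"{zone:02d}{position:02d}"


# Telegraph codes as quoted in the text; a hypothetical two-entry table,
# since a Liushui code has no rule from which the numbers could be derived.
TELEGRAPH = {"中": "0022", "国": "0948"}

for ch in "中国":
    print(ch, "telegraph:", TELEGRAPH[ch], "zone code:", quwei(ch))
```

The telegraph code has to be stored as a table, whereas the zone code can be read directly off the encoded bytes.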

Coding according to the shape of the character

The coding according to the shape of the characters can be divided into three types: coding by the shape of the strokes, coding by the roots of the character, and coding by characteristic features of the character.

Encoding for the shape of the strokes

The encoding for the shape of the strokes uses the basic strokes as input units.

Li Jinkai's eight-stroke coding is a typical encoding for the shape of the strokes. It divides the strokes of Chinese characters into eight kinds, "一" Heng, "丨" Shu, "丿" Pie, "丶" Dian, Zhe, Wan, Cha and Fang, and encodes them with the digits one to eight. For example, the encoding of the two characters 中国 is 82 and 81714.

The stroke coding in the Wubizixing code is the "method of the divided character". The strokes "一" Heng, "丨" Shu, "丿" Pie, Na and Zhe are encoded using the numbers one to five.

Coding for the root of the character

This is also called radical coding or structural coding. It uses the radicals of Chinese characters as input units.

Wang Yongmin's Wubizixing code is typical of a coding by the roots of the character. He compiled 130 basic root characters and arranged them on the keyboard, about six roots per key, so that each key is used for more than one root. The key "L", for example, stands for 车, 力, 甲, 田, 四, 口. To enter a character, one presses the keys of the corresponding letter combination. If you press, for example, "khk" and "lgyi", the two characters 中国 are displayed on the screen.
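At its core, such a shape-based input method is a lookup from key sequences to characters. The following sketch illustrates that idea only. The table contains just the two entries mentioned in the text, and the names `CODE_TABLE`, `lookup` and `compose` are hypothetical. A real code table has thousands of entries and must also handle key sequences shared by several characters.

```python
from typing import Iterable, Optional

# Hypothetical two-entry code table, taken from the examples in the text.
CODE_TABLE = {"khk": "中", "lgyi": "国"}


def lookup(keys: str) -> Optional[str]:
    """Return the character for one key sequence, or None if it is unknown."""
    return CODE_TABLE.get(keys.lower())


def compose(key_sequences: Iterable[str]) -> str:
    """Turn a series of key sequences into the text they produce."""
    out = []
    for keys in key_sequences:
        char = lookup(keys)
        if char is None:
            raise KeyError(f"no character for key sequence {keys!r}")
        out.append(char)
    return "".join(out)


print(compose(["khk", "lgyi"]))
```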

Coding for traits of the character

Characters are encoded according to regularities in their contour features. An example is 角码, Jiǎomǎ (corner code): there are the three-corner coding of Wang An and the four-corner number coding of Wang Yunwu et al.

Encoding according to the sound of the character

The encoding according to the sound of the character is also called Pinyin input encoding, and is used in connection with intelligent input systems for Latin letters.

The characters are encoded by their pronunciation. Normally, the relevant factors are the initial sound, the final sound and the tone. The encoding according to the sound of the characters can be further divided into the types

  • "Full spelling" (全拼, Quánpīn),
  • "Double spelling" (双拼, Shuāngpīn) and
  • "Mixed spelling" (混拼, Hùnpīn).

Taking 中国, Zhōngguó, as an example: the "full spelling" is "zhongguo", so one enters eight letters. The double spelling is "vsgo", a code of four letters, in which "v" and "g" stand for the initial sounds "zh" and "g", and "s" and "o" stand for the final sounds "ong" and "uo". The mixed spelling is "jiaty", a code of five letters.

Of the above three varieties, only the "full spelling" follows the standardized romanization of Chinese (Pinyin). The double spelling and the mixed spelling were created by the designers of the respective code. The above examples of "double spelling" and "mixed spelling" are a natural code and a special design for the CCDOS system, respectively.
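The double spelling example can be expanded back to the full spelling mechanically, because every syllable is entered as exactly one key for the initial and one key for the final. The sketch below uses only the four key assignments mentioned in the text. The table and function names are hypothetical, and a real scheme assigns a key to every initial and final.

```python
# Only the key assignments quoted in the text; not a complete scheme.
INITIALS = {"v": "zh", "g": "g"}
FINALS = {"s": "ong", "o": "uo"}


def expand_double_spelling(code: str) -> str:
    """Expand a double-spelling code (two keys per syllable) to full Pinyin letters."""
    if len(code) % 2 != 0:
        raise ValueError("double spelling uses exactly two keys per syllable")
    syllables = []
    for i in range(0, len(code), 2):
        initial_key, final_key = code[i], code[i + 1]
        syllables.append(INITIALS[initial_key] + FINALS[final_key])
    return "".join(syllables)


print(expand_double_spelling("vsgo"))
```

The same key can mean different things in the first and second position of a syllable, which is why two separate tables are used.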

Coding according to sound and shape of the character

This encoding type is a combination of the coding according to the shape of the characters and the encoding according to the sound of the characters. It can be further divided into sound-shape encoding, shape-sound encoding, sound-meaning encoding and others.

Current use

Four ways of coding and entering Chinese characters were presented above. In current practice, those who speak standard Chinese and know Pinyin prefer the Pinyin input method. Those who speak a dialect tend to use a coding according to the shape of the character, and Wubizixing is accordingly dominant among most professional typists.

Coding on the Internet

When configuring a browser so that Chinese-language websites are displayed correctly, you will mostly encounter the following encodings:

Big5

The character encoding Big5 comes from Taiwan and is used for Traditional Chinese. ASCII characters are encoded in one byte and correspond to the normal ASCII encoding. Chinese characters are encoded in two bytes.

GB2312

The character encoding GB2312 is used for simplified Chinese. ASCII characters are encoded in one byte and correspond to the normal ASCII encoding. Chinese characters are encoded in two bytes.

GB18030

The character encoding GB18030 is an extension of GB2312 to the Unicode character set and is used for simplified Chinese. ASCII characters are encoded in one byte and correspond to the normal ASCII encoding. Chinese characters are encoded in two or four bytes. The version GB 18030-2000 defines 110,000 characters.
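The byte counts stated for these three encodings can be observed with Python's built-in codecs `big5`, `gb2312` and `gb18030`. This is a small sketch, and the helper name `byte_length` is made up for illustration. U+20000 serves as an example of a character that needs four bytes in GB18030.

```python
def byte_length(text: str, encoding: str) -> int:
    """Number of bytes the text occupies in the given encoding."""
    return len(text.encode(encoding))


rare = "\U00020000"  # a character outside the Basic Multilingual Plane

for encoding in ("big5", "gb2312", "gb18030"):
    print(encoding, "ASCII 'A':", byte_length("A", encoding),
          "| 中:", byte_length("中", encoding))

print("gb18030 U+20000:", byte_length(rare, "gb18030"))

# GB18030 keeps the two-byte GB2312 codes unchanged.
print("中".encode("gb2312") == "中".encode("gb18030"))
```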

Unicode

Unicode differs from the other Chinese character encodings in that it is not specific to either Simplified or Traditional Chinese. Instead, all Chinese, Japanese and Korean characters are covered and, as far as possible, unified by Han unification.

Unicode Transformation Formats

Unicode first assigns characters abstract numbers (code points). Their conversion into byte sequences is defined by the Unicode Transformation Formats:

  • In UTF-8, ASCII characters are encoded in one byte and Chinese characters in three or four bytes.
  • In UTF-16, ASCII characters are encoded in two bytes and Chinese characters in two or four bytes.
  • In UTF-32, all characters are encoded in four bytes without exception.
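The byte counts in this list can be reproduced as follows. This is a minimal sketch. It uses the codec variants with a fixed byte order (`utf-16-le`, `utf-32-le`) because Python's plain `utf-16` and `utf-32` codecs prepend a byte order mark that would distort the count. The helper name `utf_lengths` is hypothetical.

```python
def utf_lengths(char: str) -> dict:
    """Byte length of one character in UTF-8, UTF-16 and UTF-32 (without BOM)."""
    return {
        "UTF-8": len(char.encode("utf-8")),
        "UTF-16": len(char.encode("utf-16-le")),
        "UTF-32": len(char.encode("utf-32-le")),
    }


for label, char in [("ASCII 'A'", "A"),
                    ("中 (U+4E2D)", "中"),
                    ("U+20000", "\U00020000")]:
    print(label, utf_lengths(char))
```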

These Unicode Transformation Formats are also called encodings. They determine the size of the storage unit (1, 2 or 4 bytes) and, for units of more than one byte, the endianness, that is, the byte order (big endian or little endian).
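The effect of the byte order can be seen by encoding the same character, 中 (code point U+4E2D), in both variants. This is only an illustrative sketch using Python's standard codecs.

```python
char = "中"  # code point U+4E2D

# The same 16-bit or 32-bit unit, stored in the two possible byte orders.
for encoding in ("utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"):
    print(f"{encoding:10s}", char.encode(encoding).hex())

# UTF-8 works on single bytes, so there is no byte order to choose.
print(f"{'utf-8':10s}", char.encode("utf-8").hex())
```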

SIP

A large number of rarely used characters are assigned code points in the Supplementary Ideographic Plane, i.e. in the range U+20000 to U+2FFFF.
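Whether a character lies in this plane can be decided from its code point alone. The function name `in_sip` in the following sketch is made up for illustration.

```python
def in_sip(char: str) -> bool:
    """True if the character's code point lies in U+20000..U+2FFFF (plane 2)."""
    return 0x20000 <= ord(char) <= 0x2FFFF


print(in_sip("\U00020000"))  # first code point of the plane
print(in_sip("中"))           # U+4E2D, in the Basic Multilingual Plane
```

Equivalently, the plane number is the code point shifted right by 16 bits, which gives 2 for the whole range.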

Other Unicode ranges

Unicode also has ranges for Bopomofo, radicals, and special characters used in typography. The Latin letters with tone marks that are needed for Pinyin are either encoded individually or represented with the help of the range for combining diacritics.
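The two representations of Pinyin tone letters can be converted into each other with Unicode normalization. The sketch below uses Python's standard `unicodedata` module: NFD splits a letter such as "ō" into "o" plus a combining macron, and NFC recombines them.

```python
import unicodedata

word = "Zhōngguó"  # tone letters encoded individually (precomposed)

decomposed = unicodedata.normalize("NFD", word)    # base letter + combining mark
recomposed = unicodedata.normalize("NFC", decomposed)

print(len(word), len(decomposed))
print([f"U+{ord(c):04X}" for c in decomposed])
print(recomposed == word)
```

Both forms look the same on screen, but they differ in length and in their byte sequences, so text should be normalized before it is compared.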
