GB 18030

The Chinese character encoding standard GB18030 describes 27,484 characters of Chinese writing. Since 1 September 2001, it has been mandatory for all operating systems and programs sold in the People's Republic of China. It is the successor to the encoding standards GBK and GB2312 and covers both traditional and simplified characters. The official name is GB18030-2000; GB stands for Guojia Biaozhun (国家标准), which means national standard. The standard was published on 17 March 2000, and an update was released on 21 November 2000.

GB18030 can be regarded as the Chinese equivalent of UTF-8, because it contains code points for the entire Unicode range, including code points that are not yet assigned. Like UTF-8, it is an encoding that is backward compatible with ASCII; in addition, about one million Unicode code points are represented in its 4-byte range. In contrast to UTF-8, however, GB18030 also maintains compatibility with GB2312 and GBK: part of the mapping table is taken directly from GBK, and the rest is determined algorithmically. GB18030 also includes the characters from the Taiwanese Big5 encoding.
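The compatibility claims and the algorithmic part of the mapping can be illustrated with a short sketch. It relies on Python's built-in "gb18030", "gbk" and "gb2312" codecs. The helper name supplementary_to_gb18030 is hypothetical, and the arithmetic covers only the supplementary planes (U+10000 to U+10FFFF), assuming that the 4-byte sequence 90 30 81 30 corresponds to U+10000 and that the sequences continue linearly from there.

```python
def supplementary_to_gb18030(code_point: int) -> bytes:
    """Map a code point in U+10000..U+10FFFF to 4 bytes by arithmetic alone.

    Illustrative helper (not part of any library); assumes the linear
    layout in which 0x90 0x30 0x81 0x30 encodes U+10000.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary-plane code points are handled here")
    # Linear index counted from the first 4-byte sequence 0x81 0x30 0x81 0x30.
    index = (code_point - 0x10000) + (0x90 - 0x81) * 12600
    b1, rest = divmod(index, 12600)  # 12600 = 10 * 126 * 10
    b2, rest = divmod(rest, 1260)    # 1260 = 126 * 10
    b3, b4 = divmod(rest, 10)
    return bytes([0x81 + b1, 0x30 + b2, 0x81 + b3, 0x30 + b4])


def compatibility_examples() -> dict:
    """Collect a few encodings that show the ASCII/GB2312/GBK compatibility."""
    return {
        "ascii": "A".encode("gb18030"),        # single byte, same as ASCII
        "gb2312": "中".encode("gb2312"),        # two bytes in GB2312
        "gbk": "中".encode("gbk"),              # same two bytes in GBK
        "gb18030": "中".encode("gb18030"),      # same two bytes in GB18030
        "four_byte": "\U00010000".encode("gb18030"),  # outside the BMP
    }
```

Comparing supplementary_to_gb18030(cp) with chr(cp).encode("gb18030") for a few supplementary code points is a simple way to check the arithmetic against the codec. Characters inside the Basic Multilingual Plane are deliberately left out, because part of their mapping comes from the GBK table rather than from a formula.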

Most (Western) computer systems had already standardized on a variant of Unicode by the time GB18030 appeared. The common technical simplification of treating Unicode as fixed-width 16-bit units (UCS-2) could no longer be sustained after its release. Operating system vendors and programmers were, in effect, forced by a decree of the People's Republic to use either variable-width formats such as UTF-8 or UTF-16, or larger fixed-width formats such as UCS-4 or UTF-32. Microsoft took this step with Windows 2000; Linux had this support even before the introduction of GB18030.

The font SimSun (Founder Extended) can display all glyphs from GB18030, that is, the entire character repertoire of Unicode 2.1 plus the additional characters of the "Unicode CJK Unified Ideographs Extension A and B". Other well-known fonts with at least partial support (CJK Extension A) are SimSun 18030 and Code2000.

Structure of the characters

Sequences of one byte correspond to ASCII and range from 00hex to 7Fhex. Sequences of 2 bytes correspond to GBK (and therefore also GB2312) and consist of a lead byte in the range 81hex ... FEhex, followed by a byte in the range 40hex ... FEhex (excluding 7Fhex). Sequences of 4 bytes represent the Unicode characters not covered so far. Their first and third bytes are in the range 81hex ... FEhex, the second and fourth bytes in the range 30hex ... 39hex. In contrast to UTF-8, one therefore cannot assume that an octet in the range 30hex ... 7Ehex is an ASCII character; depending on its position, this byte value can have a different meaning.
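The byte ranges above are enough to cut a byte string into individual sequences without any mapping table. The following is a minimal sketch under that assumption; the function name split_gb18030 is hypothetical, and the code additionally rejects 7Fhex as a second byte, which is how the two-byte area is usually specified. It only determines sequence boundaries and does not check whether a sequence is actually assigned to a character.

```python
from typing import List


def split_gb18030(data: bytes) -> List[bytes]:
    """Split GB18030-encoded bytes into 1-, 2- and 4-byte sequences."""
    sequences = []
    i = 0
    while i < len(data):
        b1 = data[i]
        if b1 <= 0x7F:
            length = 1  # ASCII range
        elif 0x81 <= b1 <= 0xFE and i + 1 < len(data):
            b2 = data[i + 1]
            if 0x30 <= b2 <= 0x39:
                length = 4  # second byte is in the digit range
            elif 0x40 <= b2 <= 0xFE and b2 != 0x7F:
                length = 2
            else:
                raise ValueError(f"invalid second byte at offset {i + 1}")
        else:
            raise ValueError(f"invalid or truncated sequence at offset {i}")
        chunk = data[i:i + length]
        if len(chunk) != length:
            raise ValueError(f"truncated sequence at offset {i}")
        if length == 4 and not (0x81 <= chunk[2] <= 0xFE
                                and 0x30 <= chunk[3] <= 0x39):
            raise ValueError(f"invalid 4-byte sequence at offset {i}")
        sequences.append(chunk)
        i += length
    return sequences
```

For example, in the input b"\x81\x40@" the value 40hex occurs twice: once as the second byte of a 2-byte sequence and once as the ASCII character "@". This is the position dependence described above, and it is why a decoder has to scan from a known sequence boundary rather than inspect bytes in isolation.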