GBK

GBK (Chinese 汉字内码扩展规范, Guojia Biaozhun Kuozhan) is a Chinese character set. It extends GB2312 with traditional characters as well as characters that were simplified only after the introduction of GB2312 in 1981.

History

Unicode 1.1, published in 1993, contains 20,902 Chinese characters. The Chinese government subsequently published GB13000.1-93, which is identical one-to-one to Unicode 1.1. To bridge the gap between this standard and the older GB2312 (1980), GBK was introduced, which extends GB2312 with the characters from GB13000.1-93. Because GBK never became an official standard, it was never given a regular GB number. In 1995, GBK was extended by 95 more characters.

In Windows 95, GBK was adopted in unchanged form as code page 936. As a result, the use of GBK spread enormously and it became the de facto standard. Later, the euro sign was added to code page 936, which made the code page incompatible with GBK.

In most Windows variants, however, GBK is misleadingly labelled as GB2312. Only from Windows XP onward was the original GB2312 standard also offered in Windows, under code page number 20936 with the name "GB2312-80".

Since 2000, GBK has been officially superseded by GB18030.

Structure

GBK is a variable-width encoding of up to 16 bits, that is, a character consists of either one or two bytes. The characters in the range 0x00-0x7F are identical to ASCII and consist of one byte. Characters whose first byte lies in the range 0x81-0xFE, on the other hand, consist of two bytes.
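The lead-byte rule described above can be sketched in a few lines of Python. This is only an illustration under the assumptions stated in the text (one byte for 0x00-0x7F, two bytes when the first byte is 0x81-0xFE); the function name `split_gbk` is hypothetical, and the sketch does not check whether the second byte is a valid trail byte.

```python
def split_gbk(data: bytes) -> list:
    """Split GBK-encoded bytes into one byte sequence per character."""
    chars = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead <= 0x7F:
            width = 1          # ASCII range: single-byte character
        elif 0x81 <= lead <= 0xFE:
            width = 2          # lead byte of a two-byte character
        else:
            raise ValueError(f"invalid lead byte 0x{lead:02X} at offset {i}")
        if i + width > len(data):
            raise ValueError("truncated two-byte character at end of input")
        chars.append(data[i:i + width])
        i += width
    return chars


# Three ASCII letters followed by two Chinese characters.
encoded = "GBK编码".encode("gbk")
pieces = split_gbk(encoded)
widths = [len(p) for p in pieces]
```

Because the width of each character is decided by its first byte alone, the splitter has to walk the data from the start; it cannot be dropped into the middle of a byte string.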

Text encoded in GBK can only be scanned forward, because for an arbitrary byte it is not possible to tell whether it is the lead byte or the trail byte of a two-byte encoding. To tell them apart, the text has to be examined from the beginning. GBK shares this disadvantageous property with GB2312 and GB18030 as well as with other Asian encodings such as Shift-JIS (Japanese), Big5 (Traditional Chinese) and EUC-KR (Korean).

In GB2312, ASCII characters (byte values below 128) found by searching backward can still serve as a starting point for a forward analysis, since these values do not occur inside two-byte characters. In GBK this possibility is reduced to ASCII characters in the range 0 to 63, because byte values in the range 64 to 126 are also used as the trail byte of a two-byte character.

The Unicode transformation format UTF-8 avoids this problem. Although it needs up to four bytes per character, each byte shows unambiguously whether it is a single-byte character, the lead byte of a multibyte character, or a continuation byte of a multibyte character.
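The resynchronisation problem can be made concrete with a small Python sketch. It assumes, as stated in the text, that bytes below 0x40 (0 to 63) never occur inside a two-byte GBK character and can therefore act as safe restart points; the helper names `gbk_char_start` and `utf8_char_start` are hypothetical, and the input is assumed to be valid.

```python
def gbk_char_start(data: bytes, pos: int) -> int:
    """Return the offset of the first byte of the GBK character covering data[pos]."""
    # Walk backward to a safe restart point: a byte below 0x40 is always a
    # complete single-byte character. If none is found, offset 0 (the start
    # of the text) is used, which is always a character boundary.
    sync = pos
    while sync > 0 and data[sync] >= 0x40:
        sync -= 1
    # Scan forward from the restart point, character by character.
    i = sync
    while True:
        width = 2 if 0x81 <= data[i] <= 0xFE else 1
        if i + width > pos:
            return i
        i += width


def utf8_char_start(data: bytes, pos: int) -> int:
    """Return the offset of the first byte of the UTF-8 character covering data[pos]."""
    # Continuation bytes have the bit pattern 10xxxxxx, so a short backward
    # step is enough; no forward scan is needed.
    while (data[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


# U+4E02 is encoded in GBK as 0x81 0x40; the trail byte 0x40 is also ASCII "@".
gbk_data = "a丂丂b".encode("gbk")     # bytes: 61 81 40 81 40 62
utf8_data = "a丂b".encode("utf-8")    # bytes: 61 E4 B8 82 62

gbk_start = gbk_char_start(gbk_data, 4)     # offset 4 holds a trail byte 0x40
utf8_start = utf8_char_start(utf8_data, 3)  # offset 3 holds a continuation byte
```

In the GBK sample, neither the letters nor the trail bytes are below 0x40, so the backward search runs all the way to offset 0 before the forward scan can begin. The UTF-8 helper only ever steps back over the continuation bytes of a single character.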

The two-byte range is divided into eight levels:
