Standard Compression Scheme for Unicode

The Standard Compression Scheme for Unicode (SCSU) is a character encoding for Unicode text that, in contrast to most other encodings, is designed to require as little storage space as possible.


History

The encoding was originally developed by Reuters. The authors of the method described in the technical standard UTS #6 are Misha Wolf, Ken Whistler, Charles Wicksteed, Mark Davis, Asmus Freytag and Markus Scherer. The first version was published in May 1997; since May 2005 the standard has existed unchanged in revision 4.

Idea

Traditional character encodings such as the ISO 8859 character sets needed only one byte per character, and character sets for East Asian scripts needed two bytes. With Unicode, the storage requirement usually increases: UTF-32 needs four bytes per character, UTF-16 two or four bytes, and UTF-8 between one and four bytes. Yet typical texts use only a very small part of all available Unicode characters. Most characters come either from the ASCII range (especially punctuation) or from a small contiguous range that often corresponds to a Unicode block.

The algorithm therefore uses a dynamically positioned window comprising 128 consecutive characters. Characters in this window are encoded by one byte in the range 0x80 to 0xFF, and characters in the ASCII range (with the exception of most control characters) by one byte in the range 0x20 to 0x7F. The remaining byte values are used as commands to reposition the window or to switch to an uncompressed mode in which the following bytes are interpreted as UTF-16. This mode is especially useful when the text uses many characters from a range of more than 128 consecutive characters, as in Chinese.
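The window idea can be illustrated with a minimal Python sketch. It assumes a single window that never moves and emits no commands, so it is not a complete SCSU encoder; the function name encode_with_window is hypothetical.

```python
def encode_with_window(text: str, window_start: int) -> bytes:
    """Encode text with one fixed 128-character window and no commands.

    Illustrative only: a real SCSU encoder also emits commands to move
    windows or to switch modes.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0x20 <= cp <= 0x7F:
            # ASCII range: the byte value equals the code point.
            out.append(cp)
        elif window_start <= cp < window_start + 128:
            # Inside the window: offset within the window plus 0x80.
            out.append(0x80 + (cp - window_start))
        else:
            raise ValueError(f"U+{cp:04X} is outside ASCII and the window")
    return bytes(out)


# "Βικιπαίδεια", written with escapes to avoid normalization surprises.
GREEK = "\u0392\u03b9\u03ba\u03b9\u03c0\u03b1\u03af\u03b4\u03b5\u03b9\u03b1"
print(encode_with_window(GREEK, 0x0370).hex(" ").upper())
```

Every character costs exactly one byte here, which is the saving the scheme aims for as long as the text stays within ASCII and one window.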

Algorithm

This idea is implemented using the following procedure. The method defines how the Unicode text can be recovered from an SCSU byte stream. For encoding, different algorithms can be used, as long as they produce a result that can be decoded correctly. How such an algorithm is designed depends, among other things, on whether more importance is placed on fast encoding or on good compression.

Windows

The algorithm uses two kinds of windows: static windows, whose positions are fixed by the algorithm, and dynamic windows, whose positions can be changed as needed. There are eight windows of each kind, numbered 0 to 7. The position of a window is specified by the code point of its first character.

Static windows

The eight static windows are defined as follows:

Dynamic windows

The starting positions of the eight dynamic windows are:

Dynamic window 0 is active at the start.
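As a rough reference, the window positions can be written down as two Python lists. The values are given here as they are understood from UTS #6 and should be checked against the standard; the names STATIC_WINDOWS, DYNAMIC_WINDOWS_DEFAULT and window_range are hypothetical.

```python
# Start code points of the eight static windows (index = window number).
# Assumption: values as understood from UTS #6; verify against the standard.
STATIC_WINDOWS = [
    0x0000,  # 0: ASCII range, used for quoting
    0x0080,  # 1: Latin-1 Supplement
    0x0100,  # 2: Latin Extended-A
    0x0300,  # 3: Combining Diacritical Marks
    0x2000,  # 4: General Punctuation
    0x2080,  # 5: Currency Symbols
    0x2100,  # 6: Letterlike Symbols and Number Forms
    0x3000,  # 7: CJK Symbols and Punctuation
]

# Initial start code points of the eight dynamic windows.
DYNAMIC_WINDOWS_DEFAULT = [
    0x0080,  # 0: Latin-1 Supplement (active at the start)
    0x00C0,  # 1: Latin-1 letters and Latin Extended-A
    0x0400,  # 2: Cyrillic
    0x0600,  # 3: Arabic
    0x0900,  # 4: Devanagari
    0x3040,  # 5: Hiragana
    0x30A0,  # 6: Katakana
    0xFF00,  # 7: Fullwidth ASCII
]


def window_range(start: int) -> tuple[int, int]:
    """Return the first and last code point of a 128-character window."""
    return start, start + 127


print([hex(cp) for cp in window_range(DYNAMIC_WINDOWS_DEFAULT[6])])
```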

There are several commands for changing the position of a dynamic window. The two simple commands (SDn and UDn) specify the new position of the window with one byte, according to the following table:
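The mapping from the position byte to the window position can be sketched as a small Python function. The ranges and special values are given as they are understood from UTS #6 and should be verified there; the function name window_offset is hypothetical.

```python
# Special position bytes that select windows not aligned to a multiple of 0x80.
# Assumption: values as understood from UTS #6; verify against the standard.
SPECIAL_OFFSETS = {
    0xF9: 0x00C0,  # Latin-1 letters and Latin Extended-A
    0xFA: 0x0250,  # IPA Extensions
    0xFB: 0x0370,  # Greek
    0xFC: 0x0530,  # Armenian
    0xFD: 0x3040,  # Hiragana
    0xFE: 0x30A0,  # Katakana
    0xFF: 0xFF60,  # Halfwidth Katakana
}


def window_offset(x: int) -> int:
    """Translate the byte following SDn or UDn into a window start."""
    if 0x01 <= x <= 0x67:
        return x * 0x80               # 0x0080 .. 0x3380
    if 0x68 <= x <= 0xA7:
        return x * 0x80 + 0xAC00      # 0xE000 .. 0xFF80
    if x in SPECIAL_OFFSETS:
        return SPECIAL_OFFSETS[x]
    raise ValueError(f"position byte 0x{x:02X} is reserved")


print(hex(window_offset(0xFB)))
```

The gap between the two linear ranges skips the areas in which a 128-character window is of little use, which matches the remark that Chinese and Korean characters are better handled in two-byte mode.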

The two extended commands (SDX and UDX) define the window using two bytes. The top three bits give the number of the window; the remaining 13 bits are multiplied by 0x80 and added to 0x10000, and the result is taken as the first character of the window.
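The two argument bytes of an extended command can be unpacked as in the following sketch. It assumes, as understood from UTS #6, that the 13-bit value counts in steps of 0x80 above 0x10000; the function name extended_window is hypothetical.

```python
def extended_window(hbyte: int, lbyte: int) -> tuple[int, int]:
    """Split the two bytes after SDX or UDX into (window number, window start).

    Assumption: the 13-bit value is scaled by 0x80 before 0x10000 is added.
    """
    window = hbyte >> 5                       # top three bits
    value13 = ((hbyte & 0x1F) << 8) | lbyte   # remaining 13 bits
    return window, 0x10000 + value13 * 0x80


print(extended_window(0x20, 0x01))
```

With this scaling, the largest argument (0xFF 0xFF) gives window 7 at 0x10FF80, whose 128 characters end exactly at U+10FFFF, the last Unicode code point.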

Modes

The algorithm uses two different modes. Initially it is in single-byte mode, in which characters are encoded by a single byte. Byte values in the range 0x20 to 0x7F, as well as 0x00 (NUL), 0x09 (horizontal tab), 0x0A (LF) and 0x0D (CR), are interpreted as characters in static window 0; values in the range 0x80 to 0xFF are interpreted as characters in the active dynamic window. All other bytes are interpreted as commands.

The other mode is the two-byte mode. Here, with a few exceptions, all pairs of bytes are interpreted as UTF-16BE-encoded characters; only a few byte values represent commands.

Commands

In single-byte mode, the following byte values represent commands:

If a control character is to be encoded whose byte value represents a command, the SQ0 command can be used.
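A minimal Python sketch of a decoder for single-byte mode follows. It handles only literal bytes and the commands SQn, SCn and SDn; the command byte values and window tables are given as they are understood from UTS #6 and should be checked there. SDX, SQU and the switch to two-byte mode are deliberately left out, and all names are hypothetical.

```python
# Assumption: tables and command values as understood from UTS #6.
STATIC = [0x0000, 0x0080, 0x0100, 0x0300, 0x2000, 0x2080, 0x2100, 0x3000]
DEFAULT_DYNAMIC = [0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00]
SPECIAL = {0xF9: 0x00C0, 0xFA: 0x0250, 0xFB: 0x0370, 0xFC: 0x0530,
           0xFD: 0x3040, 0xFE: 0x30A0, 0xFF: 0xFF60}


def window_offset(x: int) -> int:
    """Translate the position byte of SDn into a window start."""
    if 0x01 <= x <= 0x67:
        return x * 0x80
    if 0x68 <= x <= 0xA7:
        return x * 0x80 + 0xAC00
    if x in SPECIAL:
        return SPECIAL[x]
    raise ValueError(f"position byte 0x{x:02X} is reserved")


def decode_single_byte_mode(data: bytes) -> str:
    """Decode an SCSU stream that never leaves single-byte mode."""
    dynamic = list(DEFAULT_DYNAMIC)  # window positions can be redefined
    active = 0                       # dynamic window 0 is active at the start
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        i += 1
        if b >= 0x80:
            # Character in the active dynamic window.
            out.append(chr(dynamic[active] + (b - 0x80)))
        elif b >= 0x20 or b in (0x00, 0x09, 0x0A, 0x0D):
            # Character in static window 0.
            out.append(chr(b))
        elif 0x01 <= b <= 0x08:
            # SQn: quote one character from window n without switching.
            n = b - 0x01
            q = data[i]
            i += 1
            if q < 0x80:
                out.append(chr(STATIC[n] + q))
            else:
                out.append(chr(dynamic[n] + (q - 0x80)))
        elif 0x10 <= b <= 0x17:
            # SCn: make dynamic window n the active window.
            active = b - 0x10
        elif 0x18 <= b <= 0x1F:
            # SDn: redefine dynamic window n and make it active.
            active = b - 0x18
            dynamic[active] = window_offset(data[i])
            i += 1
        else:
            raise NotImplementedError(f"command 0x{b:02X} is not handled in this sketch")
    return "".join(out)


print(decode_single_byte_mode(bytes.fromhex("61 05 13 62")))
```

In the usage line, the bytes 05 13 stand for SQ4 followed by offset 0x13 in static window 4, so a single dash from the General Punctuation block is inserted without changing the active window.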

In two-byte mode, the following byte values represent commands when they occur as the first byte of a possible byte pair:

If a character (from the private use area) is to be encoded whose first byte is occupied by a command, the UQU command can be used.
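The following Python sketch encodes a whole text in two-byte mode. It assumes, as understood from UTS #6, that SCU (0x0F) switches from single-byte to two-byte mode, that UQU is 0xF0, and that the first bytes 0xE0 to 0xF2 are reserved for commands in two-byte mode; the function name encode_unicode_mode is hypothetical.

```python
SCU = 0x0F  # assumed value: switch from single-byte mode to two-byte mode
UQU = 0xF0  # assumed value: quote the following UTF-16 code unit


def encode_unicode_mode(text: str) -> bytes:
    """Encode text as SCU followed by UTF-16BE, quoting command-like units."""
    out = bytearray([SCU])
    utf16 = text.encode("utf-16-be")
    for i in range(0, len(utf16), 2):
        hi, lo = utf16[i], utf16[i + 1]
        if 0xE0 <= hi <= 0xF2:
            # The first byte would be read as a command, so the unit is
            # quoted. This affects only private use code points
            # U+E000..U+F2FF.
            out.append(UQU)
        out.extend((hi, lo))
    return bytes(out)


print(encode_unicode_mode("\u4e2d\u6587").hex(" ").upper())
```

A text without private use characters thus costs one byte more than UTF-16, while a text consisting only of quoted private use characters needs three bytes per character instead of two.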

Properties

The method has a few properties that were deliberately chosen:

  • Texts that consist exclusively of Latin-1 characters without control characters are left unchanged.
  • For texts without characters from the private use area, it is always possible to switch to two-byte mode with one extra byte, so that the storage requirement in this case corresponds to that of UTF-16.
  • Even in the worst case, the storage requirement is only 1.5 times that of UTF-16.
  • With optimal encoding, typical texts are shorter than in UTF-8 or UTF-16. The size of the saving depends on the language: while English and French texts need just as much space in SCSU as in UTF-8, the size is reduced to 85% for Korean, 70% for Chinese, 55% for Greek, Russian, Arabic, Hebrew and Japanese, and even 40% for Hindi.

The following properties can be problematic in some applications:

  • The compressed byte stream may contain zero bytes; for this reason, among others, the encoding is not MIME-compliant. BOCU-1 can be used instead in such cases.
  • The same text can be encoded in different ways.
  • Texts that contain few distinct characters spread over several non-contiguous ranges cannot be compressed well. This is the case for Vietnamese.

Possible encodings

Sequences of characters from the ASCII range and from the predefined dynamic windows are encoded most efficiently in single-byte mode. Where no suitable predefined window exists, a dynamic window that is not needed can be redefined. Apart from Chinese and Korean characters, most ranges can be selected as a dynamic window.

For sequences of characters that lie outside such small ranges, the encoder should switch to two-byte mode.

Individual characters that lie in a window that is not currently active can be encoded with the SQn command; individual characters that lie in no defined window can be encoded with the SQU command.

Examples

German

57 69 6B 69 70 65 64 69 61 20 (Wikipedia, followed by a space)
05 13 (SQ4, –)
20 64 69 65 20 66 72 65 69 65 20 (die freie, with surrounding spaces)
45 6E 7A 79 6B 6C 6F 70 E4 64 69 65 (Enzyklopädie)

Except for the dash, the encoding agrees with ISO 8859-1.

Greek

18 FB (SD0 and its position byte)
A2 C9 CA C9 D0 C1 BF C4 C5 C9 C1 (Βικιπαίδεια)

The encoding requires only two bytes more than ISO 8859-7, but is shifted by 0x20 relative to it.

Japanese

The example uses characters from several different scripts:

  • Latin letters and punctuation marks that are in static window 0
  • Katakana from dynamic window 6
  • Isolated hiragana from dynamic window 5
  • CJK characters that are not in any possible window
  • Fullwidth punctuation marks from dynamic window 7
  • CJK punctuation from static window 7

One of many possible encodings works as follows. Most of the time, dynamic window 6 (katakana) is active. Individual characters from other ranges are encoded without a permanent window change. For longer sequences of CJK characters, the encoder switches to two-byte mode; it switches back to single-byte mode only when longer sequences of hiragana or katakana have to be encoded.

Use

One of the main problems of the method is to find a good compression algorithm and to carry it out. Since it is usually more important to save computing time than storage space, the effort of SCSU compression is not worthwhile for most applications compared with UTF-8 or UTF-16. In addition, the lack of SCSU support in web browsers and other programs made SCSU unattractive, which in turn, in a kind of self-fulfilling prophecy, meant that the encoding continued to go unsupported.

Sources

  • Asmus Freytag et al.: Unicode Technical Standard #6: A Standard Compression Scheme for Unicode. (online)
  • Doug Ewell: Unicode Technical Note #14: A Survey of Unicode Compression. (online)