Binary Ordered Compression for Unicode

Binary Ordered Compression for Unicode (BOCU) is a family of encodings for Unicode text that is designed, first, to use as little storage as possible and, second, to preserve binary order. The best-known member, BOCU-1, is also directly compatible with MIME. However, the scheme has not become established in practice.

History

BOCU was developed in 2001 by Mark Davis and Markus Scherer for the ICU project. BOCU-1 is one application of this principle and is described in UTN #6. However, it lacks a formal definition; the encoding is specified only by the code of a C program. The BOCU algorithm is covered by a United States patent.

Idea

The idea behind BOCU is that the code points of consecutive characters usually differ only slightly. If each character is encoded as its difference from the previous character, with small differences represented by one byte and larger differences by more bytes, storage space can be saved. BOCU does not actually take the difference from the last character itself. Instead, the difference is computed against a base value, which can be determined in various ways. For example, the middle of the Unicode block containing the last character can serve as the base value, which avoids long jumps from one end of the block to the other. It is also possible not to change the base value immediately when the block changes, so that spaces or punctuation from the ASCII range that occur between characters of another block do not force a long jump back.
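The effect of choosing a block middle as the base value can be illustrated with a minimal Python sketch. This is not the BOCU-1 algorithm: the function names, the block size of 128 code points, and the initial base of 0x40 are assumptions made for illustration, and the sketch only computes the differences without turning them into bytes.

```python
def deltas_from_previous(text, initial=0x40):
    """Difference of each code point from the preceding code point.

    `initial` is an assumed starting value for illustration.
    """
    prev = initial
    out = []
    for ch in text:
        cp = ord(ch)
        out.append(cp - prev)
        prev = cp
    return out


def deltas_from_block_middle(text, block_size=128, initial=0x40):
    """Difference of each code point from the middle of the block
    (here: an assumed fixed-size block) that held the preceding one."""
    base = initial
    out = []
    for ch in text:
        cp = ord(ch)
        out.append(cp - base)
        # Move the base to the middle of the block containing this character.
        base = (cp // block_size) * block_size + block_size // 2
    return out


sample = "аяа"  # U+0430, U+044F, U+0430: jumps across most of the Cyrillic alphabet
print(deltas_from_previous(sample))      # [1008, 31, -31]
print(deltas_from_block_middle(sample))  # [1008, 15, -16]
```

After the first character, the differences against the block middle stay smaller than the differences against the previous character, which is what allows more characters to fit into the short encodings.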

In BOCU-1, the set of byte values used to encode the differences is restricted so that compatibility with MIME is guaranteed. In addition, control characters and the space character are encoded directly.

Properties

By construction, BOCU and BOCU-1 have the following properties:

  • BOCU preserves binary order. If a list of strings is sorted by code point in binary order, the BOCU-encoded byte sequences are sorted in the same order.
  • BOCU is deterministic, in contrast to SCSU: each text has exactly one encoding. However, the same character may be encoded differently at different positions.
  • BOCU-1 is MIME-compatible: the ASCII control characters NUL (0x00), LF (0x0A), CR (0x0D), and nine others are encoded as in ASCII, and these byte values are used only to encode these control characters.
  • BOCU-1 permits random access to a limited extent.
  • For ordinary text, BOCU requires about as much space as traditional character sets or as SCSU.
  • BOCU-1 requires at most 4 bytes per character.
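The order-preservation property can be checked with a toy difference encoding. The sketch below is not BOCU-1 and does not compress anything: it writes every difference as a fixed 3-byte big-endian number, and the names `toy_encode` and `OFFSET`, the block size of 128, and the initial base of 0x40 are assumptions for illustration. It only demonstrates the argument: two strings with a common prefix reach the same base value, so at the first differing character the larger code point yields the larger difference and therefore the larger bytes.

```python
OFFSET = 0x110000  # shifts every possible difference into the non-negative range


def toy_encode(text: str) -> bytes:
    """Fixed-width toy difference encoding (assumed scheme, not BOCU-1)."""
    base = 0x40  # assumed initial base value
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        # The shifted difference always fits into three bytes.
        out += (cp - base + OFFSET).to_bytes(3, "big")
        # Base for the next character: middle of an assumed 128-code-point block.
        base = (cp // 128) * 128 + 64
    return bytes(out)


words = ["zebra", "Äpfel", "apple", "app", "яблоко", "日本"]
by_code_point = sorted(words)             # Python compares str by code point
by_bytes = sorted(words, key=toy_encode)  # bytes compare in binary order
print(by_code_point == by_bytes)
```

A real BOCU encoding achieves the same ordering with variable-length byte sequences, which requires that the lead bytes themselves are assigned in the order of the differences they represent.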

Several properties limit its practical usefulness:

  • Although the BOCU algorithm was expressly designed to be simpler than SCSU, it is considerably slower in practice.
  • BOCU-1 is not backward compatible with ASCII. Texts that contain only ASCII characters require the same amount of storage in the BOCU-1 encoding, but they are represented by different byte values. This is a particular problem when the character encoding is to be declared within the document itself, as in XML.

Sources

  • Markus W. Scherer, Mark Davis: Unicode Technical Note #6: BOCU-1: MIME-Compatible Unicode Compression.
  • Doug Ewell: Unicode Technical Note #14: A Survey of Unicode Compression.
  • Patent US6737994: Binary-ordered compression for Unicode. Filed on 13 May 2002, published on 13 November 2003, applicant: IBM, inventors: Mark Edward Davis, Markus Walter Scherer.