Unicode Collation Algorithm

The Unicode Collation Algorithm ( UCA short ) is published by the Unicode Consortium algorithm to compare strings of Unicode characters and to arrange alphabetically. He is deliberately kept open so that language-specific features and special user requirements can be taken into account. For algorithm also has a table with default values for the sort available to the Default Unicode Collation Element Table ( Ducet ). In addition, provides the Common Locale Data Repository corresponding tables for many other languages , which are used for example in the implementation of the ICU project.

4.1 Examples 4.1.1 Simple letters
4.1.2 accented characters, ligatures and letter combinations
4.1.3 Control Characters, symbols and digits
4.1.4 Set and spaces

Requirements

The large number of characters that are encoded in Unicode, making it difficult to sort strings. So is not clear about whether words from Greek letters should be from such Cyrillic letters or vice versa. Also corresponds to the order in which the individual characters are encoded, are not always at the desired order of the sort.

There are different views about the sequence of individual letters in different languages. For example, words are in German encyclopedias such sorted with å, as there would be an ordinary a, while in Scandinavian languages, the å is a separate letter that comes after z. Even within a language such differences can occur, for example in German, where the ä partly as an a, partly as a ae is treated.

It also happens that can not be sorted easily by each character. Thus, the digraph ch is in the traditional Spanish treated as a separate letter that comes between c and d.

In addition, certain additional requirements may be imposed depending on the application, for example, St. could be sorted as Santa.

History

Authors of the algorithm are Mark Davis and Ken Whistler. The first version was published on 30 March 1997. For as of September 2012, the algorithm is available in version 26. With ISO 14651, there is a similar algorithm of the ISO, but offers fewer options. With newer versions, more and more configuration options were added, such as the possibility to sort numbers numerically.

Algorithm

To compare two character strings, the algorithm proceeds in several steps, and compares the two strings at different levels. The number and meaning of levels can be individually chosen, however, are standard three levels with the following meanings:

Level 1: basic letter

On the first level, the strings are compared according to their basic letters. Accents, capital and small letters, punctuation marks, and similar agents are usually ignored. So the words trash and Mull are considered the same at this level, the word Muli comes but before that.

Level 2: Accents

If the words in the basic letter agreement are compared next accents. In almost all languages is thereby sought from left to right after the first difference, and then after a fixed order by: unaccented letters first, the other accents in a fixed order. This results in approximately the order cote - coté - côte - côté. An exception is the Canadian French: Here is traditionally after the last difference by: cote - côte - coté - côté.

Level 3: Upper and lower case

If the words in the accents match, is sorted according to the case- insensitive, taking mostly the lower-case letters are sorted before the uppercase letters.

Further levels

If required, additional levels can be connected to allow an even finer distinction. It is frequently ranked as statements according to the individual code points. This ensures that two different strings are always sorted in the same order.

Grading weights

To compare strings provides the algorithm to a string of Unicode characters a binary sort key. This key is then used in the actual sorting algorithm in the comparison. To determine the sort key a table is used, which lists for individual characters or character combinations for all levels of binary weights, called a Collation Element Table. An entry can thereby also be assigned multiple weights per level. A weight of 0000 means that the corresponding character is to be ignored at this level. For characters that are not listed in the table, a weight has to be calculated. For the standard table, this is only the case at CJKV characters, for the calculation of the weights, a specified algorithm.

First, the string is broken into pieces as long as possible, have an entry in the table, and read the corresponding weights from the table. These weights are first hung for each level individually to each other, being 0000 omitted. For a Canadian- French sorting the order in level 2 is reversed. The key for each level are finally hung separated by 0000 into a single sort key together.

Examples

Most of these examples of entries of a collation element table are taken from the Ducet in version 6.1.0. Indicated here are the respective weights of the first three levels as hexadecimal numbers. The weights are separated by dots and enclosed in square brackets.

Simple letters

Accented characters, ligatures and letter combinations

Letters with diacritical characters are decomposed into a base character and combining characters following.

Control characters, symbols and numerals

Set and spaces

With block and spaces there are different ways to choose the weights.

In many implementations, such as PHP, these characters are weighted as indicated in the table.

It was originally planned, these characters not to be considered as the default behavior, so as to give them control characters the weight [ 0000.0000.0000 ].

In the current version of the algorithm, something similar is required as the default behavior: Also here is chosen as weight [ 0000.0000.0000 ], but is it still added a fourth level, used in weight than the specified value actually for level 1, while others the characters in this level receive the highest possible value FFFF.

In addition, there are further variations between which the user can choose.

Adaptation

The default table may be adjusted in various ways:

It can be language-specific changes to the individual weights are made and additional combinations will be added with weights. For many languages according to custom tables are already available. For customizations will have a separate syntax to specify this, which can be translated by appropriate programs in a table with corresponding sorting weights.
If required, can be specified that at the second level is to be sorted from the end of the string, as it is usual in the Canadian French. In principle, this is also possible for other levels, if not meaningful.
It can be chosen to be held on as many levels of comparison. This number is called strength. The default is 3, but if the table specifies weights for more levels, a larger number can be selected. It can also be a smaller number can be selected when a short sort key has priority over a detailed sorting.
For certain characters (usually spaces and punctuation) weights can choose between different versions of the.

The standard describes still many more options.

Variants

There are various methods to modify the algorithm, as in order to obtain shorter binary key. Thus, it is possible to dispense with the divisive 0000, provided that the weights of level decrease to level. In addition, there are other variants that lead to greater savings.

Example

It should define the Nina, Nino, Nino, Nino and Ninu be placed in alphabetical order. They are broken down into individual letters, determines their weights, and it composed the sort key. In the key the second level is the part that belongs to the first level, in blue, green, the third yellow.

Nina comes first, the word is already differ on the first level from the following word Nino (underscore ). This, in turn, agrees to the first two levels with NINO, until the third level there is a difference. The next word Niño, the key is longer than in the other words, because of the tilde in the second and third level, that Tilde also provides the first difference to the previous word. At last comes Ninu, which in turn is different already at the first level of the preceding words.

If you were to sort these words by code points, instead, would result in the order NINO - Nina - Nino - Ninu - Niño.

Search algorithm

Parts of the algorithm can be found, for example if words are to be found with ß with a search for ss also for text Search application. A match is then to be recognized in this case, if the search word and the possible matches on the first level match.

International Components for Unicode Greek alphabet Digraph (orthography) Ch (Digraph) Colon (punctuation) Full stop Apostrophe

792533