Unicode equivalence

The Unicode standard defines several normal forms for Unicode strings, together with normalization algorithms that convert a string into such a normal form. Normalization is necessary because many characters can be represented by more than one sequence of Unicode code points. Only when the strings being compared are in the same normal form can one decide whether or not they represent the same text.
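As a minimal sketch of why this matters, the following Python snippet compares two encodings of "é" using the standard library module unicodedata. The helper name same_text is hypothetical and chosen only for this illustration.

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point
combining = "e\u0301"    # e followed by COMBINING ACUTE ACCENT


def same_text(a, b, form="NFC"):
    """Compare two strings after bringing both into the same normal form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# A plain comparison sees two different code point sequences.
print(precomposed == combining)
# After normalization the two strings compare equal.
print(same_text(precomposed, combining))
```

Any of the four normal forms would do here, as long as both strings are converted to the same one.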

Normal Forms

There are four normal forms: two for canonical equivalence and two for compatibility equivalence, each in a decomposed and a composed variant.

If two strings are canonically equivalent, they represent exactly the same content; only different sequences of Unicode characters have been chosen. There are several reasons why multiple representations exist:

  • For many letters with diacritical marks, the Unicode standard defines a separate precomposed character. Such characters can equally be represented as the base letter followed by a combining diacritical mark.
  • If a character is followed by several combining characters that attach at different positions on the base character, their order does not matter.
  • Some characters are included twice in the standard, such as Å, which is encoded both at U+00C5 as "Latin capital letter A with ring above" and at U+212B as "Angstrom sign".
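The three cases can be checked with Python's unicodedata module. This is only an illustrative sketch; the variable names are made up for the example.

```python
import unicodedata

# Three canonically equivalent spellings of Å.
variants = {
    "precomposed": "\u00c5",     # LATIN CAPITAL LETTER A WITH RING ABOVE
    "angstrom sign": "\u212b",   # ANGSTROM SIGN
    "base + mark": "A\u030a",    # A followed by COMBINING RING ABOVE
}
nfc_forms = {unicodedata.normalize("NFC", s) for s in variants.values()}
print(nfc_forms)  # a set with a single element

# Marks attached at different positions (below and above) may come in either order.
below_then_above = "a\u0323\u0301"   # dot below, then acute
above_then_below = "a\u0301\u0323"   # acute, then dot below
same_after_nfd = (
    unicodedata.normalize("NFD", below_then_above)
    == unicodedata.normalize("NFD", above_then_below)
)
print(same_after_nfd)
```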

If two strings are only compatibility equivalent, they still represent the same content, but their presentation may differ slightly. The following kinds of difference occur:

  • Superscripts and subscripts: The superscript digit 2 (², U+00B2) is a compatibility variant of the digit 2, as is the subscript digit 2 (₂, U+2082).
  • Different font: The double-struck capital Z (ℤ, U+2124) corresponds to the ordinary Z, only in a different font.
  • Initial, medial, final, or isolated form of a character: Unicode encodes only one character for each Arabic letter, even though its shape depends on its position in the word; in addition, however, each of these forms is encoded as a separate character. These characters are marked as compatibility variants of the character preferred by the Unicode standard.
  • No-break: Some characters differ from one another only under the Unicode line breaking algorithm; for example, the no-break space is a space that does not permit a line break.
  • Circled: The Unicode block Enclosed Alphanumerics and other blocks contain many circled characters, such as the circled digit 1 (①, U+2460), which is a variant of the ordinary 1.
  • Fractions: Fractions such as ½ (U+00BD) can also be written using the fraction slash (U+2044).
  • Different width or orientation, squared: East Asian typography has characters in different widths, as well as characters rotated by 90° for vertical layout, which exist alongside the usual representation.
  • Other: Some compatibility mappings fall into none of these categories, including the resolution of ligatures.
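A few of these compatibility variants can be inspected in Python. The sketch below applies NFKC to one sample character per category; the selection of samples and the dictionary name are assumptions made for this illustration.

```python
import unicodedata

# Each entry: description -> (compatibility character, expected NFKC result)
samples = {
    "superscript two": ("\u00b2", "2"),
    "subscript two": ("\u2082", "2"),
    "double-struck Z": ("\u2124", "Z"),
    "Arabic alef, isolated form": ("\ufe8d", "\u0627"),
    "no-break space": ("\u00a0", " "),
    "circled digit one": ("\u2460", "1"),
    "fraction one half": ("\u00bd", "1\u20442"),   # 1, FRACTION SLASH, 2
    "fullwidth A": ("\uff21", "A"),
    "ligature fi": ("\ufb01", "fi"),
}

for name, (char, expected) in samples.items():
    nfkc = unicodedata.normalize("NFKC", char)
    nfc = unicodedata.normalize("NFC", char)
    # The canonical form leaves the character alone; the compatibility form replaces it.
    print(f"{name}: NFC unchanged={nfc == char}, NFKC matches={nfkc == expected}")
```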

In the decomposed forms, every character that can be represented with the help of combining characters is decomposed; in the composed forms, a single character is chosen for a sequence of base character and combining characters wherever this is possible.

The four normal forms are: canonical decomposition (NFD), canonical decomposition followed by canonical composition (NFC), compatibility decomposition (NFKD), and compatibility decomposition followed by canonical composition (NFKC).
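To make the difference between the four forms concrete, the following sketch normalizes a two-character string (Å followed by the ligature ﬁ) with Python's unicodedata.normalize and prints the resulting code points. The sample string is chosen only for illustration.

```python
import unicodedata

text = "\u00c5\ufb01"   # Å followed by LATIN SMALL LIGATURE FI

forms = {
    form: unicodedata.normalize(form, text)
    for form in ("NFD", "NFC", "NFKD", "NFKC")
}

for form, value in forms.items():
    print(form, " ".join(f"U+{ord(ch):04X}" for ch in value))
```

The canonical forms keep the ligature, while the compatibility forms replace it with the letters f and i; the decomposed forms split Å into A and a combining ring.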

Normalization

Converting a string into one of the four normal forms is called normalization. For this purpose, the Unicode standard defines several character properties:

  • Decomposition_Mapping gives, for each character, the character sequence into which it can be decomposed, if there is one. The property covers both canonical and compatibility decompositions.
  • Decomposition_Type indicates whether the decomposition is canonical or a compatibility decomposition. In the latter case it also indicates the kind of compatibility decomposition.
  • Canonical_Combining_Class (ccc for short) is a number between 0 and 254 that indicates, for combining characters, roughly where on the base character they attach. If two combining characters have different values, they do not interact with each other and can be reordered.
  • A character has the Full_Composition_Exclusion property if it has a canonical decomposition but should nevertheless not be used in the composed normal forms.
  • Hangul_Syllable_Type is used in Korean for decomposing syllables.
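Some of these properties can be looked up with Python's unicodedata module: decomposition() returns the mapping together with an optional type tag in angle brackets, and combining() returns the canonical combining class. The helper decomposition_info below is a hypothetical name used only in this sketch. As far as this example assumes, the module has no direct accessor for Full_Composition_Exclusion or Hangul_Syllable_Type.

```python
import unicodedata


def decomposition_info(ch):
    """Return (type, mapping) for a character, or (None, []) if it has no decomposition."""
    raw = unicodedata.decomposition(ch)
    if not raw:
        return None, []
    parts = raw.split()
    if parts[0].startswith("<"):
        # A tag such as <super> or <compat> marks a compatibility decomposition.
        kind, parts = parts[0].strip("<>"), parts[1:]
    else:
        kind = "canonical"
    return kind, [chr(int(p, 16)) for p in parts]


print(decomposition_info("\u00c5"))    # canonical: A + combining ring above
print(decomposition_info("\u00b2"))    # compatibility decomposition of type "super"
print(unicodedata.combining("\u0301"))  # ccc of COMBINING ACUTE ACCENT
print(unicodedata.combining("\u0323"))  # ccc of COMBINING DOT BELOW
print(unicodedata.combining("A"))       # base characters have ccc 0
```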

Conversion into one of the normal forms proceeds in the following steps:

In the first step, the string is fully decomposed: for each character, it is determined whether a decomposition exists, and if so the character is replaced by it. This step is applied repeatedly, because the characters into which a character decomposes may themselves be decomposable.

For the canonical normal forms, only canonical decompositions are used; for the compatibility normal forms, both canonical and compatibility decompositions are used. The decomposition of Korean syllables into individual jamo is carried out by a separate algorithm.
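The decomposition step can be sketched in Python on top of unicodedata.decomposition. This is a simplified illustration, not a full normalizer: the function name decompose is hypothetical, Hangul syllables are left untouched (unicodedata.decomposition returns an empty string for them), and the reordering of combining characters is not performed here.

```python
import unicodedata


def decompose(text, compatibility=False):
    """Recursively replace each character by its decomposition mapping."""
    out = []
    for ch in text:
        raw = unicodedata.decomposition(ch)
        is_compat = raw.startswith("<")
        # Keep the character if it has no mapping, or if the mapping is a
        # compatibility mapping and only canonical ones were requested.
        if not raw or (is_compat and not compatibility):
            out.append(ch)
            continue
        parts = raw.split()
        if is_compat:
            parts = parts[1:]   # drop the <type> tag
        mapped = "".join(chr(int(p, 16)) for p in parts)
        # The result may itself be decomposable, so recurse.
        out.append(decompose(mapped, compatibility))
    return "".join(out)


# U+01D5 decomposes to U+00DC + U+0304, and U+00DC decomposes again to U + U+0308.
print([f"U+{ord(c):04X}" for c in decompose("\u01d5")])
print(decompose("\ufb01"), decompose("\ufb01", compatibility=True))
```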

Next, the combining characters are sorted: if two characters A and B follow one another and ccc(A) > ccc(B) > 0, the two characters are swapped. This step is repeated until no adjacent pair of characters remains that can be swapped.
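The swapping rule translates almost directly into code. The following sketch (the function name canonical_order is hypothetical) repeats passes over the string until no adjacent pair satisfies ccc(A) > ccc(B) > 0; it assumes the input has already been decomposed.

```python
import unicodedata


def canonical_order(text):
    """Sort combining characters by swapping adjacent pairs with ccc(A) > ccc(B) > 0."""
    chars = list(text)
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(chars) - 1):
            a, b = chars[i], chars[i + 1]
            if unicodedata.combining(a) > unicodedata.combining(b) > 0:
                chars[i], chars[i + 1] = b, a
                swapped = True
    return "".join(chars)


# Acute (ccc 230) followed by dot below (ccc 220) gets reordered.
ordered = canonical_order("a\u0301\u0323")
print([f"U+{ord(c):04X}" for c in ordered])
```

Characters with the same combining class are never swapped, so their relative order is preserved.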

For the composed normal forms, a third step follows, the canonical composition: for each character C (starting with the second character), it is checked whether a preceding character L has the following properties:

  • ccc(L) = 0
  • For every character A between L and C, 0 < ccc(A) < ccc(C)
  • There is a Unicode character P that is not marked as Full_Composition_Exclusion and has the canonical decomposition <L, C>.

In this case, L is replaced by P and C is removed.
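The composition step can be sketched as follows. Python's unicodedata module does not expose a composition table or the Full_Composition_Exclusion property directly, so this example makes an assumption: it treats a character with a two-part canonical decomposition as a valid composite P only if NFC leaves that character unchanged. The names build_composition_table, COMPOSITIONS and compose are hypothetical, Hangul is ignored, and the input is expected to be decomposed and canonically ordered already.

```python
import sys
import unicodedata


def build_composition_table():
    """Map (L, C) pairs to their composite P, derived from decomposition data."""
    table = {}
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        raw = unicodedata.decomposition(ch)
        if not raw or raw.startswith("<"):
            continue                      # no canonical decomposition
        parts = raw.split()
        if len(parts) != 2:
            continue                      # singleton decompositions never compose
        if unicodedata.normalize("NFC", ch) != ch:
            continue                      # assumed stand-in for Full_Composition_Exclusion
        table[(chr(int(parts[0], 16)), chr(int(parts[1], 16)))] = ch
    return table


COMPOSITIONS = build_composition_table()


def compose(text):
    """Canonical composition of an already decomposed, canonically ordered string."""
    result = []
    starter = None    # index in result of the last character L with ccc(L) = 0
    last_ccc = 0      # ccc of the last character appended to result
    for ch in text:
        ccc = unicodedata.combining(ch)
        if starter is not None:
            # C is blocked if a character A between L and C has ccc(A) >= ccc(C).
            blocked = len(result) - 1 > starter and last_ccc >= ccc
            composite = None if blocked else COMPOSITIONS.get((result[starter], ch))
            if composite is not None:
                result[starter] = composite   # L is replaced by P, C is dropped
                continue
        if ccc == 0:
            starter = len(result)
        result.append(ch)
        last_ccc = ccc
    return "".join(result)


print([f"U+{ord(c):04X}" for c in compose("a\u0323\u0301")])
print([f"U+{ord(c):04X}" for c in compose("a\u0305\u0301")])
```

In the second call, the overline (ccc 230) has no composite with "a" and then blocks the acute accent, which has the same combining class, so the string stays as it is.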

To turn sequences of jamo back into syllables with their own Unicode code points, the algorithm for decomposing syllables is applied in reverse.
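The Hangul algorithm is purely arithmetic. The sketch below uses the constants given in the Unicode standard for precomposed syllables (starting at U+AC00) and conjoining jamo; the function names are hypothetical, and the composition helper assumes it is given valid leading, vowel and optional trailing jamo.

```python
import unicodedata

S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT    # syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT    # total number of precomposed syllables


def decompose_syllable(syllable):
    """Split a precomposed Hangul syllable into two or three jamo."""
    s_index = ord(syllable) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return syllable                     # not a precomposed syllable
    lead = chr(L_BASE + s_index // N_COUNT)
    vowel = chr(V_BASE + (s_index % N_COUNT) // T_COUNT)
    t_index = s_index % T_COUNT
    trail = chr(T_BASE + t_index) if t_index else ""
    return lead + vowel + trail


def compose_jamo(lead, vowel, trail=""):
    """Inverse of decompose_syllable: combine jamo into one syllable."""
    l_index = ord(lead) - L_BASE
    v_index = ord(vowel) - V_BASE
    t_index = ord(trail) - T_BASE if trail else 0
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


jamo = decompose_syllable("\ud55c")         # the syllable 한
print([f"U+{ord(c):04X}" for c in jamo])
print(compose_jamo(*jamo) == "\ud55c")
```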

Properties

Text consisting only of ASCII characters is already in all four normal forms; text consisting only of Latin-1 characters is in NFC.
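Both statements can be checked with a short Python sketch that compares each string with its normalized version. The helper name unchanged_by is hypothetical.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def unchanged_by(form, text):
    """True if normalizing text to the given form leaves it as it is."""
    return unicodedata.normalize(form, text) == text


ascii_text = "".join(chr(cp) for cp in range(0x80))     # all ASCII characters
latin1_text = "".join(chr(cp) for cp in range(0x100))   # all Latin-1 characters

ascii_in_all_forms = all(unchanged_by(form, ascii_text) for form in FORMS)
latin1_in_nfc = unchanged_by("NFC", latin1_text)
latin1_in_nfd = unchanged_by("NFD", latin1_text)

print(ascii_in_all_forms, latin1_in_nfc, latin1_in_nfd)
```

Latin-1 text is generally not in NFD, because accented letters such as é have canonical decompositions.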

The concatenation of two strings in normal form is not necessarily in normal form itself; converting between lowercase and uppercase letters can also take a string out of normal form.
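The concatenation case can be illustrated with a small Python sketch (the case conversion case is not shown here). Both parts are in NFC on their own, but joining them produces a sequence that NFC would compose. The helper name is_nfc is hypothetical.

```python
import unicodedata


def is_nfc(text):
    """True if text is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", text) == text


left = "e"         # plain ASCII letter
right = "\u0301"   # a lone COMBINING ACUTE ACCENT
joined = left + right

print(is_nfc(left), is_nfc(right), is_nfc(joined))
```

A practical consequence is that a string built by concatenation may need to be normalized again around the join.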

All normalizations are idempotent: applying one a second time leaves the string unchanged. In addition, any sequence of normalizations can be replaced by a single normalization. The result is a compatibility normalization if any of the normalizations involved is a compatibility normalization; otherwise it is canonical.
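These properties can be spot-checked in Python for a sample string. The sketch below is an illustration on one input, not a proof; the sample and the variable names are chosen only for this example.

```python
import unicodedata

norm = unicodedata.normalize
sample = "\ufb01\u00c5\u00b2 e\u0301"   # ligature fi, Å, superscript 2, e + combining acute

# Idempotence: applying the same normalization twice changes nothing further.
idempotent = all(
    norm(form, norm(form, sample)) == norm(form, sample)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
)

# A chain collapses to a single normalization: compatibility wins if it occurs
# anywhere in the chain, and the last step decides composed versus decomposed.
chains_collapse = (
    norm("NFC", norm("NFKD", sample)) == norm("NFKC", sample)
    and norm("NFD", norm("NFC", sample)) == norm("NFD", sample)
    and norm("NFD", norm("NFKC", sample)) == norm("NFKD", sample)
)

print(idempotent, chains_collapse)
```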

The Unicode standard provides several properties that make it possible to test efficiently whether or not a given string is in a normal form.
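In Python 3.8 and later, unicodedata.is_normalized offers such a test without the caller having to build the normalized string first. Whether a given implementation takes a quick-check shortcut internally is an implementation detail and is only assumed here.

```python
import unicodedata

composed = "\u00e9"      # é as one code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

checks = {
    ("NFC", "composed"): unicodedata.is_normalized("NFC", composed),
    ("NFC", "decomposed"): unicodedata.is_normalized("NFC", decomposed),
    ("NFD", "composed"): unicodedata.is_normalized("NFD", composed),
    ("NFD", "decomposed"): unicodedata.is_normalized("NFD", decomposed),
}

for (form, name), result in checks.items():
    print(form, name, result)
```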

Stability

For backward compatibility, it is guaranteed that a string that is in a normal form will still be in that normal form in future versions of the Unicode standard, provided it contains no unassigned characters.

Since version 4.1, it is additionally guaranteed that the normalization itself does not change. Before that, there had been some corrections, with the result that strings had different normal forms in different versions.

For applications that require absolute stability even across this version boundary, there are simple algorithms for converting between the different normalizations.

Applications

The most commonly used normal form in applications is NFC. Among other things, it is recommended by the World Wide Web Consortium for XML and HTML, and it is used for JavaScript, where the code is converted into this form before further processing.

The canonical normalizations ensure that equivalent data is not stored in different forms, and thus provide for consistent data management.

The compatibility normalizations can be used, for example, for a search in which small visual differences should not matter. More general normalizations can build on Unicode normalization.
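A search key along these lines might be sketched in Python as follows. The function name search_key is hypothetical, and adding case folding on top of NFKC is one possible choice of a more general normalization, not something the standard prescribes for search.

```python
import unicodedata


def search_key(text):
    """Build a comparison key: NFKC, then case folding, then NFKC again."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    # Case folding can produce text that is no longer normalized, so normalize once more.
    return unicodedata.normalize("NFKC", folded)


document = "\uff26\uff49\uff4c\uff45 \u2460"   # fullwidth "File", then circled digit one
query = "\ufb01le 1"                           # "file 1" written with the fi ligature

print(search_key(document))
print(search_key(query) == search_key(document))
```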

Sources

  • Julie D. Allen et al.: The Unicode Standard. Version 6.2 - Core Specification. The Unicode Consortium, Mountain View, CA, 2012. ISBN 978-1-936213-07-8.
  • Mark Davis, Ken Whistler: Unicode Standard Annex #15: Unicode Normalization Forms. Revision 37.