Combining character

Combining characters (also called combining marks) are, in digital typography, special characters that are normally not displayed on their own but are joined with the preceding character to form a single character. They are mainly used to form diacritical marks. For example, the lowercase letter y followed by the character combining breve yields y̆, a character that could not be represented in Unicode without combining characters. Conceptually, combining characters can therefore be compared with dead keys on a keyboard.

Formal foundations

The scope and use of combining characters differ between character encodings. ISO 6937, for instance, defines a series of combining characters for diacritical marks but permits only certain combinations. For a complete rendering it is therefore sufficient if the font used provides dedicated glyphs for these combinations. Alternatively, the encoding can be read as one in which plain letters are represented by one byte, whereas letters with diacritics are represented by a sequence of two bytes. In this standard, contrary to the usual practice, the combining characters precede the letter with which they are combined.

Combining characters are not used only for diacritical marks. ISCII-1988, for example, uses combining characters for the vowel signs of various Indian scripts.

Unicode provides the most extensive collection of combining characters, along with a set of rules for rendering them. Unicode permits any combination of base characters and combining characters, and several combining characters may follow a single base character. For rendering, it is therefore not sufficient for the font to contain a few additional glyphs; information about the dimensions of the individual characters is also needed in order to assemble the base character with its combining characters. This is achieved, for example, by the OpenType concept.

In the Unicode Standard, combining characters are identified by their General Category M (Mark). This in turn is divided into three subcategories: Nonspacing Mark (Mn) for combining characters that usually need no space of their own (such as diacritics), Enclosing Mark (Me) for combining characters that completely enclose the base character, and Spacing Combining Mark (Mc) for combining characters that need their own space (such as Indic vowel signs).

Furthermore, each character is assigned a Combining_Class property. This is an integer between 0 and 255 that essentially indicates the position at which the combining character is attached to the base character. For example, combining characters placed above the base character generally have the value 230, and those placed below it the value 220. For ordinary, non-combining characters the value is always 0, but there are also some combining characters with this value.
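Both properties can be inspected programmatically. The following is a minimal sketch using Python's standard unicodedata module; the helper name describe and the choice of sample characters are illustrative assumptions, not part of the Unicode Standard.

```python
import unicodedata

def describe(ch):
    """Return (General Category, Combining_Class) for a single character."""
    return unicodedata.category(ch), unicodedata.combining(ch)

# Sample characters chosen for illustration only.
samples = {
    "U+0306 combining breve": "\u0306",             # nonspacing, above
    "U+0323 combining dot below": "\u0323",         # nonspacing, below
    "U+20DD combining enclosing circle": "\u20dd",  # enclosing mark
    "U+093E Devanagari vowel sign aa": "\u093e",    # spacing mark
    "U+0061 Latin small letter a": "a",             # ordinary letter
}

for label, ch in samples.items():
    category, combining_class = describe(ch)
    print(f"{label}: category={category}, combining class={combining_class}")
```

The enclosing circle and the Devanagari vowel sign are examples of combining characters whose Combining_Class is 0.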

Representation

The Unicode Standard makes few binding statements about how programs should render strings with combining characters. It does, however, list the following recommendations:

If several combining characters follow a base character, they should be attached in that order from the inside outward. Thus the sequence <U+0061 Latin small letter a, U+0302 combining circumflex accent, U+0303 combining tilde> yields an a with a tilde above the circumflex (ẫ), whereas <U+0061 Latin small letter a, U+0303 combining tilde, U+0302 combining circumflex accent> yields the reverse, with the circumflex above the tilde (ã̂). An important exception to this principle concerns accents in Greek, among others. There, the grave accent should appear not above the comma but beside it (ἂ). A departure from the usual stacking can also be forced with the special combining character Combining Grapheme Joiner.
If several combining characters follow one another that are attached at different positions on the base character (for example above and below; more precisely, this depends on the Combining_Class property), their order must not matter; the result must look the same in both cases. Thus <U+0061 Latin small letter a, U+0307 combining dot above, U+0323 combining dot below> and <U+0061 Latin small letter a, U+0323 combining dot below, U+0307 combining dot above> both yield an a with one dot above and one below (ạ̇).
If typographic tradition calls for the diacritic to be placed at a different position, this is permissible. For example, a comma below a g is usually rendered as a turned comma above the g (ģ).
The dots of i, j, and some other characters with the Soft_Dotted property are removed when a combining character is placed above them.
Ideally, a program bases the positioning of combining characters on the exact shape of the base letter, so that an accent over a capital letter normally sits higher than one over a lowercase letter. The standard makes clear, however, that simple positioning at a fixed location is also acceptable.
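Whether the order of two combining characters is significant can be checked by comparing the canonical decompositions of both orders. The sketch below uses Python's standard unicodedata module; the function name same_after_reordering is a hypothetical helper introduced for illustration.

```python
import unicodedata

def same_after_reordering(base, mark1, mark2):
    """Return True if both orders of the two marks are canonically equivalent."""
    first = unicodedata.normalize("NFD", base + mark1 + mark2)
    second = unicodedata.normalize("NFD", base + mark2 + mark1)
    return first == second

# Dot above (class 230) and dot below (class 220) attach at different
# positions, so their order does not matter.
print(same_after_reordering("a", "\u0307", "\u0323"))

# Circumflex and tilde both attach above (class 230), so they stack and
# their order is significant.
print(same_after_reordering("a", "\u0302", "\u0303"))
```

This only tests canonical equivalence of the encoded sequences; how the marks are actually drawn remains a matter for the font and the rendering engine.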

Unicode has specific, comprehensive rules for rendering combining characters in Indic scripts.

In some cases a diacritical mark should extend over two or more base characters. There are two techniques for this:

First, there are so-called double combining characters, which extend not only over the preceding base character, as normal combining characters do, but also over the character that follows the double combining character. For example, <n, U+0360 combining double tilde, g> yields an ng spanned by a tilde: n͠g.

Second, there are special combining half marks. Here the first half is attached to the first base character and the second half to the second. Thus ng with a tilde can also be represented by <n, U+FE22 combining double tilde left half, g, U+FE23 combining double tilde right half>, which yields n︢g︣.
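The two techniques differ in where the combining characters sit in the encoded sequence. The following sketch builds both encodings of ng with a tilde and lists their code points; the variable names and the helper code_points are assumptions made for illustration.

```python
import unicodedata

# Double combining character: placed between the two base characters.
double_tilde = "n\u0360g"

# Combining half marks: one half after each base character.
half_marks = "n\ufe22g\ufe23"

def code_points(text):
    """List each character as 'U+XXXX NAME'."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

for label, text in (("double", double_tilde), ("halves", half_marks)):
    print(label, text)
    for entry in code_points(text):
        print("   ", entry)
```

Whether either sequence is drawn as one continuous tilde depends on the font and the rendering engine.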

To represent a combining character on its own, it should be preceded by a no-break space. The earlier recommendation to use an ordinary space was withdrawn because of problems with the processing of such spaces in XML and in other contexts. For many diacritics there are non-combining variants in the Unicode block Spacing Modifier Letters. In technical documentation, combining characters are often shown with a dotted circle (◌), which indicates the position at which the combining character is attached to the base character.
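As a minimal sketch of this convention, the hypothetical helper isolated below prefixes a combining character with a carrier character, by default the no-break space (U+00A0) and optionally the dotted circle (U+25CC). The function and constant names are assumptions for illustration.

```python
import unicodedata

NO_BREAK_SPACE = "\u00a0"
DOTTED_CIRCLE = "\u25cc"

def isolated(mark, carrier=NO_BREAK_SPACE):
    """Return the combining character attached to a carrier for standalone display."""
    if not unicodedata.category(mark).startswith("M"):
        raise ValueError("expected a combining character (General Category M)")
    return carrier + mark

# Combining acute accent shown on its own, and as in technical documentation.
print(repr(isolated("\u0301")))
print(isolated("\u0301", DOTTED_CIRCLE))
```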

Ambiguous representations

The concept of combining characters means that some characters can be represented by several different sequences of code points. This has two causes:

First, many common combinations of a base character and a diacritical mark exist as separate characters. Thus ñ can be represented as <U+006E Latin small letter n, U+0303 combining tilde>, but there is also a separate character Latin small letter n with tilde at code point U+00F1.

Second, sequences of combining characters that do not interact with each other yield the same character regardless of their order.

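Overall, the number of different representations can become very large. For the lowercase a with a circumflex and a dot below, for example, there are the following possible representations:

U+1EAD Latin small letter a with circumflex and dot below
U+00E2 Latin small letter a with circumflex, U+0323 combining dot below
U+1EA1 Latin small letter a with dot below, U+0302 combining circumflex accent
U+0061 Latin small letter a, U+0302 combining circumflex accent, U+0323 combining dot below
U+0061 Latin small letter a, U+0323 combining dot below, U+0302 combining circumflex accent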

To arrive at a unique representation (for example, to determine whether two words are the same), there are various normalization forms. For this purpose the standard specifies for each character whether it can be decomposed into a base character and combining characters, and if so, how. First, all characters are decomposed as specified; then consecutive combining characters that do not interact with each other are ordered by their Combining_Class property. This yields the canonical decomposition (NFD).
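The effect of normalization can be sketched with Python's standard unicodedata module. The example takes five encodings of a lowercase a with circumflex and dot below and shows that they collapse to a single form; the variable names are illustrative, and the NFC form (canonical decomposition followed by recomposition) is included here only for comparison.

```python
import unicodedata

# Five canonically equivalent encodings of the same character.
variants = [
    "\u1ead",         # single precomposed character
    "\u00e2\u0323",   # a with circumflex + combining dot below
    "\u1ea1\u0302",   # a with dot below + combining circumflex
    "a\u0302\u0323",  # a + circumflex + dot below
    "a\u0323\u0302",  # a + dot below + circumflex
]

# Canonical decomposition: decompose fully, then order the marks by
# Combining_Class (dot below, class 220, before circumflex, class 230).
nfd_forms = {unicodedata.normalize("NFD", v) for v in variants}

# Canonical composition, for comparison.
nfc_forms = {unicodedata.normalize("NFC", v) for v in variants}

print(len(set(variants)), "distinct encodings")
print([f"U+{ord(ch):04X}" for ch in next(iter(nfd_forms))])
print([f"U+{ord(ch):04X}" for ch in next(iter(nfc_forms))])
```

Comparing strings only after normalizing both to the same form is what makes a test such as "are these two words the same" reliable.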

Encoded characters in Unicode

As of Unicode 6.1 (January 2012), the Unicode Standard defines 1,645 combining characters, spread over several blocks.
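A count of this kind can be reproduced by tallying all code points whose General Category begins with M. The sketch below does this with Python's standard library; the function name count_combining is a hypothetical helper, and the result depends on the Unicode version bundled with the interpreter, so it will generally be higher than the figure for Unicode 6.1.

```python
import sys
import unicodedata

def count_combining():
    """Count code points whose General Category is Mn, Mc, or Me."""
    return sum(
        1
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)).startswith("M")
    )

print("Unicode version:", unicodedata.unidata_version)
print("Combining characters:", count_combining())
```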

The two blocks Combining Diacritical Marks and Combining Diacritical Marks Supplement contain diacritics intended for letters of all alphabets.

The Unicode block Combining Diacritical Marks for Symbols also contains combining characters, but these are intended for use with symbols. For example, a warning sign can be composed: <⚡, U+20E4 combining enclosing upward pointing triangle> yields ⚡⃤.

The combining half marks are in the Unicode block Combining Half Marks.

Many other blocks also contain combining characters intended specifically for use with the other characters in that block. For example, the combining characters for titlo and other Cyrillic diacritical marks are in the Cyrillic block.

