Unicode control characters

Unicode control characters are not displayed themselves, but they influence the formatting and display of other characters. Because Unicode encodes many different writing systems, each with its own requirements for good rendering, it is sometimes necessary to steer the rendering algorithms with such invisible control characters.

Control characters affect, for example, the rendering of ligatures. Depending on whether a program forms ligatures automatically, certain control characters may be needed either to request that two letters be joined into a ligature or to prevent this.

General characteristics of control characters

Most control characters are identified as such by their General Category, although there are some exceptions. The value is Cc for general control characters and Cf for format control characters. Many control characters are also marked as default ignorable, which means that programs unable to process these characters should ignore them rather than display them.
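The General Category values can be inspected with Python's standard unicodedata module. The following is a minimal sketch; the selection of sample code points and the helper name general_category are illustrative assumptions, not part of the Unicode Standard.

```python
import unicodedata

# Illustrative selection of code points discussed in this article.
SAMPLES = [0x0000, 0x0085, 0x00AD, 0x200D, 0x2028, 0x034F]

def general_category(cp: int) -> str:
    """Return the two-letter General Category of a code point."""
    return unicodedata.category(chr(cp))

for cp in SAMPLES:
    # Cc = control, Cf = format; other values mark the exceptions.
    print(f"U+{cp:04X}  {general_category(cp)}")
```

The last two samples illustrate the exceptions: the line separator and the combining grapheme joiner act like control characters but carry neither Cc nor Cf.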

Control character ranges C0 and C1

The C0 range comprises the characters U+0000 to U+001F (decimal 0-31) together with U+007F (decimal 127); the C1 range comprises U+0080 to U+009F (decimal 128-159). As a superset of ASCII and Latin-1, Unicode takes over the C0 and C1 control characters of these standards without imposing an interpretation of its own. Only a few of these characters have a function defined in the Unicode Standard, among them the characters for line breaks.
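As a quick check, the two ranges can be compared with the set of all code points whose General Category is Cc. This is a sketch using Python's unicodedata module; the names C0, C1 and all_cc are chosen here for illustration.

```python
import sys
import unicodedata

# Ranges as given in the text: U+007F is counted with the C0 range.
C0 = list(range(0x0000, 0x0020)) + [0x007F]
C1 = list(range(0x0080, 0x00A0))

# Every code point that Python's Unicode database classifies as Cc.
all_cc = [cp for cp in range(sys.maxunicode + 1)
          if unicodedata.category(chr(cp)) == "Cc"]

print(len(C0), len(C1), len(all_cc))
print(all_cc == sorted(C0 + C1))
```

The comparison shows that the two ranges together account for every character of category Cc.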

Line breaking

For line breaking and for dividing a text into individual characters, words or sentences, Unicode defines the line breaking algorithm and a number of segmentation algorithms. In addition to the classical control characters that force the end of a line, there are control characters that signal to these algorithms where in the text a break must not occur and where one should additionally be permitted.

To prevent a break, the word joiner (U+2060) is normally used, unless a dedicated non-breaking variant exists, as is the case for spaces. Before this control character was introduced in Unicode 3.2, the zero width no-break space (U+FEFF) was used for this purpose; it now serves mainly as the byte order mark.

Conversely, to permit a break, the zero width space (U+200B) or the soft hyphen (U+00AD) is used.
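The following sketch shows how U+2060, U+200B and U+00AD might be inserted into and removed from a string in Python. The helper names allow_breaks_after, glue and strip_break_hints are hypothetical; whether a break is actually taken or suppressed depends on the rendering software.

```python
WORD_JOINER = "\u2060"       # forbids a break at its position
ZERO_WIDTH_SPACE = "\u200B"  # permits a break at its position
SOFT_HYPHEN = "\u00AD"       # permits a break, shown as a hyphen if taken

def allow_breaks_after(text: str, chars: str = "/") -> str:
    """Insert U+200B after every occurrence of one of the given characters."""
    return "".join(c + ZERO_WIDTH_SPACE if c in chars else c for c in text)

def glue(left: str, right: str) -> str:
    """Join two strings with U+2060 so that no break should occur between them."""
    return left + WORD_JOINER + right

def strip_break_hints(text: str) -> str:
    """Remove the three invisible break hints, e.g. before comparing strings."""
    table = {ord(c): None for c in (WORD_JOINER, ZERO_WIDTH_SPACE, SOFT_HYPHEN)}
    return text.translate(table)

path = allow_breaks_after("usr/share/doc")
print(len("usr/share/doc"), len(path))  # the hints are invisible but counted
```

Because the inserted characters are invisible yet change string length and equality, removing them before comparison or search is a common precaution.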

For the end of a line and of a paragraph, Unicode also provides the line separator (U+2028) and the paragraph separator (U+2029). In contrast to most other control characters, their General Category (Zl and Zp respectively) classifies them as white space.
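Python's built-in string methods treat U+2028 and U+2029 as line boundaries and as white space, which can be seen in the following short sketch (the sample text is invented for illustration):

```python
import unicodedata

LINE_SEPARATOR = "\u2028"
PARAGRAPH_SEPARATOR = "\u2029"

text = "first line" + LINE_SEPARATOR + "second line" + PARAGRAPH_SEPARATOR + "new paragraph"

# str.splitlines() splits on both separators, just as on "\n".
lines = text.splitlines()
print(lines)

# Both characters count as white space; their categories are Zl and Zp.
print(unicodedata.category(LINE_SEPARATOR), LINE_SEPARATOR.isspace())
print(unicodedata.category(PARAGRAPH_SEPARATOR), PARAGRAPH_SEPARATOR.isspace())
```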

Cursive joining and ligatures

In some writing systems, such as Arabic, the characters within a word are connected to their neighbors, so that a character can take on a different appearance depending on its position. It is also possible for two adjacent characters to be rendered as a single ligature. To force or prevent the joining of two adjacent characters in such cases, the Unicode Standard defines control characters that influence the corresponding algorithms.

These are the zero width non-joiner (U+200C) and the zero width joiner (U+200D).
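A minimal Python sketch of how U+200C and U+200D appear in a string. The "fi" example and the helper name without_join_controls are assumptions for illustration; whether a ligature is actually suppressed or formed is decided by the font and the rendering engine, not by the string itself.

```python
import unicodedata

ZWNJ = "\u200C"  # asks the renderer not to join or ligate
ZWJ = "\u200D"   # asks the renderer to join or ligate

# Request that "f" and "i" are not merged into an "fi" ligature.
no_ligature = "f" + ZWNJ + "i"

def without_join_controls(text: str) -> str:
    """Remove both join controls, e.g. before comparing or searching."""
    return text.replace(ZWNJ, "").replace(ZWJ, "")

print(unicodedata.name(ZWNJ), unicodedata.name(ZWJ))
print(len(no_ligature), without_join_controls(no_ligature) == "fi")
```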

Combining Grapheme Joiner

The combining grapheme joiner (CGJ, U+034F) is formally not a control character but a combining character. It can be used to influence directly how diacritics and digraphs are handled, in particular how they are sorted by the Unicode Collation Algorithm.
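One observable effect of U+034F on diacritics can be shown with Unicode normalization in Python: because CGJ has combining class 0, it keeps the marks on either side of it from being reordered. The choice of the two dot marks below is an assumption made for illustration; collation itself is not part of the standard library and is not shown.

```python
import unicodedata

CGJ = "\u034F"        # combining class 0
DOT_ABOVE = "\u0307"  # combining class 230
DOT_BELOW = "\u0323"  # combining class 220

plain = "a" + DOT_ABOVE + DOT_BELOW
guarded = "a" + DOT_ABOVE + CGJ + DOT_BELOW

# Canonical ordering moves the mark with the lower class first ...
nfd_plain = unicodedata.normalize("NFD", plain)
# ... unless a CGJ stands between the two marks.
nfd_guarded = unicodedata.normalize("NFD", guarded)

print([f"U+{ord(c):04X}" for c in nfd_plain])
print([f"U+{ord(c):04X}" for c in nfd_guarded])
```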

Bidirectional text

For bidirectional text there are a number of special control characters that can enforce a particular writing direction and thereby influence the presentation.
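The text does not list the individual characters. As an illustration, the following Python sketch prints the name and bidirectional class of seven such controls (marks, embeddings and overrides) as recorded in Python's Unicode database; the selection is an assumption and not exhaustive.

```python
import unicodedata

# Directional marks, embeddings and overrides (selection for illustration).
BIDI_CONTROLS = ["\u200E", "\u200F", "\u202A", "\u202B", "\u202C", "\u202D", "\u202E"]

def describe(ch: str) -> tuple:
    """Return (code point, bidirectional class, character name)."""
    return (f"U+{ord(ch):04X}", unicodedata.bidirectional(ch), unicodedata.name(ch))

for ch in BIDI_CONTROLS:
    print(*describe(ch))
```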

Deprecated format characters

Some control characters are marked as deprecated, and their use is discouraged. These are the following characters:

U+206A (inhibit symmetric swapping) and U+206B (activate symmetric swapping) disable or enable the normal behavior whereby mirrorable characters (such as parentheses) are displayed mirrored in right-to-left text when the Unicode bidirectional algorithm is applied.

U+206C (inhibit Arabic form shaping) and U+206D (activate Arabic form shaping) disable or enable the normally inactive behavior of replacing Arabic compatibility characters for particular letter shapes with the form that is actually correct in the given context.

U+206E (national digit shapes) and U+206F (nominal digit shapes) enable or disable the otherwise unimplemented replacement of the ordinary digits 0 to 9 on output with the digits customary in the user's language (Arabic, Indic, etc.).
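Since the use of the six characters U+206A to U+206F is discouraged, a program might simply remove them from incoming text. The following Python sketch does so; the function name remove_deprecated_format is hypothetical.

```python
import unicodedata

# The six deprecated format characters U+206A..U+206F.
DEPRECATED_FORMAT = [chr(cp) for cp in range(0x206A, 0x2070)]

def remove_deprecated_format(text: str) -> str:
    """Delete all deprecated format characters from the text."""
    return text.translate(dict.fromkeys(map(ord, DEPRECATED_FORMAT)))

for ch in DEPRECATED_FORMAT:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
```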

Variation selectors

Variation selectors make it possible to request particular glyphs for output even in plain text, without metadata about the desired font. Formally, variation selectors are combining characters, so they immediately follow the character for which they select a variant form. There are 259 such variation selectors defined: U+180B to U+180D are intended for use with Mongolian characters, U+FE00 to U+FE0F and U+E0100 to U+E01EF for characters in general. Exactly which changes the variation selectors cause is specified in two documents, the Unicode Ideographic Variation Database and the file StandardizedVariants.txt. For example, when the variation selector U+FE00 follows the character U+222A (union), it specifies that the union sign is to be displayed with serifs.
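The count of 259 follows directly from the three ranges, as this Python sketch shows. The names VS_RANGES, is_variation_selector and strip_variation_selectors are chosen here for illustration; the sketch covers only the ranges named in the text.

```python
# The three ranges named in the text (inclusive bounds).
VS_RANGES = [(0x180B, 0x180D), (0xFE00, 0xFE0F), (0xE0100, 0xE01EF)]

def is_variation_selector(ch: str) -> bool:
    """True if the character lies in one of the ranges above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in VS_RANGES)

def strip_variation_selectors(text: str) -> str:
    """Return the text with all variation selectors removed."""
    return "".join(ch for ch in text if not is_variation_selector(ch))

count = sum(hi - lo + 1 for lo, hi in VS_RANGES)
print(count)  # 3 + 16 + 240 = 259

# U+222A followed by U+FE00: union sign, requested with serifs.
union_with_serifs = "\u222A\uFE00"
print(strip_variation_selectors(union_with_serifs) == "\u222A")
```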

Noncharacter code points

Some code points are permanently reserved and will never be assigned a character. Apart from the last two code points of each plane (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF), these are the code points in the range U+FDD0 to U+FDEF. The byte sequence FFFE must remain unassigned so that the byte order can be recognized from the byte order mark (U+FEFF), and the byte sequence FFFF (all 16 bits set) is indistinguishable from a missing signal in various kinds of data transmission. The other code points correspond to bit patterns that are needed for internal coding purposes. These code points are therefore not control characters in the narrow sense. Programs may use them internally as they see fit, but they are not suitable for the transmission and display of characters. They must not be confused with code points that are currently unassigned but could be assigned a character in later versions.
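The rule described above (the last two code points of each plane plus U+FDD0 to U+FDEF) can be expressed compactly. A sketch in Python, with is_noncharacter as an illustrative name:

```python
def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for the last two code points of each plane."""
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The low 16 bits are FFFE or FFFF in every one of the 17 planes.
    return (cp & 0xFFFE) == 0xFFFE

noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]
print(len(noncharacters))  # 32 + 17 * 2 = 66
```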

Byte order mark

In addition to its original meaning as a character that prevents a line break, U+FEFF now has the task of indicating the byte order of a text as the byte order mark and of making automatic detection of the encoding easier.
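How U+FEFF reveals the byte order can be sketched with Python's codecs module, which provides the encoded forms of the mark as constants. The function sniff_bom is a hypothetical helper and a simplification: it inspects only the leading bytes and returns None when no mark is present.

```python
import codecs
from typing import Optional

# Longer marks first: the UTF-32-LE mark begins with the UTF-16-LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes) -> Optional[str]:
    """Guess the encoding from a leading byte order mark, if any."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding
    return None

print("\uFEFF".encode("utf-16-be"))  # b'\xfe\xff'
print("\uFEFF".encode("utf-16-le"))  # b'\xff\xfe'
print(sniff_bom("\uFEFFtext".encode("utf-16-le")))
```

Read in the wrong byte order, the mark would appear as FFFE, which is never assigned to a character, so the mismatch is detectable.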

Interlinear annotation characters

The characters in the range U+FFF9 to U+FFFB from the Unicode block Specials make it possible to insert annotations into the text, which are usually displayed above the annotated text. They allow, for instance, furigana to be marked as such. U+FFF9 (interlinear annotation anchor) introduces the annotated text, U+FFFA (interlinear annotation separator) separates it from the annotation that follows, and U+FFFB (interlinear annotation terminator) marks the end of the annotation.
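The anchor, separator and terminator structure can be illustrated with a small Python sketch that builds and parses one annotation. The helper names and the Japanese sample are assumptions for illustration; nested or multiple annotations per anchor are not handled.

```python
import re

ANCHOR, SEPARATOR, TERMINATOR = "\uFFF9", "\uFFFA", "\uFFFB"

def annotate(base: str, note: str) -> str:
    """Wrap base text and its annotation in the three control characters."""
    return f"{ANCHOR}{base}{SEPARATOR}{note}{TERMINATOR}"

_ANNOTATION = re.compile(
    f"{ANCHOR}([^{SEPARATOR}{TERMINATOR}]*){SEPARATOR}([^{TERMINATOR}]*){TERMINATOR}"
)

def extract_annotations(text: str) -> list:
    """Return (annotated text, annotation) pairs found in the text."""
    return _ANNOTATION.findall(text)

def strip_annotations(text: str) -> str:
    """Keep only the annotated base text."""
    return _ANNOTATION.sub(r"\1", text)

sample = "私は" + annotate("漢字", "かんじ") + "を読む"
print(extract_annotations(sample))
print(strip_annotations(sample))
```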

Deprecated Tags

The Unicode block Tags (U+E0000 to U+E007F) contains characters that were originally intended to carry language and other meta-information in plain text by means of tags. These characters are now deprecated in favor of higher-level protocols such as XML. 95 of these characters correspond to the printable characters of the ASCII standard; in addition there are a few further characters that define the kind of meta-information or the end of its effect. For example, the sequence U+E0001 U+E006A U+E0061 states that the following text is Japanese: U+E0001 introduces a language tag, and the next two characters (after subtracting hexadecimal E0000) read in ASCII as ja, the ISO 639 language code for Japanese.
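The offset arithmetic for language tags can be made concrete with a short Python sketch. The function names are hypothetical, and the sketch only illustrates the encoding scheme for a deprecated mechanism; it is not a recommendation to use it.

```python
TAG_BASE = 0xE0000
LANGUAGE_TAG = "\U000E0001"  # introduces a language tag

def encode_language_tag(code: str) -> str:
    """Map each printable ASCII character of the code into the Tags block."""
    if not all(0x20 <= ord(c) <= 0x7E for c in code):
        raise ValueError("language code must be printable ASCII")
    return LANGUAGE_TAG + "".join(chr(TAG_BASE + ord(c)) for c in code)

def decode_language_tag(tagged: str) -> str:
    """Subtract the offset again to recover the ASCII language code."""
    if not tagged.startswith(LANGUAGE_TAG):
        raise ValueError("missing U+E0001")
    return "".join(chr(ord(c) - TAG_BASE) for c in tagged[1:])

tag = encode_language_tag("ja")
print([f"U+{ord(c):04X}" for c in tag])  # ['U+E0001', 'U+E006A', 'U+E0061']
print(decode_language_tag(tag))          # ja
```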
