Unicode character property

The Unicode Standard not only encodes a very large number of characters, but also defines, for each of these characters, a number of properties that describe the character and its behavior. For example, the properties of the letter Ä state that it is a capital letter, that the corresponding lowercase letter is ä, and that it can be decomposed into an A and a combining diaeresis.
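The three facts about Ä mentioned above can be looked up with Python's standard `unicodedata` module. This is only an illustrative sketch; the variable names are arbitrary, and the values depend on the Unicode version bundled with the interpreter (these particular ones are long-standing).

```python
import unicodedata

ch = "Ä"  # U+00C4 LATIN CAPITAL LETTER A WITH DIAERESIS

category = unicodedata.category(ch)            # General_Category, short alias
lowercase = ch.lower()                         # corresponding lowercase letter
decomposition = unicodedata.decomposition(ch)  # hex code points, space separated

print(category)       # Lu (uppercase letter)
print(lowercase)      # ä
print(decomposition)  # 0041 0308 (A followed by combining diaeresis)
```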


General

Formally, Unicode properties are defined as mappings from code points to a particular range of values. The data are provided in a number of simple text files as well as in an XML file.

Values

Depending on the property, several kinds of value range are possible. Most properties are enumerated properties, whose value range consists of a fixed set. Enumerated properties are further subdivided into catalog properties and binary properties. Catalog properties are characterized by the fact that the set of possible values grows gradually with new Unicode versions. Binary properties are enumerated properties with exactly two values, true (Y) and false (N), which indicate whether or not the property applies to the character.

In addition, there are string properties, which assign each character a string of Unicode characters, numeric properties, which assign each character a number, and other properties that cannot be assigned to any of these categories.

Defaults

For a number of reasons, properties have one or more default values. First, the default value is often omitted from the data tables to keep them clearer. Second, programs must also handle text created with a newer Unicode version, which may therefore contain characters that were not yet assigned when the program was developed. For enumerated properties, one value is usually designated as the default; in a few cases there are several default values, which are assigned depending on the block. For binary properties, the default is always N, that is, false.

For string properties, the default value is always the character itself.
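The effect of defaults can be observed by querying a code point that has no assigned character. The sketch below uses Python's `unicodedata` module and assumes that U+0378 is still unassigned in the Unicode version bundled with the interpreter; the variable names are illustrative only.

```python
import unicodedata

# Assumption: U+0378 is unassigned in this interpreter's Unicode data.
unassigned = "\u0378"

default_category = unicodedata.category(unassigned)  # enumerated property
default_name = unicodedata.name(unassigned, None)    # no name available
default_lower = unassigned.lower()                   # string property
default_mirrored = unicodedata.mirrored(unassigned)  # binary property, 0 means N

print(default_category)             # Cn (unassigned)
print(default_name)                 # None
print(default_lower == unassigned)  # True: maps to itself
print(default_mirrored)             # 0
```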

Aliases

In addition to their actual name, many properties also have one or more aliases, which are often abbreviations. Short aliases are also frequently defined for the possible values of enumerated properties.

Status

Many properties are normative, i.e. binding for programs that conform to the Unicode Standard and interpret the property. Other properties are marked as informative and serve only as additional, non-binding information. One group of properties is marked as contributory. These properties should not be used on their own; they were defined so that other properties can be derived from them. They usually specify a set of exceptional characters that would not otherwise be covered. Finally, there are provisional properties, which have been included for the time being to see whether they prove themselves in practice.

Some properties are additionally marked as deprecated ("obsolete"). For various reasons these should no longer be used, but they remain in the Unicode Standard for backward compatibility.

Stability

To ensure backward compatibility, some properties, once they have been set for a character, are never changed, or are changed only in certain ways specified in advance. For example, it is stipulated that the name of a character is never changed, even if it turns out to be wrong.
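A frequently cited illustration of name stability is U+FE18, whose name contains a misspelling of "BRACKET" that has been kept. The sketch below simply reads the name through Python's `unicodedata` module; the choice of this particular character as an example is an addition here, not something taken from the text above.

```python
import unicodedata

# U+FE18: the misspelled word in the name is retained for stability.
name = unicodedata.name("\ufe18")
print(name)
# PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET
```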

Properties

The following lists cover all Unicode properties, grouped as in the official documentation, as of Unicode 6.3. For each property they give the name, an abbreviated alias (if any), the status of the property, the type of its value range, and a description.

General

The general properties give a rough overview of a character. They are used, among other things, in regular expressions, provided the engine supports querying Unicode properties in the way Perl does.

General category

The General_Category property is one of the basic properties and is used both in the Unicode Standard itself and in many other technical documents. It classifies all characters according to their principal use as letters, numbers, punctuation, and so on. The following table lists the possible values.
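As a rough illustration, Python's `unicodedata.category` returns the short alias of the General_Category value for a character. The sample characters below are an arbitrary selection and cover only a few of the possible values.

```python
import unicodedata

# A handful of sample characters; the selection is illustrative only.
samples = ["A", "a", "5", " ", "!", "€", "\u0301"]

categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, cat in categories.items():
    print(f"U+{ord(ch):04X} {cat}")
# U+0041 Lu  (uppercase letter)
# U+0061 Ll  (lowercase letter)
# U+0035 Nd  (decimal number)
# U+0020 Zs  (space separator)
# U+0021 Po  (other punctuation)
# U+20AC Sc  (currency symbol)
# U+0301 Mn  (nonspacing mark)
```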

Case

Many properties are concerned with case. They determine whether a character is an uppercase or lowercase letter, which lowercase letter corresponds to a given uppercase letter and vice versa, and more. To allow strings to be compared independently of case, a normal form known as case folding is defined. These properties are used, among other things, by the various Unicode casing algorithms.
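The difference between simple lowercasing and case folding shows up with the German ß. The following sketch uses only built-in Python string methods, which draw on the Unicode case properties; the example words and variable names are illustrative.

```python
a = "Straße"
b = "STRASSE"

is_upper = "Ä".isupper()   # uppercase property of a single letter
lower_of_upper = "Ä".lower()

# Lowercasing alone does not make the two spellings equal,
# because ß stays ß while SS becomes ss.
equal_by_lower = a.lower() == b.lower()

# Case folding maps ß to ss, so the comparison succeeds.
equal_by_fold = a.casefold() == b.casefold()

print(is_upper, lower_of_upper)  # True ä
print(equal_by_lower)            # False
print(equal_by_fold)             # True
```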

Numerical

The following properties are concerned with the numeric properties of characters, in particular of the numerals encoded in Unicode.
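Python's `unicodedata` module exposes three lookups that reflect the numeric properties: `decimal`, `digit` and `numeric`. The sketch below shows how they differ for an ordinary digit, a superscript digit, a fraction and a Roman numeral; the variable names are illustrative.

```python
import unicodedata

five_decimal = unicodedata.decimal("5")       # ordinary decimal digit
sup_two_digit = unicodedata.digit("²")        # a digit, but not a decimal digit
sup_two_decimal = unicodedata.decimal("²", None)  # default returned instead of an error
half_value = unicodedata.numeric("½")         # fractions only have a numeric value
roman_twelve = unicodedata.numeric("\u216b")  # U+216B ROMAN NUMERAL TWELVE

print(five_decimal)     # 5
print(sup_two_digit)    # 2
print(sup_two_decimal)  # None
print(half_value)       # 0.5
print(roman_twelve)     # 12.0
```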

Normalization

A number of properties concern the different forms of normalization of Unicode text.
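As a small sketch of what these properties drive, `unicodedata.normalize` in Python converts between the composed and decomposed forms of a character, and `unicodedata.combining` returns the canonical combining class used when reordering marks. The variable names are illustrative.

```python
import unicodedata

composed = "\u00c4"  # Ä as a single code point

decomposed = unicodedata.normalize("NFD", composed)    # A + combining diaeresis
recomposed = unicodedata.normalize("NFC", decomposed)  # back to one code point
ligature = unicodedata.normalize("NFKC", "\ufb01")     # compatibility mapping of the fi ligature
ccc = unicodedata.combining("\u0308")                  # canonical combining class

print([f"U+{ord(c):04X}" for c in decomposed])  # ['U+0041', 'U+0308']
print(recomposed == composed)                   # True
print(ligature)                                 # fi
print(ccc)                                      # 230
```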

Representation

The following properties play a role in the appearance of text.

Bidi

The following properties are available for the display of bidirectional text.
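Two of the bidirectional properties can be inspected from Python: `unicodedata.bidirectional` returns the short alias of the bidirectional class, and `unicodedata.mirrored` reports whether a character is mirrored in right-to-left text. The sample characters are an arbitrary selection.

```python
import unicodedata

bidi_classes = {
    "A": unicodedata.bidirectional("A"),            # Latin letter
    "alef": unicodedata.bidirectional("\u05d0"),    # Hebrew letter alef
    "ain": unicodedata.bidirectional("\u0639"),     # Arabic letter ain
    "1": unicodedata.bidirectional("1"),            # European digit
    "space": unicodedata.bidirectional(" "),        # whitespace
}
paren_mirrored = unicodedata.mirrored("(")          # 1 means the glyph is mirrored

print(bidi_classes)
# {'A': 'L', 'alef': 'R', 'ain': 'AL', '1': 'EN', 'space': 'WS'}
print(paren_mirrored)  # 1
```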

Identifiers

The following properties provide a way to define which characters are allowed in identifiers. Unlike traditional programming languages, which allow only ASCII characters in identifiers, languages that use these properties allow a very large number of Unicode characters. One example of a language whose syntax largely permits this is JavaScript.
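Python is another language whose identifier syntax is based on these properties, so its built-in `str.isidentifier` can serve as a quick illustration. The sketch below is about Python's rules, not JavaScript's, and the candidate strings are made up for the example.

```python
# Candidate identifiers; the strings are illustrative only.
candidates = ["Äpfel", "naïve", "变量", "_tmp1", "2fast", "a-b"]

results = {name: name.isidentifier() for name in candidates}

for name, ok in results.items():
    print(f"{name!r}: {ok}")
# 'Äpfel': True
# 'naïve': True
# '变量': True
# '_tmp1': True
# '2fast': False   (a digit may continue an identifier but not start one)
# 'a-b': False     (the hyphen is not an identifier character)
```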

CJK

Several properties apply to CJK characters. There are also a number of further properties; see the section Unihan.

Others

Some properties primarily provide information about a character without being intended for specific applications.

Contributing properties

These properties are not used on their own, but serve to derive other properties. Most of them are sets of exceptions that are not covered by the general category.

Unihan

For the CJK characters that were included in Unicode as part of Han unification, there is a separate database that provides properties specific to these characters. The source properties give the encoding of the character in various national character sets. In addition to the properties listed here, there are a number of further provisional properties that supply additional information on pronunciation, meaning, alternative encodings, and so on.

Sources

  • Mark Davis, Ken Whistler: Unicode Standard Annex #44: Unicode Character Database.
  • John H. Jenkins, Richard Cook, Ken Lunde: Unicode Standard Annex #38: Unicode Han Database.
  • Ken Whistler, Asmus Freytag: Unicode Technical Report #23: The Unicode Character Property Model.