Unicode character property
The Unicode Standard not only encodes a very large number of characters, but also establishes, for each of them, a number of properties that describe the character and its behavior. For example, the properties of the letter Ä state that it is a capital letter, that the corresponding lowercase letter is ä, and that it can be decomposed into an A followed by a combining diaeresis.
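As a minimal sketch, the three facts about Ä can be looked up with Python's standard `unicodedata` module. The choice of Python is an assumption of this example; the standard itself does not prescribe any programming interface.

```python
import unicodedata

ch = "\u00C4"  # Ä, LATIN CAPITAL LETTER A WITH DIAERESIS

name = unicodedata.name(ch)
category = unicodedata.category(ch)            # "Lu" means uppercase letter
lowercase = ch.lower()                         # lowercase mapping, here "ä"
decomposition = unicodedata.decomposition(ch)  # hex code points of the parts

print(name, category, lowercase, decomposition)
```

The decomposition is reported as the hexadecimal code points of A (U+0041) and the combining diaeresis (U+0308).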
General
Formally, Unicode properties are defined as mappings from code points to a particular range of values. The data are provided in a number of simple text files as well as in an XML file.
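To illustrate what such a text file looks like, the following sketch parses one semicolon-separated record in the style of UnicodeData.txt. The sample line and the selection of fields are assumptions made for this example; the helper name `parse_unicode_data_line` is hypothetical.

```python
from typing import NamedTuple


class UnicodeDataRecord(NamedTuple):
    code_point: int
    name: str
    general_category: str
    decomposition: str
    simple_lowercase: str


def parse_unicode_data_line(line: str) -> UnicodeDataRecord:
    """Parse one record with 15 semicolon-separated fields (assumed layout)."""
    fields = line.rstrip("\n").split(";")
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")
    return UnicodeDataRecord(
        code_point=int(fields[0], 16),  # field 0: code point in hex
        name=fields[1],                 # field 1: character name
        general_category=fields[2],     # field 2: General_Category
        decomposition=fields[5],        # field 5: decomposition mapping
        simple_lowercase=fields[13],    # field 13: simple lowercase mapping
    )


# Illustrative sample record for U+00C4 (not read from a real file here).
SAMPLE_LINE = (
    "00C4;LATIN CAPITAL LETTER A WITH DIAERESIS;Lu;0;L;0041 0308;;;;N;"
    "LATIN CAPITAL LETTER A DIAERESIS;;;00E4;"
)

record = parse_unicode_data_line(SAMPLE_LINE)
print(record)
```

Empty fields stand for the default value of the respective property, which is why most of the record consists of bare semicolons.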
Values
Different value ranges are possible depending on the property. Most properties are enumerated properties, whose value range is a fixed, finite set. Two special kinds of enumerated properties are catalog properties and binary properties. Catalog properties are characterized by the fact that their set of possible values grows gradually with new Unicode versions. Binary properties are enumerated properties with exactly two values, true (Y) and false (N). They indicate whether or not the property applies to a character.
In addition, there are string properties, which assign each character a string of Unicode characters; numeric properties, which assign each character a number; and miscellaneous properties that cannot be assigned to any of these categories.
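The different kinds of value ranges can be seen side by side in a small sketch based on Python's `unicodedata` module. The function name `sample_properties` and the choice of one property per kind are assumptions of this example; returning `None` for a missing numeric value is a convenience of the sketch, not the standard's own default.

```python
import unicodedata


def sample_properties(ch: str) -> dict:
    """Return one example property of each kind for a single character."""
    return {
        # Enumerated: one value from a fixed set such as "Lu", "Nd", "Ps".
        "General_Category": unicodedata.category(ch),
        # Binary: the property either applies or it does not.
        "Bidi_Mirrored": bool(unicodedata.mirrored(ch)),
        # String: the value is itself a string of characters.
        "Uppercase_Mapping": ch.upper(),
        # Numeric: the value is a number (None here if there is none).
        "Numeric_Value": unicodedata.numeric(ch, None),
    }


for ch in ("(", "\u00BD", "\u00E4"):  # "(", "½", "ä"
    print(repr(ch), sample_properties(ch))
```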
Defaults
Properties have one or more default values, for several reasons. First, the default value is often omitted from the data tables to keep them readable. Second, programs must also handle text created under a newer Unicode version, which may therefore contain characters that were not yet assigned when the program was developed. For enumerated properties, one value is usually designated as the default; in a few cases there are several default values, assigned depending on the block. For binary properties, the default is always N, that is, the property does not apply.
For string properties, the default value is always the character itself.
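The effect of defaults can be observed by querying a code point that has no character assigned. The sketch below uses U+0378 on the assumption that it is still unassigned in the Unicode version bundled with the Python interpreter in use.

```python
import unicodedata

# Assumption: U+0378 is unassigned in the bundled Unicode version.
unassigned = "\u0378"

category = unicodedata.category(unassigned)      # default category "Cn"
is_mirrored = unicodedata.mirrored(unassigned)   # binary default: 0 (N)
lowered = unassigned.lower()                     # string default: itself
name = unicodedata.name(unassigned, "<no name>") # no name, fallback is used

print(category, is_mirrored, lowered == unassigned, name)
```

A program written this way keeps working when a later Unicode version assigns the code point; it simply sees the defaults until its data is updated.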
Aliases
Many properties have one or more aliases in addition to their actual name. These are often abbreviations. Short aliases are also often defined for the possible values of enumerated properties.
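A lookup of aliases might be sketched as follows. The two tables contain only a handful of entries chosen for illustration, and the matching rule (ignoring case, spaces, underscores and hyphens) is a simplified assumption modelled on the loose matching described in UAX #44; the function names are hypothetical.

```python
# Small illustrative excerpts; the real alias files list many more entries.
PROPERTY_ALIASES = {
    "gc": "General_Category",
    "sc": "Script",
    "ccc": "Canonical_Combining_Class",
}
GENERAL_CATEGORY_VALUE_ALIASES = {
    "Lu": "Uppercase_Letter",
    "Ll": "Lowercase_Letter",
    "Nd": "Decimal_Number",
    "Zs": "Space_Separator",
}


def loose_key(name: str) -> str:
    """Simplified loose matching: drop separators and ignore case."""
    return name.replace("_", "").replace("-", "").replace(" ", "").lower()


def resolve(name: str, aliases: dict) -> str:
    """Map a short or long name, loosely spelled, to the long name."""
    wanted = loose_key(name)
    for short_name, long_name in aliases.items():
        if wanted in (loose_key(short_name), loose_key(long_name)):
            return long_name
    raise KeyError(name)


print(resolve("GC", PROPERTY_ALIASES))
print(resolve("uppercase letter", GENERAL_CATEGORY_VALUE_ALIASES))
```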
Status
Many properties are normative, i.e. binding for programs that conform to the Unicode Standard and interpret the property. Other properties are marked as informative and serve only as additional, non-binding information. One group of properties is marked as contributory. These properties should not be used on their own, but were defined so that other properties can be derived from them. They usually list a set of exceptional characters that would not otherwise be covered. Finally, there are provisional properties, which have been included tentatively to see whether they prove useful in practice.
Some properties are additionally marked as deprecated ("obsolete"); these should no longer be used, for various reasons, but remain in the Unicode Standard for backward compatibility.
Stability
To ensure backward compatibility, some properties, once set for a character, are never changed, or are changed only in certain ways defined in advance. For example, it is stipulated that the name of a character is never changed, even if it turns out to be wrong.
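A well-known instance of this rule is U+FE18, whose name contains the misspelling BRAKCET. The sketch below, which assumes Python 3.3 or later so that `unicodedata.lookup` also resolves formal name aliases, shows that the misspelled name is still the one reported, while the corrected spelling is accepted as an alias.

```python
import unicodedata

ch = "\uFE18"

# The original, misspelled name is retained for stability.
name = unicodedata.name(ch)
print(name)

# The corrected spelling exists only as a formal alias of the same character.
corrected = "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET"
same_character = unicodedata.lookup(corrected) == ch
print(same_character)
```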
Properties
The following lists cover all Unicode properties, grouped as in the official documentation, as of Unicode 6.3. Given for each are the name of the property, an abbreviated alias (if any), the status of the property, the type of its value range, and a description.
General
The general properties give a rough overview of a character. They are used, among other things, in regular expressions, provided the regular expression engine supports querying Unicode properties, as Perl's does.
General category
The General_Category property is one of the basic properties and is used both in the Unicode Standard itself and in many other technical documents. It classifies all characters by their principal use into letters, numbers, punctuation, and so on. The following table lists the possible values.
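Independently of the full list of values, the first letter of each two-letter value identifies the major class of a character. The following sketch uses this to count the classes occurring in a string; the function name `major_class_counts` and the sample text are assumptions of the example.

```python
import unicodedata
from collections import Counter

# First letter of the two-letter General_Category value -> major class.
MAJOR_CLASSES = {
    "L": "Letter",
    "M": "Mark",
    "N": "Number",
    "P": "Punctuation",
    "S": "Symbol",
    "Z": "Separator",
    "C": "Other",
}


def major_class_counts(text: str) -> Counter:
    """Count how many characters of each major class occur in text."""
    return Counter(MAJOR_CLASSES[unicodedata.category(ch)[0]] for ch in text)


counts = major_class_counts("\u00C4 = 42!")  # "Ä = 42!"
print(counts)
```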
Case
Many properties deal with case. They determine whether a character is an uppercase or lowercase letter, which lowercase letter belongs to a given uppercase letter and vice versa, and more. To compare strings regardless of case, a normal form known as case folding is defined. These properties are used, among other things, by the various Unicode casing algorithms.
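A case-insensitive comparison based on case folding can be sketched with Python's `str.casefold`. The helper name `equal_ignoring_case` is hypothetical, and the sketch deliberately leaves out normalization, which a full caseless match would also need.

```python
def equal_ignoring_case(a: str, b: str) -> bool:
    """Compare two strings by their case-folded forms."""
    return a.casefold() == b.casefold()


# Case folding maps "ß" to "ss", which plain lowercasing does not.
print(equal_ignoring_case("Stra\u00DFe", "STRASSE"))
print("Stra\u00DFe".lower() == "STRASSE".lower())
```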
Numerical
The following properties deal with the numeric properties of characters, in particular those of the digit and number characters in Unicode.
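Not every character with a numeric value is a decimal digit. The sketch below queries the three numeric lookups offered by Python's `unicodedata` module for a few characters; the helper name `numeric_info` is an assumption of the example, and `None` marks a lookup that has no value.

```python
import unicodedata


def numeric_info(ch: str) -> tuple:
    """Return (decimal, digit, numeric) values, None where not defined."""
    return (
        unicodedata.decimal(ch, None),  # usable in positional decimal numbers
        unicodedata.digit(ch, None),    # digit-like, e.g. superscripts
        unicodedata.numeric(ch, None),  # any numeric value, incl. fractions
    )


for ch in ("7", "\u0663", "\u00B2", "\u00BD", "\u216B", "A"):
    print(f"U+{ord(ch):04X}", numeric_info(ch))
```

The characters tested are the ASCII digit 7, ARABIC-INDIC DIGIT THREE, SUPERSCRIPT TWO, VULGAR FRACTION ONE HALF, ROMAN NUMERAL TWELVE and the letter A.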
Normalization
A number of properties deal with the different forms of normalization of Unicode text.
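As a brief sketch of how these properties show up in practice, Python's `unicodedata.normalize` applies the normalization forms, and `unicodedata.combining` returns the canonical combining class that governs the ordering of combining marks. The sample characters are chosen for illustration.

```python
import unicodedata

composed = "\u00C4"  # Ä as a single code point

decomposed = unicodedata.normalize("NFD", composed)    # "A" + U+0308
recomposed = unicodedata.normalize("NFC", decomposed)  # back to U+00C4

# Canonical combining class of the combining diaeresis.
ccc = unicodedata.combining("\u0308")

# Compatibility normalization also replaces the "fi" ligature U+FB01.
compat = unicodedata.normalize("NFKC", "\uFB01")

print([f"U+{ord(c):04X}" for c in decomposed], ccc, compat)
```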
Representation
The following properties play a role in the display of text.
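One display-related property that Python's `unicodedata` module exposes is East_Asian_Width; that it belongs to this group is an assumption of the example. The sketch estimates how many terminal columns a string occupies by counting wide and fullwidth characters as two columns. This is a common heuristic, not a rule laid down by the standard, and the function name `display_width` is hypothetical.

```python
import unicodedata


def display_width(text: str) -> int:
    """Estimate column width: wide (W) and fullwidth (F) count double."""
    width = 0
    for ch in text:
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2
        else:
            width += 1
    return width


sample = "A\u4E2D\uFF21"  # "A", "中", fullwidth "Ａ"
print([unicodedata.east_asian_width(ch) for ch in sample])
print(display_width(sample))
```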
Bidi
The following properties are available for displaying bidirectional text.
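The bidirectional class and the mirroring flag of individual characters can be inspected as in the following sketch. The helper `contains_rtl` is a hypothetical convenience that only detects strongly right-to-left characters; it does not implement the bidirectional algorithm itself.

```python
import unicodedata


def contains_rtl(text: str) -> bool:
    """True if any character has a strong right-to-left class (R or AL)."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)


samples = {
    "A": "Latin letter",
    "\u05D0": "Hebrew letter alef",
    "\u0627": "Arabic letter alef",
    "1": "European digit",
    "\u0661": "Arabic-Indic digit one",
}
for ch, label in samples.items():
    print(label, unicodedata.bidirectional(ch))

# Brackets are mirrored in right-to-left runs, letters are not.
print(unicodedata.mirrored("("), unicodedata.mirrored("A"))
```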
Identifiers
The following properties provide a way to define the characters allowed in identifiers. In contrast to traditional programming languages that do not allow non-ASCII characters, languages that use these properties permit a large share of Unicode characters in identifiers. One example of a language whose syntax largely allows this is JavaScript.
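Python is another language whose identifier syntax is documented as being based on the XID_Start and XID_Continue properties, so `str.isidentifier` can serve as a quick sketch of the idea. The candidate strings are chosen for illustration.

```python
candidates = [
    "gr\u00F6\u00DFe",   # "größe": non-ASCII letters are allowed
    "na\u00EFve_value",  # "naïve_value"
    "\u4E2D\u6587",      # "中文": CJK ideographs are allowed as well
    "2fast",             # a digit may continue but not start an identifier
    "a-b",               # the hyphen is not an identifier character
]

results = {name: name.isidentifier() for name in candidates}
for name, ok in results.items():
    print(name, ok)
```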
CJK
Several properties apply to CJK characters. There are also a number of further properties; see the section Unihan.
Others
Some properties serve primarily to provide information about a character and are not intended for specific applications.
Contributing properties
These properties are not used on their own, but serve to derive other properties. Most of them are sets of exceptions that are not covered by the general category.
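The derivation can be sketched for the Alphabetic property, which combines letter-like general categories with the contributory property Other_Alphabetic. The sketch is simplified: it ignores the Other_Lowercase and Other_Uppercase contributions, and the exception set contains a single illustrative member rather than the real list. The names `OTHER_ALPHABETIC_SAMPLE` and `is_alphabetic_sketch` are hypothetical.

```python
import unicodedata

# Illustrative excerpt only; the real Other_Alphabetic set is much larger.
OTHER_ALPHABETIC_SAMPLE = frozenset({"\u0345"})  # COMBINING GREEK YPOGEGRAMMENI

# General categories that count as alphabetic without any exception list.
LETTER_LIKE_CATEGORIES = frozenset({"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"})


def is_alphabetic_sketch(ch: str, other_alphabetic=OTHER_ALPHABETIC_SAMPLE) -> bool:
    """Simplified derivation: category-based rule plus exception set."""
    return (
        unicodedata.category(ch) in LETTER_LIKE_CATEGORIES
        or ch in other_alphabetic
    )


for ch in ("a", "\u0345", "\u0301", "!"):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), is_alphabetic_sketch(ch))
```

U+0345 and U+0301 share the general category Mn, so only the exception set distinguishes them.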
Unihan
For CJK characters that were encoded in Unicode as part of Han unification, there is a separate database providing properties specific to these characters. The source information gives the character's encoding in various national character sets. In addition to the properties listed here, there are a number of provisional properties that supply further information about pronunciation, meaning, alternative encodings, and so on.
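The data files of this database use a simple tab-separated layout of code point, property name and value. The sketch below parses a few such lines; the sample entries and the field names `kTotalStrokes` and `kMandarin` are assumptions added for illustration, and the function name `parse_unihan_lines` is hypothetical.

```python
from collections import defaultdict


def parse_unihan_lines(lines) -> dict:
    """Collect 'U+XXXX<TAB>field<TAB>value' lines into nested dicts."""
    table = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        code, field, value = line.split("\t", 2)
        table[int(code[2:], 16)][field] = value  # drop the "U+" prefix
    return dict(table)


# Illustrative sample lines, not read from the actual database files.
SAMPLE_LINES = [
    "# sample data for illustration",
    "U+4E00\tkTotalStrokes\t1",
    "U+4E00\tkMandarin\ty\u012B",
    "U+4E8C\tkTotalStrokes\t2",
]

table = parse_unihan_lines(SAMPLE_LINES)
print(table)
```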
Sources
- Mark Davis, Ken Whistler: Unicode Standard Annex #44: Unicode Character Database.
- John H. Jenkins, Richard Cook, Ken Lunde: Unicode Standard Annex #38: Unicode Han Database.
- Ken Whistler, Asmus Freytag: Unicode Technical Report #23: The Unicode Character Property Model.