Unicode

Unicode (pronunciations: American English [ˈjuːnikoʊd], British English [ˈjuːnikəʊd], German [ˈjuːnikoːt]) is an international standard whose long-term goal is to assign a digital code to every meaningful character or text element of all known writing systems and sign systems. It is intended to eliminate the use of different, incompatible encodings in different countries and cultures. Unicode is continually extended with characters from further writing systems.

ISO 10646 is ISO's practically synonymous name for the Unicode character set; there it is designated the Universal Character Set (UCS).


History

Conventional computer character sets comprise only a limited repertoire of characters; in Western character encodings this limit is usually 128 (7-bit) code positions, as in the well-known ASCII standard, or 256 (8-bit) positions, as in ISO 8859-1 (also known as Latin-1) or EBCDIC. After deducting the control characters, ASCII can represent 95 and the 8-bit ISO character sets 191 letters and special characters. These character encodings allow only a few languages to be represented in the same text, unless one makes do by using different fonts with different character sets within a single text. This significantly hindered international data exchange in the 1980s and 1990s.

ISO 2022 was a first attempt to represent multiple languages with a single character encoding. It used escape sequences to switch between different character sets (e.g. between Latin-1 and Latin-2). However, the system caught on only in East Asia.

In 1988, Joseph D. Becker of Xerox wrote the first draft of a universal character set. According to the original plans, this 16-bit character set was to encode only the characters of modern languages:

"Unicode gives higher priority to Ensuring utility for the future than to preserving past antiquities. Unicode AIMS in the first instance at the characters published in modern text (eg in the union of all newspapers and magazines printed in the world in 1988), Whose number is undoubtedly, far below 214 = 16,384. Beyond Those modern -use characters, all others ' may be defined to be obsolete or rare, thesis are better candidates for private -use registration than for congesting the public list of gene - rally useful Unicode. "

" Unicode provides a higher claim on ensuring the usability for the future to get as past antiquities. Unicode aims primarily to all the characters that are published in modern texts (about in all the newspapers and magazines in the world of the year 1988), whose number is undoubtedly far below 214 = 16,384. Other characters that go beyond these today characters can be considered obsolete or rare, they should be better registered a private mode, rather than the public list of generally useful Unicodes to overfill. "

In October 1991, after several years of development, version 1.0.0 of the Unicode standard was published; at that time only the European, Middle Eastern and Indian scripts were encoded. Only eight months later, after the Han unification was completed, version 1.0.1 followed, encoding East Asian characters for the first time. With the publication of Unicode 2.0 in July 1996, the code space was extended from the original 65,536 to today's 1,114,112 code points (U+0000 to U+10FFFF).

Versions

Content of the standard

The Unicode Consortium publishes several documents in support of Unicode. Besides the actual character set, these include other documents that are not strictly necessary but are helpful for interpreting the Unicode standard.

Structure

In contrast to earlier character encodings, which usually encoded only one particular writing system, the goal of Unicode is to encode all characters and writing systems in use. The code space is divided into 17 planes, each comprising 2^16 = 65,536 code points. Six of these planes are already in use; the rest are reserved for future use (a short sketch of the plane arithmetic follows the list below):

  • The Basic Multilingual Plane (BMP, also known as Plane 0) mainly contains writing systems currently in use, punctuation and symbols, control characters, the surrogate code points and a Private Use Area (PUA). The plane is highly fragmented and largely occupied, so newly encoded writing systems can no longer find a place here. Access to planes other than the BMP is still impossible or only partly possible in some programs.
  • The Supplementary Multilingual Plane (SMP, Plane 1) was introduced with Unicode 3.1. It contains mainly historical writing systems, but also larger collections of characters that are rarely in use, such as domino and mahjong tiles and emoji. Meanwhile, writing systems that are still in use but find no place in the BMP are also encoded in the SMP.
  • The Supplementary Ideographic Plane (SIP, Plane 2), likewise introduced with Unicode 3.1, contains only CJK characters that are rarely used; these include, among others, the Chữ Nôm characters formerly used in Vietnam. Should this plane not suffice, Plane 3 is reserved for further CJK characters.
  • The Supplementary Special-purpose Plane (SSP, Plane 14) contains a few control characters for language tagging.
  • The last two planes, Supplementary Private Use Area-A and -B (Planes 15 and 16), are available as Private Use Areas (PUA). They are also sometimes referred to as Private Use Planes (PUP).
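As a minimal illustration of this layout (the helper name is invented for this example and is not part of the standard), the plane of a code point is simply its value divided by 2^16:

    # Hypothetical helper: compute the Unicode plane of a code point.
    def plane(cp: int) -> int:
        assert 0 <= cp <= 0x10FFFF, "outside the Unicode code space"
        return cp >> 16  # each plane spans 2**16 code points

    print(plane(0x00DF))   # 0  -> BMP
    print(plane(0x1D11E))  # 1  -> SMP
    print(plane(0xE0001))  # 14 -> SSP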

Within these planes, related characters are grouped into blocks. Usually a Unicode block covers one writing system; for historical reasons, however, a certain degree of fragmentation has set in, since characters were often added later and accommodated in other blocks as supplements.

Code points and characters

Each elementary character encoded in the Unicode standard is assigned a code point. These are usually written in hexadecimal (at least four digits, i.e. with leading zeros) and prefixed with U+, e.g. U+00DF for the ß.

The entire range described by the Unicode standard comprises 1,114,112 code points (U+0000 ... U+10FFFF; 17 planes of 2^16 = 65,536 code points each). In some ranges, however, the standard does not make code points available for character encoding:

  • 2,048 code points in the range U+D800 ... U+DFFF are used in the UTF-16 encoding scheme as parts of surrogate pairs to represent code points above the BMP (i.e. in the range U+10000 ... U+10FFFF) and are therefore not themselves available as code points for individual characters.
  • 66 code points, 32 in the range U+FDD0 ... U+FDEF and 2 at the end of each of the 17 planes (i.e. U+FFFE and U+FFFF, U+1FFFE and U+1FFFF, ..., U+10FFFE and U+10FFFF), are reserved for process-internal uses and are not intended for use as individual characters.

This leaves a total of 1,111,998 code points available for character encoding. The number of actually assigned code points is considerably lower, however; Tables D-2 and D-3 in Appendix D of the Unicode standard give an overview of how many code points are assigned in the various versions and how they are used.
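These exclusions can be sketched in a few lines of Python (the function names are invented for this example); the count at the end reproduces the figure of 1,111,998 given above:

    # Hypothetical helpers: classify the code points that the standard
    # excludes from character assignment.
    def is_surrogate(cp: int) -> bool:
        return 0xD800 <= cp <= 0xDFFF  # reserved for UTF-16 surrogate pairs

    def is_noncharacter(cp: int) -> bool:
        # 32 noncharacters U+FDD0..U+FDEF plus the last two code points
        # of each of the 17 planes.
        return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

    usable = sum(1 for cp in range(0x110000)
                 if not is_surrogate(cp) and not is_noncharacter(cp))
    print(usable)  # 1114112 - 2048 - 66 = 1111998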

PUA ( "Private Use Area ", privately -usable area )

Special areas are reserved for private use; code points in these areas will never be assigned to characters standardized in Unicode. They can be used for privately defined characters, which must be agreed individually between the producers and users of the texts containing them. These areas are (a short membership test follows the list):

  • In the BMP: U+E000 ... U+F8FF
  • In other planes: U+F0000 ... U+FFFFD and U+100000 ... U+10FFFD
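A minimal check for these ranges (the function name is invented for this example):

    # Hypothetical helper: is a code point in one of the Private Use Areas?
    def is_private_use(cp: int) -> bool:
        return (0xE000 <= cp <= 0xF8FF           # PUA in the BMP
                or 0xF0000 <= cp <= 0xFFFFD      # Plane 15 (PUA-A)
                or 0x100000 <= cp <= 0x10FFFD)   # Plane 16 (PUA-B)

    print(is_private_use(0xF000))  # True (cf. the logo example below)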

For the PUA of the BMP, specific conventions have developed that predefine character assignments for particular applications. On the one hand, precomposed characters consisting of a base character and diacritics are frequently found here, since in many (especially older) software applications it cannot be assumed that such characters are displayed correctly according to the Unicode rules when entered as a sequence of base character and diacritic. On the other hand, there are characters that do not meet the rules for inclusion in Unicode, or whose application for inclusion in Unicode was unsuccessful or never made for other reasons. Thus, many fonts place a manufacturer's logo at position U+F000 (logos are, as a matter of principle, not encoded in Unicode).

Significant sources of PUA characters are:

  • MUFI (Medieval Unicode Font Initiative)
  • SIL PUA for special letters of various minority languages
  • Languagegeek for indigenous languages of North America
  • ConScript for invented writing systems such as Klingon

Encodings

In addition to the character set proper, a number of character encodings are defined that implement the Unicode character set and provide full access to all Unicode characters. They are called Unicode Transformation Formats (UTF); the most widespread are UTF-16, which has established itself as the internal character representation of some operating systems (Windows, OS X) and software development frameworks (Java, .NET), and UTF-8, which plays the major role in operating systems (GNU/Linux, Unix) and various internet services (email, WWW). The UTF-EBCDIC encoding is defined on the basis of the proprietary EBCDIC format of IBM mainframes. Punycode serves to encode domain names containing non-ASCII characters. The Standard Compression Scheme for Unicode is an encoding format that compresses texts at the same time. Other formats for encoding Unicode characters include CESU-8 and GB 18030.
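How the same character comes out in the two most widespread formats can be sketched with Python's built-in codecs (Python 3.8 or later is assumed for the bytes.hex() separator argument):

    # Encode one BMP character and one character above the BMP in the
    # two most common Unicode Transformation Formats.
    for ch in ("\u00DF", "\U0001D11E"):          # ß and the treble clef
        print(f"U+{ord(ch):04X}",
              ch.encode("utf-8").hex(" "),       # 2 resp. 4 bytes
              ch.encode("utf-16-be").hex(" "))   # 2 bytes resp. a surrogate pair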

Normalization

Many characters included in the Unicode standard are so-called compatibility characters, which from the Unicode point of view can already be represented with other characters or character sequences encoded in Unicode, e.g. the German umlauts, which in theory can be represented with a sequence of base letter and combining diaeresis (the horizontal pair of dots). In Unicode normalization, compatibility characters are automatically replaced by the sequences provided for in Unicode. This considerably simplifies the processing of Unicode texts, since only one possible combination stands for a particular character, not several different ones.
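For illustration, the normalization forms are available in Python's standard unicodedata module; a minimal sketch:

    import unicodedata

    precomposed = "\u00E4"   # ä as one code point (U+00E4)
    decomposed = "a\u0308"   # a + combining diaeresis (U+0308)

    print(precomposed == decomposed)                                 # False
    print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
    print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True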

Sorting

For many writing systems, the characters in Unicode are not encoded in an order that corresponds to a sorting customary among the users of that writing system. Therefore, when sorting, for example in a database application, the sequence of code points can normally not be used. Moreover, the collation orders of many writing systems are characterized by complex, context-sensitive rules. Here the Unicode Collation Algorithm defines how character strings can be sorted within a particular writing system or across writing systems.

In many cases, however, the order actually to be applied depends on other factors (e.g. the language used): in German, for example, "ä" is sorted, depending on the application, as "ae" or as "a", while in Swedish it sorts after "z" and "å". The Unicode Collation Algorithm is therefore applied only where the order is not determined by more specific conditions.
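This language dependence can be sketched with Python's standard locale module (assuming a Unix-like system on which the de_DE and sv_SE locales are installed; this uses the platform's collation, not a full implementation of the Unicode Collation Algorithm):

    import locale

    words = ["zebra", "ähre", "ast"]

    # German collation: "ä" sorts near "a", i.e. before "z".
    locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")
    print(sorted(words, key=locale.strxfrm))

    # Swedish collation: "ä" sorts after "z".
    locale.setlocale(locale.LC_COLLATE, "sv_SE.UTF-8")
    print(sorted(words, key=locale.strxfrm))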

Standardization institutions

The non-profit Unicode Consortium was founded in 1991 and is responsible for the Unicode industry standard. The international standard ISO 10646 is published by ISO (International Organization for Standardization) in cooperation with IEC. Both institutions work closely together; since 1993 Unicode and ISO 10646 have been practically identical as regards the character encoding. While ISO 10646 merely specifies the actual character encoding, Unicode includes a comprehensive set of rules which, among other things, unambiguously defines important properties for all characters, such as sort order, reading direction and rules for combining characters.

For some time now, the code space of ISO 10646 has corresponded exactly to that of Unicode, since there too the code space has been limited to 17 planes, representable with 21 bits.

Coding criteria

Compared with other standards, Unicode has the special feature that characters, once encoded, are never removed, in order to guarantee the longevity of digital data. If the standardization of a character later turns out to be a mistake, its use is at most discouraged. The introduction of a character into the standard therefore requires an extremely careful examination, which can drag on for years.

In Unicode, only " abstract character" (English: characters) are encoded, but not the graphical representation (glyphs ) of these characters, which may be drastically different from font to font, the Latin alphabet in the form of Antiqua, fracture, Irish writing or the various manuscripts. For glyphs, whose normalization is shown to be useful and necessary, but 256 "Variation Selectors " are reserved precaution that can be adjusted, if necessary, the actual code. In many systems of writing characters also can take different forms depending on the position or form ligatures. With some exceptions (eg Arabic) such variants are also not included in the Unicode standard, but it is a so-called smart font technology as provided OpenType, which can replace the forms appropriate.

On the other hand, identical glyphs are encoded several times if they have different meanings, for instance the glyphs А, В, Е, K, М, Н, О, Р, Т and Х, which occur, partly with different meanings, in the Latin as well as in the Greek and Cyrillic alphabets.

In borderline cases, the decision whether something is a glyph variant or a genuinely different character (grapheme) deserving its own encoding is fought over hard. For example, quite a few experts are of the opinion that the Phoenician alphabet could be regarded as glyphs of the Hebrew alphabet, since the entire Phoenician character set has clear correspondences there and the two languages are very closely related. In the end, however, the view prevailed that they are separate sign systems, called "scripts" in Unicode terminology.

The situation is different for CJK (Chinese, Japanese and Korean): here the shapes of many characters with the same meaning have diverged over the past centuries. Nevertheless, the language-specific glyphs share the same codes in Unicode (with the exception of some characters retained for compatibility reasons). In practice, language-specific fonts are therefore predominantly used, which makes the combined space requirement of these scripts high. The uniform encoding of the CJK characters (Han unification) was one of the most important and extensive pieces of preparatory work for the development of Unicode. Particularly in Japan, it remains quite controversial.

When the cornerstone for Unicode was laid, it had to be taken into account that a large number of different encodings were already in use. Unicode-based systems were to be able to handle conventionally encoded data with little effort. For the lower 256 characters, the widespread ISO 8859-1 encoding (Latin-1) was retained, as were the encodings of various national standards, e.g. TIS-620 for Thai (almost identical to ISO 8859-11) or ISCII for Indian scripts, which were merely moved to higher ranges in their original order.

Every character of the relevant traditional encodings has been adopted into the standard, even where it does not meet the criteria normally applied. A large part of these are characters composed of two or more other characters, such as letters with accent marks. Incidentally, a large portion of software still lacks the ability to assemble characters with diacritics properly. The exact determination of equivalent encodings is part of the comprehensive set of rules belonging to Unicode.

In addition, there are many Unicode characters that have no glyph assigned to them and are nevertheless treated as "characters". Thus, apart from control characters such as the tab character (U+0009), line feed (U+000A) etc., 19 different characters are explicitly defined as spaces, including some without width, which among other things are used as word separators for languages such as Thai, which is written without spaces between words. For bidirectional text, e.g. Arabic together with Latin, seven formatting characters are encoded. Beyond that there are further invisible characters that are to be evaluated only in certain circumstances, such as the combining grapheme joiner.
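How such glyph-less characters are classified can be seen, for example, with Python's unicodedata module (a minimal sketch):

    import unicodedata

    # A control character, a width-less space and a bidirectional
    # formatting mark, as mentioned above.
    for cp in (0x0009, 0x200B, 0x200F):
        ch = chr(cp)
        print(f"U+{cp:04X}",
              unicodedata.category(ch),
              unicodedata.name(ch, "<unnamed control character>"))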

Use on computer systems

Code point input methods

Direct input at the operating system level

Microsoft Windows

On Windows (from Windows 2000), in some programs (more precisely, in RichEdit fields), codes can be entered in decimal as Alt+<decimal code> on the numeric keypad (with Num Lock enabled). Note that character numbers smaller than 1000 must be supplemented with a leading zero (e.g. Alt+0234 for code point 234 decimal [ê]). This measure is necessary because the input method Alt+<number without leading zero>, still available in Windows, was already used in MS-DOS times to enter the characters of code page 850 (in earlier MS-DOS versions also code page 437). See also: Alt code.

Another input method requires that in the Windows registry, under HKEY_CURRENT_USER\Control Panel\Input Method, an entry EnableHexNumpad exists and has the value 1 assigned. Unicode characters can then be entered as follows: first press and hold the (left) Alt key, then press and release the plus key on the numeric keypad, and then type the hexadecimal code of the character (the Alt key must be held down until the very end).

In principle, this input method works in every input field of every Windows program, but it can happen that keyboard shortcuts for menu functions prevent the input of hexadecimal code points: if, for example, you want to type the letter Ø (U+00D8), the combination Alt+D in many programs instead causes the File menu to open.

Another drawback is that Windows here demands the explicit specification of the UTF-16 encoding (used internally by Windows) rather than of the Unicode code point itself, and therefore only permits four-digit code values; for characters above the BMP, whose code points have five- or six-digit hexadecimal representations, so-called surrogate pairs must be used instead, in which such a code point is mapped to two four-digit surrogate code values. For instance, the treble clef 𝄞 (U+1D11E) must be entered as the UTF-16 value pair U+D834 U+DD1E; direct input of five- or six-digit code points is thus not possible here.
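The arithmetic behind this surrogate pair can be sketched as follows (the helper name is invented for this example):

    # Hypothetical helper: split a code point above the BMP into the
    # UTF-16 surrogate pair that Windows expects here.
    def to_surrogate_pair(cp: int):
        assert 0x10000 <= cp <= 0x10FFFF
        v = cp - 0x10000                  # 20 bits remain
        high = 0xD800 + (v >> 10)         # upper 10 bits -> high surrogate
        low = 0xDC00 + (v & 0x3FF)        # lower 10 bits -> low surrogate
        return high, low

    print([f"U+{u:04X}" for u in to_surrogate_pair(0x1D11E)])
    # ['U+D834', 'U+DD1E']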

Apple

On Apple computers, the "Unicode Hex Input" keyboard layout must be enabled first. The hexadecimal code is then entered while holding down the Alt (Option) key.

Direct entry into special software

Microsoft Office

In Microsoft Office (from Office XP), Unicode can also be entered in hexadecimal by typing the code in the document and then pressing Alt+C. This key combination can also be used to display the code of the character in front of the cursor. An alternative, which also works in older versions, is to call up a table of Unicode characters via "Insert" → "Special Characters", select the desired one with the cursor and insert it into the text. The program also allows macros to be defined for frequently needed characters, which can then be invoked with a key combination.

GTK and Qt

GTK, Qt and all programs and environments based on them (such as the GNOME desktop environment) support input via the key combination Ctrl+Shift+U (in newer versions Ctrl+U). After pressing it, a small underlined u appears; then the Unicode code can be entered in hexadecimal form and is likewise underlined, so that one can recognize what belongs to the code. After pressing the space bar or Enter, the corresponding character appears.

Vim

In the text editor Vim, Unicode characters can be entered with Ctrl+V, followed by the key u and the Unicode code in hexadecimal form.

Selection via character tables

Since Windows NT 4.0, the program charmap.exe, called Character Map, has been built into Windows. With this program, Unicode characters can be selected via a graphical user interface and copied for insertion. It also provides an input field for the hexadecimal code.

In Mac OS X, a system-wide character palette is likewise available under Insert → Special Characters.

The free programs gucharmap (for Windows and Linux/UNIX) and kcharselect (for Linux/UNIX) display the Unicode character set on screen and provide additional information about each character.

Code point references in documents

HTML and XML support Unicode with numeric character references that represent the Unicode character independently of the character set: the notation is &#0000; for decimal or &#x0000; for hexadecimal notation, where 0000 stands for the Unicode number of the character. For certain characters, named character references (named entities) are additionally defined, e.g. &auml; represents ä; this, however, applies only to HTML. XML and the XHTML derived from it define named notations only for those characters that would be interpreted as parts of the markup language in normal use, i.e. < as &lt;, > as &gt;, & as &amp; and " as &quot;.
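Producing both numeric reference forms for a given character takes only a couple of lines (a minimal sketch; the function name is invented):

    # Emit the decimal and hexadecimal character references described above.
    def char_refs(ch: str):
        cp = ord(ch)
        return f"&#{cp};", f"&#x{cp:X};"

    print(char_refs("ä"))  # ('&#228;', '&#xE4;')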

Fonts

Whether a Unicode character actually appears on screen depends on whether the font used contains a glyph for the desired character (i.e. a graphic for the desired character number). Often, for example on Windows, if the font in use does not contain a character, a character from another font is inserted where possible.

Meanwhile, the code space of Unicode/ISO has reached a size (more than 100,000 characters) that can no longer be fully accommodated in a single font file. Today's most common font file formats, TrueType and OpenType, can contain at most 65,536 glyphs. Unicode/ISO conformance of a font therefore does not mean that the complete character set is included, but only that the characters it contains are encoded in conformance with the standard. The most comprehensive font, the shareware Code2000 to Code2002 by James Kass, is split into three files; even it, however, does not contain every Unicode character. The publication "decodeunicode", which presents all the characters, names a total of 66 fonts from which its character tables are assembled.

Criticism

Unicode is criticized by scholars and in East Asian countries. One of the points of criticism is the Han unification: from the East Asian point of view, this procedure merges characters of various unrelated languages. Among other things, it is criticized that ancient texts cannot be reproduced faithfully in Unicode owing to this unification of similar CJK characters. Because of this, numerous alternatives to Unicode, such as the Mojikyō standard, were developed in Japan.

The encoding of the Thai script is criticized because, unlike all the other writing systems in Unicode, it is based not on logical but on visual order, which considerably complicates, among other things, the sorting of Thai words. The Unicode encoding is based on the Thai standard TIS-620, which also uses the visual order. Conversely, the encoding of the other Indic scripts is sometimes described as "too complicated", especially by representatives of the Tamil script. The model of separate consonant and vowel characters, which Unicode adopted from the Indian standard ISCII, is rejected by those who would prefer separate code points for all possible consonant-vowel combinations. The government of the People's Republic of China made a similar proposal to encode the Tibetan script as syllables rather than as individual consonants and vowels.
