Script (Unicode)

In Unicode, a script (writing system) is a group of characters that are used together for writing. In most cases scripts correspond roughly to Unicode blocks, but some scripts are spread over several blocks, and some blocks contain characters from different scripts. Scripts are independent of languages. There are cases in which a script and a language correspond one to one, but many scripts are used to write several different languages: the Latin script, for example, is used for German, English, French, Vietnamese, and many other languages. Conversely, a single language may use more than one script. Turkish, for instance, was formerly written in the Arabic script, whereas today it is written in the Latin script. Whether two sets of characters belong to a common script is not always clear-cut. Unicode treats Japanese kanji as a variant of Chinese characters and merges the two as part of Han unification. The Coptic alphabet, by contrast, was originally regarded as an extension of Greek and was only later encoded as a separate script. Unicode 6.3 encodes a total of 100 different scripts.

Formal definition

Formally, the script to which a character belongs is defined by two properties. In most cases the Script property provides the necessary information; its value is the English name of the script. There are three special values:

  • Unknown marks characters whose script cannot be determined. This applies to unassigned code points and to characters in the private-use ranges.
  • Inherited (523 characters) marks mainly combining characters. These are encoded by appearance, not by use: the acute accent, for example, is used with both Latin and Greek letters. When the script of a text is determined, such characters take on the value of the preceding character.
  • Common (6418 characters) marks characters that can be used in more than one script. Some of these characters are used in only a few related scripts, whereas punctuation marks and symbols can be used with all scripts.

In some cases the Script_Extensions property gives a more precise specification. For characters whose Script value is Inherited or Common but which are used in only a few scripts, it lists those scripts.
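
The lookup of the Script property and the rule for Inherited characters described above can be sketched in Java, whose standard library exposes the property as Character.UnicodeScript. This is a minimal sketch, not a reference implementation: the class name ScriptLookup and the method resolvedScripts are hypothetical, and the fallback to Common for a combining mark at the start of the text is an assumption made here. The values returned depend on the Unicode version bundled with the Java runtime, and the Java standard library does not expose Script_Extensions.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class ScriptLookup {

    // Returns one Script value per code point of the text. A character whose
    // Script value is INHERITED takes on the resolved value of the preceding
    // character, as described in the list above.
    static List<UnicodeScript> resolvedScripts(String text) {
        List<UnicodeScript> result = new ArrayList<>();
        // Assumption: a combining mark with no preceding character
        // falls back to COMMON.
        UnicodeScript previous = UnicodeScript.COMMON;
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            UnicodeScript script = UnicodeScript.of(codePoint);
            if (script == UnicodeScript.INHERITED) {
                script = previous;
            }
            result.add(script);
            previous = script;
            i += Character.charCount(codePoint);
        }
        return result;
    }
}
```

With this sketch, a combining acute accent (U+0301) resolves to Latin after a Latin e and to Greek after a Greek alpha.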

Use

The Script property can be used in various ways. It can serve to identify the script in which a text is written, or to find words in a particular script within a document. To this end, some implementations of regular expressions allow the use of Unicode properties.
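
As an illustration of such a regular expression, the following sketch uses java.util.regex, which accepts script names in the form \p{IsGreek}. The class name ScriptWordFinder and the method greekWords are hypothetical, and the choice of Greek is arbitrary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class ScriptWordFinder {

    // \p{IsGreek} matches any code point whose Script value is Greek.
    private static final Pattern GREEK_RUN = Pattern.compile("\\p{IsGreek}+");

    // Returns every maximal run of Greek-script characters in the text.
    static List<String> greekWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher matcher = GREEK_RUN.matcher(text);
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }
}
```

Because the pattern tests the Script value alone, a combining mark with the value Inherited ends a run; text that uses decomposed accents would need a pattern that also admits such marks.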

Another application is the defense against spoofing attacks. For example, a browser can use this property to recognize that the о in www.unicоde.org is not a Latin but a Cyrillic letter, and warn the user of a possible URL spoofing attempt.
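
A simple form of such a check can be sketched as follows. It collects the scripts that occur in a host name label, ignoring Common and Inherited characters, and flags the label when more than one script remains. The class name MixedScriptCheck and its methods are hypothetical, and the rule "more than one script is suspicious" is a simplifying assumption made here; it is not taken from any particular browser.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.Set;

// Hypothetical helper, for illustration only.
final class MixedScriptCheck {

    // Collects the scripts used in a label. COMMON and INHERITED are skipped
    // because digits, hyphens and combining marks occur with every script.
    static Set<UnicodeScript> scriptsIn(String label) {
        Set<UnicodeScript> scripts = EnumSet.noneOf(UnicodeScript.class);
        label.codePoints().forEach(codePoint -> {
            UnicodeScript script = UnicodeScript.of(codePoint);
            if (script != UnicodeScript.COMMON && script != UnicodeScript.INHERITED) {
                scripts.add(script);
            }
        });
        return scripts;
    }

    // Simplifying assumption: a label that mixes scripts is suspicious.
    static boolean looksMixed(String label) {
        return scriptsIn(label).size() > 1;
    }
}
```

For the label from the example above, scriptsIn reports both Latin and Cyrillic, so looksMixed returns true, whereas the all-Latin spelling passes. A check of this kind also flags legitimate labels in writing systems that combine several scripts, such as Japanese, so a practical implementation would need exceptions for those combinations.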

List

The following list covers all scripts that are represented in Unicode 6.3 by at least 100 characters.

Sources

  • Mark Davis, Ken Whistler: Unicode Standard Annex #24: Unicode Script Property.
  • Julie D. Allen et al.: The Unicode Standard. Version 6.2 - Core Specification. The Unicode Consortium, Mountain View, CA, 2012. ISBN 978-1-936213-07-8. Chapter 6.1: Writing Systems.
  • Scripts.txt, ScriptExtensions.txt (Unicode 6.3)