ASCII

The American Standard Code for Information Interchange (ASCII, alternatively US-ASCII, often pronounced [ˈæski]) is a 7-bit character encoding. It corresponds to the U.S. variant of ISO 646 and serves as the basis for later character encodings that use more bits.

The ASCII encoding was published on 17 June 1963 by the American Standards Association (ASA) as standard ASA X3.4-1963 and was updated in 1967 and 1968 (ANSI X3.4-1968). The character encoding defines 128 characters, consisting of 33 non-printable and 95 printable characters. The latter are, starting with the space:

   !"#$%&'()*+,-./
  0123456789:;<=>?
  @ABCDEFGHIJKLMNO
  PQRSTUVWXYZ[\]^_
  `abcdefghijklmno
  pqrstuvwxyz{|}~

The printable characters include the Latin alphabet in upper and lower case, the ten Arabic numerals, and some punctuation marks. The character set corresponds largely to that of a keyboard or typewriter for the English language. Computers and other electronic devices that represent text usually store it in ASCII or in a backward-compatible encoding (ISO 8859, Unicode).

The non-printing control characters include output characters such as line feed or tab, protocol characters such as end of transmission or acknowledgement, and separators such as the record separator.

Coding

Each character is assigned a bit pattern of 7 bits. Since each bit can take two values, there are 2^7 = 128 different bit patterns, which can also be interpreted as the integers 0 to 127 (00 to 7F hexadecimal).
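The correspondence between characters, integers and 7-bit patterns can be checked with a short Python sketch. The helper name bit_pattern is an assumption made for this illustration; ord and format are Python built-ins.

```python
def bit_pattern(char: str) -> str:
    """Return the 7-bit pattern of a single ASCII character as a string."""
    code = ord(char)
    if code > 127:
        raise ValueError(f"{char!r} is not an ASCII character")
    # "07b" pads the binary representation to exactly seven digits.
    return format(code, "07b")


print(2 ** 7)            # number of distinct 7-bit patterns: 128
print(bit_pattern("A"))  # code 65 (0x41) -> 1000001
```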

Special characters used in languages other than English, such as the German umlauts, are not included in the ASCII character set.

The eighth bit, which ASCII itself does not use, can serve for error detection (parity bit) on communication lines or for other control tasks. Today it is almost always used to extend ASCII to an 8-bit code. These extensions are largely compatible with the original ASCII, so that all characters defined in ASCII are encoded by the same bit pattern in the various extensions. The simplest extensions are encodings that add language-specific characters not included in the basic Latin alphabet.
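As an illustration of the parity use of the eighth bit, the following Python sketch assumes even parity (the total number of 1 bits in the byte is even); odd parity was also in use, and the function names are hypothetical. A single parity bit can reveal that one bit was flipped in transit but cannot say which one.

```python
def add_even_parity(code: int) -> int:
    """Place an even-parity bit in bit 7 of a 7-bit ASCII code."""
    if not 0 <= code <= 0x7F:
        raise ValueError("not a 7-bit code")
    ones = bin(code).count("1")
    parity = ones % 2              # 1 if the number of 1 bits is odd
    return code | (parity << 7)


def parity_ok(byte: int) -> bool:
    """Check that a received byte contains an even number of 1 bits."""
    return bin(byte).count("1") % 2 == 0


sent = add_even_parity(ord("C"))   # "C" is 1000011, three 1 bits
damaged = sent ^ 0b00000100        # simulate a single flipped bit
print(parity_ok(sent), parity_ok(damaged))
```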

History

An early form of character encoding was Morse code. With the introduction of teleprinters it was displaced from the telegraph networks and replaced by the Baudot code and the Murray code. From the 5-bit Murray code to 7-bit ASCII was then only a small step; ASCII, too, was first used for certain American teleprinter models, such as the Teletype ASR33. In the early days of the computer age, ASCII became the standard code for characters. For example, many terminals (such as the VT100) and printers were controlled with ASCII alone.

ASCII originally served to represent the characters of the English language. The first version, still without lowercase letters and with small deviations from today's ASCII, was established in 1963. In 1968 the ASCII that is still valid today was defined. Later, in order to represent special characters of other languages (e.g. German umlauts), little-used characters were replaced (see DIN 66003, ISO 646), or ASCII was taken as the compatible basis for new codes with eight bits per character (e.g. DIN 66303, ISO 8859). However, even 8-bit codes, in which one byte stands for one character, offered too little space to accommodate all the characters of human writing at the same time. Several different specialized extensions were therefore necessary. There are also some ASCII-compatible encodings that either switch between different code tables or need more than one byte for each non-ASCII character, especially for the East Asian region. None of these 8-bit extensions, however, is "ASCII", because that name designates only the standard 7-bit code.

An encoding that is incompatible with ASCII even for Latin characters (EBCDIC) is used almost exclusively on mainframe computers. The less common ROT47 algorithm applies the method known from the ROT13 cipher to all ASCII characters between 33 ("!") and 126 ("~").
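A minimal Python sketch of ROT47, assuming the usual definition: each of the 94 characters from code 33 to code 126 is rotated by 47 positions within that range, and everything else is left unchanged. The function name rot47 is chosen for this illustration.

```python
def rot47(text: str) -> str:
    """Rotate every character in the range 33..126 by 47 positions."""
    result = []
    for char in text:
        code = ord(char)
        if 33 <= code <= 126:
            # 94 characters in the range; 47 is exactly half of that,
            # so applying the function twice restores the input.
            code = 33 + (code - 33 + 47) % 94
        result.append(chr(code))
    return "".join(result)


print(rot47("Hello, World!"))         # w6==@[ (@C=5P
print(rot47(rot47("Hello, World!")))  # Hello, World!
```

Because the rotation is by half the size of the range, the same function serves for both encoding and decoding, just as with ROT13 on the 26 letters.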

Composition

The first 32 ASCII character codes (0x00 to 0x1F) are reserved for control characters; see the entry on control characters for an explanation of the abbreviations used in the ASCII table. These are characters that do not represent written symbols but serve (or served) to control devices that use ASCII, such as printers. Examples of control characters are the carriage return for line breaks and Bell (the bell); their definition has historical reasons.

Code 0x20 (SP) is the space (also called blank), which is used in text as an empty character and separator between words and is produced on the keyboard by pressing the space bar.

The codes 0x21 to 0x7E represent the visible printable characters, comprising letters, digits and punctuation marks.

Code 0x7F (all seven bits set to one) is a special character, also known as the delete character (DEL). This code was formerly used as a control character on paper tape or punched cards to delete an already punched character after the fact by setting all bits, i.e. by punching out all seven positions; holes, once punched, cannot be undone. Regions without holes (i.e. code 0x00) were found mainly at the beginning and end of a punched tape (NUL).

For this reason, only 126 characters belonged to ASCII proper, because the bit patterns 0 (0000000) and 127 (1111111) corresponded to no character codes. Code 0 was later interpreted in the C programming language as the "end of string" marker; various graphic symbols have been assigned to code 127.
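The logic of deleting by punching can be expressed as a bitwise OR: punching can only add holes (1 bits), never remove them, so the one pattern reachable from every other pattern is the one with all seven holes. The following Python sketch uses the hypothetical name overpunch for that operation.

```python
DEL = 0x7F  # all seven bits set


def overpunch(code: int) -> int:
    """Punch all seven positions on top of an existing 7-bit code."""
    # OR models punching: existing holes stay, missing ones are added.
    return code | DEL


# Whatever was on the tape before, the result is always DEL.
print(all(overpunch(code) == DEL for code in range(128)))
```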
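The code ranges described in this section can be summarized in a small Python sketch. The function name classify and the four category labels are assumptions made for this illustration.

```python
from collections import Counter


def classify(code: int) -> str:
    """Assign a 7-bit code to one of the ranges described above."""
    if not 0 <= code <= 0x7F:
        raise ValueError("not a 7-bit ASCII code")
    if code <= 0x1F:
        return "control"    # 0x00..0x1F
    if code == 0x20:
        return "space"      # SP
    if code <= 0x7E:
        return "printable"  # 0x21..0x7E
    return "delete"         # 0x7F (DEL)


counts = Counter(classify(code) for code in range(128))
print(counts)
```

Counting the categories gives 32 control codes, the space, 94 visible characters and DEL. The 33 non-printable characters are thus the 32 control codes plus DEL, and the 95 printable characters are the space plus the 94 visible ones.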

Extensions

ASCII contains no diacritical marks, although these are used in almost all languages based on the Latin alphabet. The international standard ISO 646 (1972) was the first attempt to address this problem, but it led to compatibility problems. It is still a seven-bit code, and because no other codes were available, some codes were reassigned in the new variants.

For instance, ASCII position 93 for the right square bracket (]) is replaced in the German variant ISO 646-DE by the capital U with umlaut (Ü) and in the Danish variant ISO 646-DK by the capital A with ring (Å). Programmers then had to substitute the corresponding national characters for the square brackets used in many programming languages. This reduced the readability of program text and often led to unintentionally comical results, such as the Apple II startup message "APPLE ][" mutating into "APPLE ÜÄ".
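The effect can be simulated in Python with a translation table. The text above states only the replacement of ] by Ü (and implies [ becoming Ä); the remaining entries follow the German variant as commonly documented for DIN 66003 and should be read as an assumption of this sketch, as should the names ISO646_DE and shown_on_german_device.

```python
# Positions where the German 7-bit variant differs from US-ASCII
# (assumed from DIN 66003; only "]" -> "Ü" is stated in the text).
ISO646_DE = str.maketrans({
    "@": "§",
    "[": "Ä",
    "\\": "Ö",
    "]": "Ü",
    "{": "ä",
    "|": "ö",
    "}": "ü",
    "~": "ß",
})


def shown_on_german_device(ascii_text: str) -> str:
    """Show how US-ASCII codes appear on a device using ISO 646-DE."""
    return ascii_text.translate(ISO646_DE)
```

With this table, shown_on_german_device("APPLE ][") yields the garbled startup message quoted above, and an array access such as a[i] would appear as aÄiÜ.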

Various manufacturers developed their own eight-bit codes. Code page 437 was long the most widely used; it was used on the IBM PC under English MS-DOS and is still used in the DOS window of English Microsoft Windows. In German installations, the Western European code page 850 has been the default since MS-DOS 3.3.

Later standards such as ISO 8859 also used eight bits. There are several variants, for example ISO 8859-1 for Western European languages. German-language versions of Windows (except in the DOS window) use Windows-1252, an encoding that builds on ISO 8859-1; this is why, for example, German umlauts look wrong when text files are created under DOS and viewed under Windows.

Many older programs that used the eighth bit for their own purposes could not handle this. Over time they were often adapted to the new requirements.
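The umlaut problem can be reproduced with Python's built-in codecs "cp850" and "cp1252". The sketch below is only an illustration; the function name is hypothetical, and errors="replace" is used because a few byte values have no assignment in Windows-1252.

```python
def dos_text_seen_in_windows(text: str) -> str:
    """Encode text as a DOS program would (code page 850) and decode
    the bytes as a Windows program would (Windows-1252)."""
    raw = text.encode("cp850")
    return raw.decode("cp1252", errors="replace")


# "ä" is byte 0x84 in code page 850; Windows-1252 assigns that byte
# to a low double quotation mark, so the word comes out garbled.
garbled = dos_text_seen_in_windows("K\u00e4se")
print(garbled == "K\u201ese")
```

Characters from the ASCII range pass through unchanged, since both code pages agree with ASCII for the codes 0 to 127.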

To meet the requirements of the different languages, Unicode (identical in its character set to ISO 10646) was developed. It uses up to 32 bits per character and could thus distinguish over four billion different characters, but it is limited to about one million permitted code points. All characters used by humans so far can therefore be represented, provided they have been included in the Unicode standard. UTF-8 is an 8-bit encoding of Unicode that is backward compatible with ASCII. A character can occupy one to four 8-bit words. Seven-bit variants no longer need to be used, but Unicode can also be encoded in seven bits using UTF-7. As of 2011, UTF-8 is developing into a uniform standard in most operating systems. Among others, Apple's Mac OS X and some Linux distributions use UTF-8 by default, and more and more web pages are created in UTF-8.
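The backward compatibility and the variable length of UTF-8 can be observed directly in Python. The helper name utf8_length and the sample characters are choices made for this sketch.

```python
def utf8_length(char: str) -> int:
    """Number of bytes UTF-8 needs for a single character."""
    return len(char.encode("utf-8"))


# One character each that needs 1, 2, 3 and 4 bytes respectively.
for char in ("A", "\u00e4", "\u20ac", "\U0001F600"):
    encoded = char.encode("utf-8")
    print(f"U+{ord(char):04X}: {encoded.hex(' ')} ({len(encoded)} bytes)")

# Pure ASCII text produces exactly the same bytes in both encodings.
print("ASCII".encode("utf-8") == "ASCII".encode("ascii"))
```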

ASCII contains only a few characters that are universally used for formatting or structuring text; these emerged from the control commands of teleprinters. They include in particular the line feed, the carriage return, the horizontal tab, the form feed, and the vertical tab. In typical ASCII text files, apart from the printable characters, usually only the carriage return or line feed appears, marking the end of a line. DOS and Windows systems commonly use both in sequence, older Apple and Commodore computers (except the Amiga) only the carriage return, and Unix-like and Amiga systems only the line feed. The use of further characters for text formatting is handled inconsistently. For formatting text, markup languages such as HTML are now used instead.
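A program that reads text files from different systems therefore often normalizes the line endings first. The following Python sketch converts all three conventions to a single line feed; the function name normalize_newlines is an assumption of this illustration.

```python
def normalize_newlines(text: str) -> str:
    """Convert CR+LF (DOS/Windows) and lone CR (older Apple and
    Commodore systems) to a single LF (Unix-like systems)."""
    # CR+LF must be handled first; otherwise each pair would
    # turn into two line feeds.
    return text.replace("\r\n", "\n").replace("\r", "\n")


mixed = "dos line\r\nold mac line\runix line\n"
print(normalize_newlines(mixed).split("\n"))
```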

Compatible character encodings

Most character encodings are designed so that they use the same codes as ASCII for characters in the range 0 to 127 and use the range above 127 for further characters.
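Whether a given encoding agrees with ASCII in the range 0 to 127 can be tested by decoding all 128 byte values and comparing the result. The sketch below uses codec names from Python's standard library; the selection of candidates and the function name are assumptions made for illustration, with an EBCDIC code page ("cp500") as a counterexample.

```python
ASCII_BYTES = bytes(range(128))
CANDIDATES = ("latin-1", "cp1252", "cp437", "koi8_r", "utf-8")


def agrees_with_ascii(encoding: str) -> bool:
    """True if the encoding decodes bytes 0..127 exactly like ASCII."""
    return ASCII_BYTES.decode(encoding) == ASCII_BYTES.decode("ascii")


for name in CANDIDATES + ("cp500",):
    print(name, agrees_with_ascii(name))
```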

Codes with fixed length (selection)

Here a fixed number of bytes is used for each character. In most encodings this is one byte per character, called a single-byte character set or SBCS for short. For the East Asian scripts it is two or more bytes per character, so these encodings are not ASCII-compatible. The compatible SBCS character sets correspond to the ASCII extensions discussed above:

  • ISO 8859, with 15 different character encodings covering all European languages, Turkish, Arabic, Hebrew and Thai
  • MacRoman, MacCyrillic and other proprietary character sets for Apple Mac computers before Mac OS X
  • Windows and DOS code pages, e.g. Windows-1252
  • KOI8-R for Russian and KOI8-U for Ukrainian
  • ARMSCII-8 and ARMSCII-8a for Armenian
  • GEOSTD for Georgian
  • ISCII for all Indian languages
  • TSCII for Tamil

Codes with variable length

In order to encode more characters, the characters 0 to 127 are encoded in one byte, while other characters are encoded by several bytes with values above 127:

  • UTF-8 for Unicode
  • Big5 for Traditional Chinese (Republic of China, overseas Chinese)
  • EUC (Extended Unix Code) for several East Asian languages
  • GB (Guojia Biaozhun) for Simplified Chinese (PRC)
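The byte layout described above can be inspected with Python's built-in codecs. The sketch uses "gb2312" and "euc_kr" as examples of the GB and EUC families; the helper name byte_values and the sample characters are assumptions of this illustration.

```python
def byte_values(char: str, encoding: str) -> list:
    """Return the byte values that an encoding uses for one character."""
    return list(char.encode(encoding))


# An ASCII character stays a single byte below 128 ...
print(byte_values("A", "gb2312"))
# ... while a Chinese or Korean character becomes two bytes above 127.
print(byte_values("\u4e2d", "gb2312"))
print(byte_values("\ud55c", "euc_kr"))
```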

ASCII table

The ASCII table lists all the codes of the ASCII character set; see the entry on control characters for the meaning of the abbreviations.
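Such a table can be generated with a few lines of Python. The sketch below arranges the 128 codes in 8 columns (high hexadecimal digit) and 16 rows (low hexadecimal digit), using the customary abbreviations for the control characters; the layout and the function names are choices made for this illustration.

```python
CONTROL_NAMES = (
    "NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI "
    "DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US"
).split()


def ascii_name(code: int) -> str:
    """Return the abbreviation or the printable character for a code."""
    if not 0 <= code <= 0x7F:
        raise ValueError("not a 7-bit ASCII code")
    if code < 0x20:
        return CONTROL_NAMES[code]
    if code == 0x20:
        return "SP"
    if code == 0x7F:
        return "DEL"
    return chr(code)


def ascii_table() -> str:
    """Build the table as text: columns 0_ to 7_, rows _0 to _F."""
    header = "    " + " ".join(f"{high:X}_".ljust(4) for high in range(8))
    rows = [header.rstrip()]
    for low in range(16):
        cells = " ".join(
            ascii_name(high * 16 + low).ljust(4) for high in range(8)
        )
        rows.append(f"_{low:X}  {cells}".rstrip())
    return "\n".join(rows)


print(ascii_table())
```

The code of a character is read by combining the column and row labels; for example, "A" stands in column 4_ and row _1, giving 0x41.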
