Chomsky hierarchy

Chomsky hierarchy, occasionally Chomsky - Schützenberger hierarchy ( named after the linguist Noam Chomsky and the mathematician Marcel Schützenberger ), is a term used in theoretical computer science. It is a hierarchy of classes of formal grammars that generate formal languages ​​, and was first described in 1956 by Noam Chomsky. The hierarchy levels differ in how rigid are the constraints on the allowable form of production rules to the current stage; in type -0- grammars are unrestricted, restricted progressively stronger at higher levels.

Grammars lower production type are more powerful than the higher types. A language generated by a grammar of type k, ie, a language of the type k addition to the Chomsky hierarchy of grammars occurs in this sense, Chomsky hierarchy of languages.

  • 2.2.1 Definition
  • 2.2.2 languages ​​generated by type -1 grammars
  • 2.2.3 Monotone grammars
  • 2.3.1 Definition
  • 2.3.2 languages ​​generated by type -2 grammars
  • 2.4.1 Definition
  • 2.4.2 languages ​​generated by type-3 grammars


Formal languages ​​have the advantage of being able to be analyzed mathematically exact. Formal languages ​​is a given alphabet ( a character set) based, which is often referred to. Any long finite sequences of elements of the alphabet are referred to as words over the alphabet. Amounts of such words are finally formal languages. The most comprehensive formal language over a given alphabet is infinite; it contains all the words that can be formed by any assembly of characters of the alphabet. They can be described by the Kleene hull of the alphabet, ie. The smallest formal language contains no words. Other formal languages ​​are composed of few words and can thus be noted down for a finite set; for example for the alphabet, the language of all words of length two.

Infinite languages ​​can be understandably not quoted by enumerating. In order to describe them accurately, an approach is needed that can also define infinite sets. This was primarily production methods are used in the context of formal languages ​​.

Assuming an existing word over an alphabet, can be specified by semi-Thue systems, sets of words, which result from any repeated application of rewrite rules. Substitution rules of the form shall require that, this segment is replaced by a word that contains a segment. Semi-Thue systems therefore give a derivation relation between words of formal languages ​​: Two words are related, if that can be derived by a substitution rule from the other word a word.

For the description of formal languages ​​to formal grammars, an opposite semi-Thue systems expanded and more powerful concept suitable. They differ from semi-Thue systems is that they know so-called terminal symbols and nonterminal symbols. Terminal symbols of a grammar are all just a sign of their character set. The first applicable rule is always of a non-terminal start symbol, and the result of a substitution process is only considered word of their formal language, if it contains no non- terminal symbols.

The following example describes a language that can be expressed with which arbitrarily long sums of the numbers 1, 2 and 3. The alphabet is appropriate for it. The corresponding terminal symbols are underlined, nonterminals in italics. The following lines represent the rules

The start symbol of this language is total. On this basis, one can now apply any rules of succession to produce a valid expression:

Not all infinite languages ​​can be described as formal languages ​​with this generation principle, but it is not a concept known tauglicheres. Another common formalism for the description of languages ​​are automata, Turing machines especially. Unrestricted formal grammars and Turing machines are equally powerful in the description of formal languages ​​, ie to each generated by a formal grammar language there is a Turing machine that accepts exactly this, and vice versa.

Just as different automata models have been defined, Chomsky defined in his work different types of grammars. Through three consecutively harsher restrictions to the respective permissible in replacement rules he set up a hierarchy of four classes of formal grammars, whereby the less restricted classes include the more regulated classes each genuine. The same applies for voice classes described by the respective grammar classes also form a hierarchy.

Languages ​​more restricted grammar type are usually easier to handle algorithmically - at the price to be less expressive powerful. Regular expressions with which we define as pattern for the search in texts correspond, for example, the very limited Chomsky grammars of type 3 ( regular grammars ) and such text searches are fast and effective. In contrast, grammars of type 3 are not suited for the description of programming languages, for which one uses usually less restricted type -2 grammars because of their simplicity.

The hierarchy

The Chomsky hierarchy consists of four types of formal grammars, numbered from 0 to 3 Type 0 includes all formal grammars in general, without restriction on the shape of allowed production rules. Grammars of type 1 to type 3 are then restricted increasingly stronger. Alone by definition is a grammar of type 1 and type 0, a grammar of type 2 and type 1, a grammar of type 3 and type 2; inversions do not apply. Therefore, these classes form a true hierarchy. In the corresponding language classes show counter-examples, that also does not apply to the reversals, which is why they form a true hierarchy.

Chomsky called for type 1, type 2 and type 3 grammars that the right-hand side of production rules must not be shorter than the left side, which also excludes the empty word on each right-hand side of a production rule. Today it is often less restrictive; the definitions are often so broad that the hierarchy of languages ​​is not disturbed, but they also make it possible grammars of type 1 ( context-sensitive grammars ) by an exception rule to generate the empty word, and type 2 ( context-free grammars ) and 3 ( regular grammars ) even allow the empty word as the right-hand side of rewrite rules without restriction.

Type 0 grammar ( Chomsky grammar or general phrase structure grammar)


Type 0 grammars are also called unrestricted grammars. This is the class of all formal grammars of the form, with a vocabulary consisting of a finite alphabet and a set of nonterminals. The excellent symbol is called the start symbol and a set of production rules, ie on the left hand side of each production rule is at least one non - terminal symbol. For individual rules to write instead of mostly.

To express the belonging to the class of type -0- grammars, one writes

Languages ​​generated by type -0- grammars

Each type 0 grammar generates a language which can be accepted by a Turing machine and vice versa exists for each language that can be accepted by a Turing machine, a type -0 grammar that generates this language. These languages ​​are also known as the recursively enumerable languages ​​(often semi- decidable languages ​​called ), that is, the associated Turing machine halts on every word that is in the language, and returns the result 1 However, it is not required, that the Turing machine stops for a word that is not in the language with the result 0 ( ie, the machine must not terminate ).

Note that this set of languages ​​( often called decidable languages) on the amount of recursive languages ​​is different, which can be decided by Turing machines.

Type 1 grammar ( context-sensitive grammar)


Type 1 grammars are called context-sensitive grammars. It involves type -0- grammars in which all production rules have the form or, where a nonterminal and words are composed of terminals and nonterminals. The words and may be empty but has at least one symbol (ie, a terminal or a nonterminal ) included. This property is called length limitations.

The only exception to this requirement can be obtained when the start symbol never occurs on the right side of a production. Thus, it is achieved that the context-sensitive languages ​​can also contain the often desired empty word then but always arises only in a one step derivation from the start symbol itself, and without disturbing for all other derivations the property that in each sub-step the set form of longer becomes or remains the same length.

Is a grammar context-sensitive, so you write.

In contrast to context-free grammars can on the left side of the productions of context-sensitive grammars quite instead of individual symbols all symbol sequences are available, provided that it contains at least one nonterminal symbol.

Languages ​​generated by type -1 grammars

The context-sensitive grammars generate exactly the context-sensitive languages ​​, that is, each type 1 grammar generates a context-sensitive language and to each context-sensitive language, there is a type 1 grammar that generates this.

The context-sensitive languages ​​are exactly the languages ​​that can be recognized by a nondeterministic linear bounded Turing machine; that is of a non-deterministic Turing machine whose tape is linearly bounded by the length of the input ( ie there is a constant number, so that the tape of the Turing machine has at most fields, the length of the input word is ).

Monotone grammars

Some authors already then describe a grammar as a context sensitive if ( so ) all production rules to meet the exemption only the condition, ie that the right side of such a production is not shorter than the left side. More often you will find it, however, the notion of monotonic grammar or the non -shortening grammar.

It turns out that the monotone grammar accurately reflects generate the context-sensitive languages ​​, and the two classes of grammars are considered equivalent and some authors treat only one or the other grammar class at all. But not every monotone replacement rule is also context-sensitive, so not every monotone grammar is a context-sensitive.

Type 2 grammar ( context-free grammar)


Type 2 grammars are called context-free grammars. There are type 1 grammars, should apply to the: In every rule of grammar that is a non- terminal symbol must be on the left just standing and on the right side can be any non- empty sequence of terminal and non- terminal symbols are.

It can also be admitted as in type 1 grammars, the exclusion rule, if not find on any right-hand side of a rule.

Context-free grammars are often defined so that the right side can be empty too, so. Such grammars no longer satisfy all properties of type 2 grammars and would no longer be in the hierarchy defined by Chomsky. But you meet the conditions of monotone grammars.

We write

Languages ​​generated by type -2 grammars

Context-free grammars ( with exception rule ) produce exactly the context-free languages ​​, so each type 2 grammar generates a context-free language and to each context-free language, there is a Type -2 grammar that generates this.

The context-free languages ​​are exactly the languages ​​accepted by a nondeterministic pushdown automaton ( NPDA ) can be detected. A subset of these languages ​​is the theoretical basis for the syntax of the most programming.

See also: Backus -Naur Form.

Type 3 grammar ( regular grammar )


Type 3 grammars are also called regular grammars. It is type 2 grammars, where on the right side of productions exactly one terminal symbol may occur and a maximum of a further non-terminal symbol. The allowed position of such nonterminal symbols must also be over all productions be harmonized always before or always behind the terminal symbol, depending on whether one speaks specifically from the left regular and right- regular grammars. They agree with the left - or right - linear grammars, whereas linear grammars do not meet the regular grammars.

For the left regular type-3 grammars so the condition must be satisfied that


You may only include the left regular productions ( non-terminal symbol on the right side in the front position).

Usually one allows for regular grammars, as well as for context-free grammars rules with empty right-hand side, ie.

Contrast, legal Regular grammars satisfy the analogous condition


So this only contain quite regular productions ( non-terminal symbol on the right side if need be in rear position). This condition is also expressed already from the permission empty right sides.

Links Regular and fairly regular grammars generate exactly the same language class, so there's every left regular grammar is also a regular right, generates the same language, and vice versa.

Note that the occurrence of the left regular and right- regular productions in the same Chomsky grammar this is not mentioned regularly. The grammars with both left- regular and right- regular productions namely generate a larger real language class.

Some authors allow the definitions of regular / regular left / right regular grammars, wherever, only a single non-terminal symbol may be in productions, also any non-empty terminal string. This has no effect on the generative power of the classes.

One writes for regular grammars.

Languages ​​generated by type-3 grammars

Regular grammars generate precisely the regular languages ​​, that is, each type-3 grammar generates a regular language and at any regular language there exists a type-3 grammar that generates this.

Regular languages ​​can be alternatively described by regular expressions and regular languages ​​are exactly the languages ​​that can be recognized by finite automata. They are often used to define search pattern or the lexical structure of programming languages.


The following table lists the four types of grammars which form their rules, what name have the languages ​​generated and accurately identify which types of machines these languages. In addition, it is noted against which operations the languages ​​generated are completed.

Is allowed if there is no rule in.

Emptiness problem finiteness problem

Only the left or right only regular productions

Emptiness problem, equivalence problem, ambiguity, finiteness problem

  • Complementation ...
  • Concatenation ...
  • Intersection of ...
  • Union ...
  • Klee ... Escher completion

Chomsky hierarchy of formal languages

A formal language is of the type according to the hierarchy of grammars, if of a type - is generated grammar. Formally, this means that is of type if there is a grammar. We write then

Is a proper subset relationship in the Chomsky hierarchy of formal languages ​​from one language sets of contiguous levels. Each context-sensitive language is recursively enumerable, recursively enumerable but there are languages ​​that are not context-sensitive. Likewise, any context-free language is also context-sensitive, but not vice versa, and every regular language is context- free but not every context-free language is regular.

Formally this means for the classes of languages ​​generated by the above grammars: where occasionally uses the following symbols:

Examples of languages ​​in the differential quantities are:

  • Is of the type 1 but not type 2
  • Is of type 2, but not type 3

Evidence for the non-affiliation of certain languages ​​to language classes and how this often led to the set of loops.

Natural languages

Although Chomsky 's research with the aim pursued, to find a mathematical description of natural languages ​​, to date the evidence of a correct and complete formal grammar for any natural language has been successful. The problem is, inter alia, in the interaction of the various parts of grammar that model each individual linguistic phenomena. In the practical use of formal grammars in computational linguistics ambiguities are added at different levels of language assessment; these must be resolved by context (eg in machine translation ).