Regular expression

A regular expression (English regular expressions, abbreviated as regex or regexp ) is in theoretical computer science, a string that is used to describe of sets of strings with the help of certain syntactic rules. Regular expressions are mainly used in software development using. In addition to implementations in many programming languages ​​, many text editors have regular expressions in the function ' search and replace '. A simple application of regular expressions are wildcards.

Regular expressions can be used as filter criteria in the text search by the text is matched against the regular expression pattern. This process is also called pattern matching. For example, it is possible to pick out all the words from a list of words that start with S and end at D - without having the intermediate characters or the number specified explicitly.

The term of the regular expression is mainly due to the mathematician Stephen Kleene. This used an almost identical notation that he "regular sets" called.

  • 3.1 Character literals
  • 3.2 A character from a selection
  • 3.3 Predefined character classes
  • 3.4 quantifiers
  • 3.5 Possessives behavior
  • 3.6 groups and backreferences
  • 3.7 alternatives
  • 3.8 Other characters
  • 3.9 Look -around assertions
  • 3:10 Conditional Expressions

Regular expressions in theoretical computer science

Theoretical foundations

Regular expressions describe a family of formal languages ​​and belong to theoretical computer science. These languages ​​, the regular languages ​​, are located on the bottom and thus weakest expression level of the Chomsky hierarchy (namely, type -3). They are generated by regular grammars.

For every regular expression, there exists a finite automaton that accepts the language specified by the expression. This finite state machine can be constructed starting from regular expressions. It follows from the relatively simple implementability of regular expressions. Conversely, there exists for every finite automaton is a regular expression that describes the language accepted by the machine.

Syntax

The syntax defines exactly how regular expressions look like.

Regular expressions are always defined over a given character set, called the alphabet. Regular expressions are based on exactly three operations: alternative, concatenation and repetition. The formal definition is as follows:

For an alternative, the icon will be used instead. It then writes. For the concatenation ( concatenating ) there is also use an operator symbol; to write then.

You can allow additional constants and operations, provided that their effect could be described with the principles mentioned above. Thus one finds in the literature, among other things as a regular expression or the positive Kleene shell, which can be regarded as an abbreviation of.

Are you a priority of operators, one can dispense with some brackets. The ranking is usually Kleene star before concatenation before alternative. Then, instead of the notation is sufficient.

The number of nested * operators is called the astrolabe.

Semantics

The semantics of regular expressions defined exactly which formal significance of the regular expression syntax.

A regular expression describes a formal language, ie a set of words ( strings). The definition of the semantics can be analogous to describe the syntax definition. It refers to the formal language that is specified by the regular expression.

Whose meaning contains the syntax definition of regular expressions and the constant, so is defined as, ie the language containing only the empty word.

The empty word is a word of a formal language () and thus not a regular expression. The language containing only the empty word, but can be even without the constant by a regular expression describing, for example:. However, it is not always distinguished visually between a regular expression and the associated language, so that one used instead as a regular expression for the language, just as the distinction between and, and between and omitted.

Examples

If the alphabet of the letters, and there is so, then let the following languages ​​with the corresponding regular expressions describe:

  • The language of all words that consists of any number or any number of: Syntax:. semantics:
  • The language of all words that begin with: Syntax:. semantics:
  • The language of all words that begin and end with and intervening only consist of (or anything in between ): Syntax:. semantics:
  • The language of all words that consist of two characters, but not included: Syntax:. semantics:

Application of regular expressions

Ken Thompson used this notation in the 1960s to build qed ( a previous version of the Unix editor ed) and later the tool to write grep. Since then implement many programs and libraries of programming functions to use regular expressions to search and replace strings. Examples are the programs sed, grep, emacs and libraries of programming languages ​​Perl, C, Java, Python, PHP, Ruby and. NET Framework. Also, the word processor and the spreadsheet of the office suite OpenOffice.org offers the ability to search using regular expressions in the text.

Between different regexp implementations there are differences in functionality and syntax. In programming languages ​​, the Perl Compatible Regular Expressions ( PCRE ) have predominantly enforced, which are based on the implementation in Perl 5.0. In addition, a distinction is made between POSIX.2 " basic " regular expressions (basic regular expressions ) and "extended" regular expressions (extended regular expressions ).

Some programs, such as the text editor Vim, offer the possibility to switch back and forth between different regexp syntaxes.

Regular expressions play an important role in the lexical analysis of source code, such as compilers or syntax highlighting in editors. A lexical scanner parses the source code using regular expressions in so-called tokens ( keywords, operators, ...). Since they are context-free languages ​​with most programming, regular expressions are not powerful enough to describe their syntax. Therefore, the following compilers in syntactic analysis is usually by a separate program, the parser does.

Regular expressions in practice

Most current implementations support extensions such as backreferences ( back references ). This is not about regular expressions in the sense of theoretical computer science, because so advanced expressions describe not necessarily languages ​​of type 3 of the Chomsky hierarchy.

The following syntax descriptions refer to the syntax of the current implementations of extensions, so they only partially corresponding to the above definition of the theoretical computer science.

A common use of regular expressions is to find specific strings in a set of strings. The description given below is an (often used ) Convention in order to implement concepts such as character class, quantification, link and summarize concrete. Here is a regular expression from the character of the underlying alphabet in combination with the metacharacters [] () {} |? - * ^ $ \. (partly depending on the context ) is formed. The meta- property of a character can be lifted by a preceding backslash character. All other characters of the alphabet stand for themselves

Character literals

Those characters (literally, literal ) match directly are also listed directly. Depending on the system, there are also ways the character by octal or hexadecimal ( \ ooo and \ xhh ) or the hexadecimal Unicode position ( \ uhhhh ) indicated.

A character from a selection

With square brackets can be a character selection define ( [and]). The expression in square brackets is then available for a single character from this selection. Within this character class definitions, some symbols have different meanings than in the normal context. Sometimes the meaning of a symbol from the context -dependent, in which it occurs inside the parentheses.

For example, a circumflex " ^ " at the beginning of a character class definition is that the character class is negated / inverted ( in the sense of complementation ). If a circumflex but somewhere else in the definition, it is literal to understand. Also, depending on the context, the meaning of the hyphen character ( - ). In addition, here the regexp engines (for example, POSIX and PCRE ) differ in some respects from each other. If a hyphen - between two characters in the class definition, for example " [ ag ] ", it is to be understood as bis- line, that is, with respect to the ASCII table as a description of a character or character range interval. The above example would be equivalent to " [ abcdefg ] ". At the beginning or end of a character class standing hyphens be interpreted literal.

Predefined character classes

There are predefined character classes that are not supported by all implementations in the same way, because they are simply short forms and can also be described by a character selection. Important character classes are:

A dot (. ) Means that in its place a ( almost) any character can stand. Most RegExp implementations do not see default newline ( line break ) as any character to, but may in some programs using the so-called single-line Modifiers s ( for example in / foo.bar / s work) is to be achieved.

In many newer implementations and classes can be specified within the square brackets after POSIX, which themselves contain square brackets. They read, for example:

  • [: cntrl: ] - control characters. In ASCII, the characters are the 00 -1F and 7F (DEL).
  • [: print:] - Printable characters: [: alnum :], [: punct: ] and space [: space:] - Whitespace: Horizontal and vertical tab, line and form feed, carriage return and LeerzeichenZK1 [: blank:] - space or tab
  • [: punct: ] - Punctuation marks such as: . "# $% & '() * , - /:; < => @ [\ ] ^ _ ` {| } ~?
  • [: alnum: ] - Alphanumeric characters: [: alpha :] or [: digit:] [: xdigit: ] - Hexadecimal digits: 0 to 9, A to F, a to f [: digit:] - The digits 0 to 9
  • [: lower:] - KleinbuchstabenZK2: not necessarily only from a to z
  • [: upper:] - GroßbuchstabenZK2: not necessarily only from A to Z

Quantifiers

Quantifiers (English quantifier, also quantifier or repetition factors ) allow it to admit the previous expression in different multiplicities in the string.

The quantifiers relate to the previous regular expression, but not necessarily to the found him by coincidence. Thus, although, for example, by a "a" or represent " aaaa ", but corresponds to [0-9 ] is not only repeating the same numbers, but also consequences of mixed numbers, such as " 072 345 ".

Other examples are:

  • " [Ab ] " matches " a", " b", " aa ", " bbaab " etc.
  • " [0-9 ] { 2,5 }" corresponds to two, three, four or five digits in sequence, such as " 42 " or " 54072 ", but not the character strings " 0", " 1.1 " or " A1A1 ".

If a string only from the given pattern exist ( and it does not only contain ), it must be explicitly defined in most implementations that the pattern from the beginning ( \ A or ^ ) QF1 to the end of the string ( \ Z, \ z or $ ) to QF1 rich. Otherwise, for example, [0-9 ] { 2,5 } detects even with the string " 1234507 ", the sub-string "12345". For the same reason a example, would * always result in a hit, because each string - even the empty string "" - at least 0 times the character "a " contains.

Quantifiers are implemented by default " greedy" (English greedy ). That is, a regular expression resolves to the maximum match. However, since this behavior is not always as intended, can be in many recent implementations quantifiers as " modest " or " restrained " (English non-greedy, reluctant ) declare. For example, in Perl this the quantifier a question mark? readjusted. The implementation of frugal quantifiers is comparatively expensive ( requires backtracking ), which is why not all implementations support this.

Possessives behavior

A variant of the greedy behavior described above is the possessive matching. Since in this case, however, the backtracking is prevented, even matching characters will not be released again. Because of this can be found in the literature, the synonymous names of atomic grouping, independent subexpression or non- backtracking subpattern. The syntax for these constructs varies for the different programming languages. Originally such subexpressions ( subpattern ) were formulated in Perl by (? > Expression). In addition, since Perl 5.10 there are the equivalent, already available in Java possessive quantifiers , *, ,? And { min, max } .

Groups and backreferences

Expressions can be used with round brackets ( and ) summarize: Approximately allowed " (abc ) " a "abc" or " abcabc " etc.

Some implementations save from the found matches of groups and allow their reuse in the regular expression or the replacement text. These are called backreferences (English back references ). For this, the notation \ n or $ n is often used, where n is the compliance of the n-th grouping corresponds. A special position in this case represents n = 0, which is mostly for the compliance of the entire regular expression.

Interpreter of regular expressions that allow backreferences no longer conform to the type 3 of the Chomsky hierarchy. The pumping lemma can be shown that a regular expression that determines whether a string before and after 1, the same number of 0, not a regular language.

In addition there are also groups that do not create backreference (English non- capturing ). The syntax for this is in most implementations (? ... ). Regexp documentation indicate that the generation of backward references should always be avoided if no subsequent access successes on it. For the generation of the reference costs execution time and occupies space to store the match is found. In addition, the implementations allow only a limited number of backward references to (often just a maximum of 9 ).

The regular expression "\ d (: - \ d ) * " can be found sequences of numbers separated by dashes, without obtaining the last separated by a dash number sequence as a back reference.

Alternatives

You can alternate expressions with the | allow - "" symbol.

Other characters

To support the often related to character strings applications on the computer, the following characters are in addition to those already mentioned, usually defined as:

Look -around assertions

Perl version 5 resulted in addition to the usual regular expressions and look-ahead and look -behind assertions (such as " forward-looking" and " backward -looking" assumptions / assertions ) a, which is summarized under the term look -around assertions. These constructs extend the regular expressions for the opportunity to formulate context-sensitive conditions without match on the context itself. That is, you want to match on all strings "sport", where the string " simplistic " follows, but without the matched string contains the string " simplistic " self, this would be a look-ahead assertion possible: Sport ( = simplistic? ). In the example sentence. " An athlete runs sport sports club " would that regular expression match on so only the last occurrence of "sport", as only this " simplistic " the string follows; he would not match on the entire substring " sports club ".

Due to the property that the specified context ( in the example " simplistic " ) is specified but not an explicit part of the matched string (here " Sport" ), is associated with assertions usually the attribute is zero-width mitgenannt. The proper names so loud - depending on whether a certain context required (positive) or prohibited ( negative) - zero-width positive / negative look-ahead/behind assertions. The names of the directions stem from the fact that Regexp parser process a string from left to right.

Look- arounds are supported not only of Perl and PCRE, but among other things, of Java,. NET and Python. Some text editors such as Vim offer the possibility, albeit with some other syntax.

Conditional Expressions

Relatively little common are conditional expressions. This can be used, among other things in Perl, PCRE and. NET Framework. Python provides for such expressions associated with look -around assertions only limited functionality.

528646
de