Lexical analysis

A tokenizer (also: lexical scanner, or lexer for short) is a computer program that decomposes plain text (e.g. source code) into sequences of logically related units called tokens. As such, it is often part of a compiler.

Basics

The decomposition of an input into a sequence of logically related units, the so-called tokens, is also called lexical analysis. Typically, the decomposition follows the rules of a regular grammar, and the tokenizer is implemented as a finite state machine. One method for converting a regular expression into a finite automaton is the Berry-Sethi construction.

A tokenizer is part of a parser and serves a preprocessing function. It recognizes keywords, identifiers, operators, and constants within the input. These consist of several characters, but each forms a logical unit, a so-called token. Recognized tokens are returned together with their respective type. For the parser, tokens are the atomic units it has to process; they are therefore also called terminal symbols.
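
As an illustration of both points, the following minimal sketch in C (all type and function names are invented for this example and belong to no particular compiler) decomposes a short input string into identifiers, integer constants, and single-character operators; each branch of next_token corresponds to a state of a simple finite automaton:

    #include <ctype.h>
    #include <stdio.h>

    /* Hypothetical token types for this toy example. */
    typedef enum { TOK_IDENT, TOK_NUMBER, TOK_OP, TOK_EOF } TokenType;

    typedef struct {
        TokenType type;
        char text[64];
    } Token;

    /* Returns the next token from *input and advances the pointer. */
    static Token next_token(const char **input) {
        Token tok = { TOK_EOF, "" };
        const char *p = *input;
        size_t n = 0;

        while (isspace((unsigned char)*p))   /* skip whitespace */
            p++;

        if (*p == '\0') {
            *input = p;
            return tok;                      /* end of input */
        }
        if (isalpha((unsigned char)*p) || *p == '_') {        /* identifier */
            tok.type = TOK_IDENT;
            while ((isalnum((unsigned char)*p) || *p == '_') && n < 63)
                tok.text[n++] = *p++;
        } else if (isdigit((unsigned char)*p)) {               /* integer constant */
            tok.type = TOK_NUMBER;
            while (isdigit((unsigned char)*p) && n < 63)
                tok.text[n++] = *p++;
        } else {                                               /* single-character operator */
            tok.type = TOK_OP;
            tok.text[n++] = *p++;
        }
        tok.text[n] = '\0';
        *input = p;
        return tok;
    }

    int main(void) {
        const char *src = "x1 = 42 + y";
        Token t;
        while ((t = next_token(&src)).type != TOK_EOF)
            printf("type=%d text=%s\n", t.type, t.text);
        return 0;
    }

For the input "x1 = 42 + y" this prints five tokens: the identifiers x1 and y, the constant 42, and the operators = and +, each together with its type.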

A tokenizer can use a separate, dedicated screener to remove whitespace and comments and thereby simplify the lexical analysis of the input data. However, this must be covered by the underlying grammar.

Using Extended Backus-Naur Form (EBNF), a tokenizer can be formally specified.
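
A sketch of such an EBNF specification for the identifier and number tokens used above (deliberately simplified, not a complete grammar) could look like this:

    letter     = ? any letter "a" to "z" or "A" to "Z" ? ;
    digit      = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
    identifier = ( letter | "_" ) , { letter | digit | "_" } ;
    number     = digit , { digit } ;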

A typical example of such a specification language, and of a tokenizer generator, is Lex, with which corresponding C code for a tokenizer can be generated.
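
A minimal sketch of such a Lex specification (the patterns and actions are chosen only for illustration) might look as follows; running it through lex or flex produces C code containing the scanning function yylex():

    %{
    #include <stdio.h>
    %}
    %%
    [ \t\n]+                 { /* screener role: discard whitespace, return no token */ }
    [0-9]+                   { printf("NUMBER(%s)\n", yytext); }
    [A-Za-z_][A-Za-z0-9_]*   { printf("IDENT(%s)\n", yytext); }
    .                        { printf("OP(%s)\n", yytext); }
    %%
    int yywrap(void) { return 1; }

    int main(void) {
        yylex();    /* tokenize standard input until end of file */
        return 0;
    }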

Tokenizer generators

If a formal description of the vocabulary to be recognized can be given, a tokenizer can be generated automatically. The Lex program included in Unix operating systems and Flex, which was developed as free software, fulfill exactly this function. From the formal description, these programs generate a function that determines the next token in an input text and returns it. This function is then usually used in a parser. See also parser.
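
As a sketch of how such a generated function is typically consumed, assuming the yylex() interface as produced by Flex (the function returns a numeric token code and 0 at the end of the input, while yytext holds the matched text), a hand-written parser loop might look like this:

    #include <stdio.h>

    extern int yylex(void);    /* generated by Flex: returns the next token code */
    extern char *yytext;       /* text of the most recently matched token */

    void parse(void) {
        int token;
        while ((token = yylex()) != 0) {
            /* A real parser would match the token against its grammar
               rules; here it is merely printed. */
            printf("token %d: %s\n", token, yytext);
        }
    }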
