Tokenization

Tokenization, in computational linguistics, is the segmentation of a text into units at the word level (sometimes also sentences, paragraphs, etc.). The tokenization of a text is a prerequisite for its further processing, for example for syntactic analysis by a parser, in text mining, or in information retrieval.

In computer science, the term analogously denotes the decomposition of a computer program written in a programming language into its smallest units, see token (compiler) and token-based compression.

Problems of tokenization

Usually, a text is divided into its words during tokenization. Whitespace tokenization is the simplest form of such a decomposition: the text is split at spaces and punctuation characters. It cannot be applied to non-segmenting writing systems such as Chinese or Japanese, because these do not separate words with spaces.
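
A minimal sketch of such a whitespace tokenization in Python (the function name and the example sentence are illustrative, not taken from the text above): the text is first split at whitespace, and leading or trailing punctuation is then detached into tokens of its own.

    import re

    def whitespace_tokenize(text):
        """Split at whitespace, then detach punctuation as separate tokens."""
        tokens = []
        for chunk in text.split():
            # \w+ keeps word characters together, [^\w\s] isolates punctuation.
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
        return tokens

    print(whitespace_tokenize("A simple sentence, split at spaces and punctuation."))
    # ['A', 'simple', 'sentence', ',', 'split', 'at', 'spaces', 'and', 'punctuation', '.']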

In an alternative tokenization method, every sequence of letters forms a token, as does every sequence of digits. Every other character forms a token of its own.
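
A sketch of this rule, again in Python; the regular expression is an assumption of this example, not prescribed by the text: runs of letters and runs of digits are matched first, and any remaining non-space character becomes a single-character token.

    import re

    # Letter runs, digit runs, then any other non-space character on its own.
    TOKEN_PATTERN = re.compile(r"[^\W\d_]+|\d+|\S")

    def tokenize(text):
        return TOKEN_PATTERN.findall(text)

    print(tokenize("Klaus-Rüdiger buys Fish'n'Chips for $2.50"))
    # ['Klaus', '-', 'Rüdiger', 'buys', 'Fish', "'", 'n', "'", 'Chips', 'for', '$', '2', '.', '50']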

However, both methods are problematic in the case of multi-word lexemes, especially proper names, currency amounts and so on. For the sentence Klaus-Rüdiger buys in New York for $2.50 Fish'n'Chips, a segmentation into the following sequence of tokens would be adequate from a linguistic point of view:

Klaus-Rüdiger   buys   in   New York   for   $2.50   Fish'n'Chips
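
Neither of the two simple methods above produces this segmentation. A small illustration in Python (the variable names and the regular expression are assumptions of this sketch): the character-class tokenization splits the multi-word units apart, whereas the adequate segmentation keeps them together.

    import re

    sentence = "Klaus-Rüdiger buys in New York for $2.50 Fish'n'Chips"

    # The character-class tokenization sketched above tears the
    # multi-word lexemes apart ...
    naive = re.findall(r"[^\W\d_]+|\d+|\S", sentence)
    # ['Klaus', '-', 'Rüdiger', 'buys', 'in', 'New', 'York', 'for',
    #  '$', '2', '.', '50', 'Fish', "'", 'n', "'", 'Chips']

    # ... whereas the linguistically adequate segmentation keeps them together:
    adequate = ["Klaus-Rüdiger", "buys", "in", "New York",
                "for", "$2.50", "Fish'n'Chips"]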

Literature

  • Kai-Uwe Carstensen, Christian Ebert, Cornelia Ebert, Susanne Jekat, Ralf Klabunde, Hagen Langer: Computational Linguistics and Language Technology. An Introduction. 3rd edition. Spektrum Akademischer Verlag, Heidelberg 2010, ISBN 9783827420237, pp. 264–271.