Tokenization
Tokenization, in computational linguistics, is the segmentation of a text into units at the word level (sometimes sentences, paragraphs, etc.). Tokenization of a text is a prerequisite for its further processing, for example for syntactic analysis by a parser, in text mining, or in information retrieval.
In computer science, the term analogously denotes the decomposition of a computer program written in a programming language into its smallest units; see Token (compiler) and token-based compression.
Problems of tokenization
Usually, tokenization divides a text into its words. Whitespace tokenization is the simplest form of such a decomposition: the text is split at spaces and punctuation characters. For non-segmenting writing systems such as Chinese or Japanese, it cannot be applied, because these scripts do not use spaces between words.
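A minimal sketch of whitespace tokenization in Python (the function name and the splitting off of punctuation are illustrative choices, not part of a standard API):

```python
import re

def whitespace_tokenize(text):
    # Split the text at whitespace, then separate punctuation
    # characters from the adjacent word forms.
    tokens = []
    for chunk in text.split():
        # \w+ matches a run of word characters; [^\w\s] matches
        # a single punctuation character.
        tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens

print(whitespace_tokenize("Hello, world! This is a test."))
# → ['Hello', ',', 'world', '!', 'This', 'is', 'a', 'test', '.']
```

Applied to a Chinese or Japanese sentence, `text.split()` would return the whole sentence as a single chunk, which illustrates why this method fails for non-segmenting scripts.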
In an alternative tokenization method, each sequence of letters forms a token, as does each sequence of digits. Every other character constitutes a token by itself.
However, both methods are problematic in the case of multi-word lexemes, especially proper names, currency amounts, and so on. For the sentence "Klaus-Rüdiger buys in New York for $2.50 Fish'n'Chips", the following segmentation into tokens would be adequate from a linguistic point of view:
Klaus-Rüdiger | buys | in | New York | for | $2.50 | Fish'n'Chips

Literature
- Kai-Uwe Carstensen, Christian Ebert, Cornelia Ebert, Susanne Jekat, Ralf Klabunde, Hagen Langer: Computational Linguistics and Language Technology. An Introduction. 3rd edition. Spektrum Akademischer Verlag, Heidelberg 2010, ISBN 978-3-8274-2023-7, pp. 264–271.
See also
- Computational Linguistics
- Indexing