Stemming

Stemming (also stem-form reduction or normal-form reduction) is the term used in information retrieval and in computational linguistics for a process that reduces the different morphological variants of a word to their common word stem, for example the declined German forms Wortes and Wörter ("of the word", "words") to Wort, and the conjugated forms gesehen and sah ("seen", "saw") to seh.
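The idea of mapping several variants to one stem can be sketched with a toy suffix stripper. This is only an illustration under simple assumptions: the suffix list and the `toy_stem` function are hypothetical and hand-picked for the example, and they are not the Porter algorithm or any other published stemmer.

```python
# Toy suffix-stripping stemmer (illustrative only, not a published algorithm).
# SUFFIXES is a hypothetical, hand-picked list; it is checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


variants = ["walk", "walks", "walked", "walking"]
stems = {toy_stem(w) for w in variants}
print(stems)  # all four variants collapse to the single stem "walk"
```

A rule set this small fails on irregular forms (for example "saw" and "seen"), which is one reason real stemmers use far more elaborate rules or a lexicon.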

History

In 1968 Julie Beth Lovins published the first known stemming algorithm. This algorithm had a great influence on the later development of stemming algorithms. A later stemmer was published by Martin Porter in 1980 and became the de facto standard for stemming English texts. In 2000 Porter received the Tony Kent Strix Award for his work on stemming algorithms and information retrieval.

Many implementations of the Porter stemmer were written and distributed free of charge, but many of them contained small errors. As a result, these stemmers never reached their full potential. To eliminate this source of error, Porter published an official implementation of the algorithm around the year 2000. In the following years he extended this work by creating Snowball, a framework for writing stemming algorithms. He also created an improved stemmer for English, together with stemmers for other languages.

Stemming process

There are different stemming algorithms for different languages. Developing a stemmer is an experimental science, because the algorithms cannot be formally verified; they can only be tested on text corpora and in practice.

Examples:

  • Porter stemmer algorithm
  • KSTEM (Robert Krovetz: Viewing morphology as an inference process, 1993)
  • N-gram method
  • Lexicon-based stemming (lemmatization)
  • Corpus-based stemming
  • Statistical methods
  • Computational linguistic methods
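As an illustration of the n-gram method listed above, the following minimal sketch compares two words by the character bigrams they share, using the Dice coefficient. The function names `bigrams` and `dice` are chosen for this example only, and the step of grouping words by a similarity threshold is left out.

```python
def bigrams(word: str) -> set:
    """Return the set of distinct character bigrams of a word."""
    word = word.lower()
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice(a: str, b: str) -> float:
    """Dice coefficient: 2 * shared bigrams / (bigrams in a + bigrams in b)."""
    ga, gb = bigrams(a), bigrams(b)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


# "statistics" has 7 distinct bigrams, "statistical" has 8, and 6 are shared.
print(dice("statistics", "statistical"))  # 0.8
```

Words whose similarity exceeds a chosen threshold can then be treated as variants of the same term without deriving an explicit stem.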

An alternative approach, much simpler but less accurate, is to search for partial strings, for example with the asterisk (wildcard) operator. This is also called truncation.
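A small sketch of truncation, using Python's standard `fnmatch` module as a stand-in for the wildcard operator of a search system; the word list is made up for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical vocabulary for the example.
words = ["walk", "walked", "walking", "walkway", "wall"]

# "walk*" matches every word that begins with "walk".
matches = [w for w in words if fnmatchcase(w, "walk*")]
print(matches)
```

The pattern also returns "walkway", which is not an inflected form of "walk". This illustrates why truncation is less accurate than stemming.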

Comments

In contrast to searching with, for example, regular expressions, which would be too slow on large data sets such as those of search engines, a collection of texts is indexed once so that it can then be searched quickly.
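The indexing idea can be sketched as a small inverted index whose keys are stems. This is a minimal sketch under stated assumptions: `toy_stem`, `build_index` and `search` are hypothetical helpers for this example, the stemmer is a crude suffix stripper rather than a real algorithm, and the documents are invented.

```python
from collections import defaultdict


def toy_stem(word):
    """Crude illustrative stemmer: strip one of a few hand-picked suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Build the index once: map each stem to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[toy_stem(token)].add(doc_id)
    return index


def search(index, query):
    """Stem the query term and look it up, without scanning the texts again."""
    return index.get(toy_stem(query.lower()), set())


docs = {1: "she walked home", 2: "he walks fast", 3: "they talk"}
index = build_index(docs)
print(search(index, "walking"))  # ids of documents 1 and 2
```

Each query is a single dictionary lookup, whereas a regular-expression search would have to read every text again for every query.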

In some languages, the decomposition and composition of words (for example, ran away ⇒ run away) play an important role.
