Part-of-speech Tagging

Part-of-speech tagging (POS tagging) is the assignment of the words and punctuation marks of a text to parts of speech. To this end, both the definition of a word and its context (e.g. adjacent adjectives or nouns) are taken into account.

Method

Parts of speech were originally identified and annotated manually; over time the process has been increasingly automated by computational linguistics. The methods used can be divided into supervised and unsupervised machine learning. Supervised approaches use, for example, hidden Markov models, Eric Brill's method, or decision trees (following Helmut Schmid), and all part-of-speech tags come from a predefined set, the so-called tagset. For German, the Stuttgart-Tübingen-Tagset (STTS) is often used. In unsupervised learning, the tagset is not fixed in advance but emerges from a stochastic process.

Principle

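The German sentence Petra liest einen langen Roman. ("Petra is reading a long novel.") is tagged with the Stuttgart-Tübingen-Tagset as follows:

Petra/NE liest/VVFIN einen/ART langen/ADJA Roman/NN ./$.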

Each word or punctuation mark is followed by a slash and then its tag. To tag the word einen (the article "a" in the example sentence) correctly in the given context, it has to be distinguished from the homonymous forms of the verb einen ("to unite"); these would be tagged VVINF (for the infinitive) or VVFIN (for the finite form).
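The word/TAG notation described above is easy to process mechanically. The following is a minimal sketch, assuming tokens are separated by whitespace and the tag follows the last slash of each token; the function name parse_tagged is hypothetical, and the example string uses standard STTS tags for the example sentence.

```python
def parse_tagged(text):
    """Split a 'word/TAG word/TAG ...' string into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split at the LAST slash so that a word containing a slash
        # keeps it and only the trailing part is read as the tag.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


# Example sentence in slash notation (STTS tags, assumed for illustration).
tagged = "Petra/NE liest/VVFIN einen/ART langen/ADJA Roman/NN ./$."
pairs = parse_tagged(tagged)

for word, tag in pairs:
    print(f"{word}\t{tag}")
```

Splitting at the last slash rather than the first is a design choice that keeps the sentence-final token "./$." intact as the pair (".", "$.").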

In supervised learning, the correct tag for einen is selected with the help of the context: from an already tagged text corpus, the probabilities of the tag sequences VVFIN-ART, VVFIN-VVINF and VVFIN-VVFIN, for example, have been calculated beforehand (the so-called training of the tagger). Since VVFIN-ART is considerably more frequent than the other two sequences, einen is tagged as ART in this sentence. (The frequent sequence kann lesen, "can read", is tagged not with VVFIN-VVINF but with VMFIN-VVINF.)
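The training step and the choice between candidate tags can be sketched as follows. This is a simplified illustration, not a full tagger: the toy corpus is invented, the names transition_prob and choose_tag are hypothetical, and only tag-to-tag transition frequencies are used, whereas a real hidden Markov model tagger would also use word-given-tag probabilities and search over the whole sentence.

```python
from collections import Counter

# Toy training corpus, invented for illustration and already tagged (STTS).
corpus = [
    [("Petra", "NE"), ("liest", "VVFIN"), ("ein", "ART"), ("Buch", "NN"), (".", "$.")],
    [("Hans", "NE"), ("kauft", "VVFIN"), ("den", "ART"), ("Roman", "NN"), (".", "$.")],
    [("Anna", "NE"), ("schreibt", "VVFIN"), ("einen", "ART"), ("Brief", "NN"), (".", "$.")],
    [("Er", "PPER"), ("lernt", "VVFIN"), ("schwimmen", "VVINF"), (".", "$.")],
]

# "Training": count how often each tag follows each other tag.
bigrams = Counter()
prev_counts = Counter()
for sentence in corpus:
    tags = [tag for _, tag in sentence]
    for prev, cur in zip(tags, tags[1:]):
        bigrams[(prev, cur)] += 1
        prev_counts[prev] += 1


def transition_prob(prev, cur):
    """Relative frequency of tag `cur` directly after tag `prev`."""
    if prev_counts[prev] == 0:
        return 0.0
    return bigrams[(prev, cur)] / prev_counts[prev]


def choose_tag(prev_tag, candidates):
    """Pick the candidate tag that most often follows `prev_tag`."""
    return max(candidates, key=lambda tag: transition_prob(prev_tag, tag))


# "einen" follows "liest" (VVFIN); its candidate tags are ART, VVINF and VVFIN.
best = choose_tag("VVFIN", ["ART", "VVINF", "VVFIN"])
print(best)
```

In this toy corpus VVFIN is followed by ART three times and by VVINF once, so ART is selected for einen.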

In unsupervised learning there is no prior training; instead, it is calculated from the sentences to be tagged themselves that, for example, einen often follows liest or lese ("reads", "read") but also often occurs at the end of a sentence. Den ("the"), by contrast, often follows liest or lese but never or rarely occurs at the end of a sentence. Lesen ("to read") often occurs at the end of a sentence and never after liest or lese. The tagger therefore creates one part of speech that includes, for example, den, and another that contains lesen. Einen belongs to both parts of speech. That it should be tagged like den in the given sentence follows from the same reasoning as for the tagger trained with supervised learning.
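The distributional reasoning above can be illustrated with a small sketch that records, for each word in untagged text, which word precedes it and whether it ends a sentence. The sentences are invented, the function name context_profiles is hypothetical, and a real unsupervised tagger would cluster many more context features statistically rather than compare sets by hand.

```python
from collections import defaultdict

# Untagged toy sentences, invented for illustration (no final punctuation).
sentences = [
    "ich lese den Roman",
    "Petra liest den Brief",
    "Petra liest einen Roman",
    "ich lese einen Brief",
    "wir wollen lesen",
    "sie können lesen",
    "wir wollen uns einen",
]


def context_profiles(sentences):
    """Map each word to the set of contexts in which it was observed."""
    profiles = defaultdict(set)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            left = words[i - 1] if i > 0 else "<START>"
            profiles[word].add("left=" + left)
            if i == len(words) - 1:
                profiles[word].add("<END>")
    return profiles


profiles = context_profiles(sentences)

# Contexts that "einen" shares with "den" and with "lesen".
shared_with_den = profiles["einen"] & profiles["den"]
shared_with_lesen = profiles["einen"] & profiles["lesen"]
den_and_lesen = profiles["den"] & profiles["lesen"]

print(sorted(shared_with_den))
print(sorted(shared_with_lesen))
print(sorted(den_and_lesen))
```

Here einen shares the positions after liest and lese with den, shares the sentence-final position with lesen, while den and lesen have no context in common, which mirrors why einen ends up in both induced classes.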
