Classification in machine learning

A classifier (computer science) is an algorithm that assigns objects (e.g., documents) to predetermined categories according to their characteristics. The term "classifier" is usually used specifically for algorithms in which a learning phase ("training") precedes the classification of objects. This methodology is frequently used, for example, in web mining, bioinformatics, and robotics.


Training

In the learning phase, the classifier is presented with a set of training data. The following learning scenarios are distinguished:

  • Supervised learning: for each training example, the category is known.
  • Semi-supervised learning: only part of the training examples are categorized. Algorithmic approaches for this scenario include (among many others, some more sophisticated):
  • Self-training: the classifier assigns labels to the examples itself (a short sketch follows this list).
  • Co-training: two classifiers each label examples for the other.
  • Multi-view learning: two or more classifiers that were trained differently each label examples for the others.
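
Self-training can be made concrete with a short sketch. This is a minimal illustration, not a definitive implementation: the choice of base classifier (scikit-learn's LogisticRegression) and the confidence threshold are assumptions for the example.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def self_training(X_labeled, y_labeled, X_unlabeled,
                      threshold=0.95, max_rounds=10):
        """Repeatedly train a classifier and move its most confident
        predictions on unlabeled data into the training set."""
        X, y, pool = X_labeled, y_labeled, X_unlabeled
        for _ in range(max_rounds):
            if len(pool) == 0:
                break
            clf = LogisticRegression().fit(X, y)
            proba = clf.predict_proba(pool)   # class probabilities per example
            confident = proba.max(axis=1) >= threshold
            if not confident.any():
                break                         # nothing is confident enough: stop
            # the classifier assigns the examples itself: adopt its own labels
            X = np.vstack([X, pool[confident]])
            y = np.concatenate([y, clf.classes_[proba[confident].argmax(axis=1)]])
            pool = pool[~confident]
        return LogisticRegression().fit(X, y)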

Learning methods

Classifiers can learn in very different ways and make different use of their training examples. The choice of algorithm depends on the type of training data as well as on the available computing resources.

Approaches to learning include, for example:

  • Learning as search
  • Generalization with bias (the learner has certain "preferences" and favors certain classes)
  • Divide-and-conquer strategies
  • Overfitting avoidance: the model is generalized to reduce the risk of errors (e.g., Ockham's razor)

Evaluation

The following methods are used to evaluate the learned model:

  • Validation by experts
  • Validation on data
  • Online validation

Validation on data has the disadvantage that training data are "wasted" on validation (out-of-sample testing). This dilemma can be resolved by cross-validation: the data are divided into n partitions; then, for each partition p, the other n - 1 partitions are used for learning and partition p is used for testing.
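
A minimal sketch of this scheme, assuming generic `train(X, y) -> model` and `evaluate(model, X, y) -> score` callables (both placeholders for the example):

    import numpy as np

    def cross_validate(X, y, train, evaluate, n=10):
        """n-fold cross-validation: each partition serves once as the test
        set while the remaining n - 1 partitions are used for learning."""
        indices = np.arange(len(X))        # in practice, shuffle first
        folds = np.array_split(indices, n)
        scores = []
        for p in range(n):
            train_idx = np.concatenate([folds[q] for q in range(n) if q != p])
            model = train(X[train_idx], y[train_idx])
            scores.append(evaluate(model, X[folds[p]], y[folds[p]]))
        return np.mean(scores)             # average score over all n folds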

To evaluate a classifier, three measures are commonly given, based on the following table (the confusion matrix):

                            actually positive      actually negative
    classified positive     true positives (TP)    false positives (FP)
    classified negative     false negatives (FN)   true negatives (TN)

Accuracy

Accuracy is the proportion of correctly classified examples: (TP + TN) / (TP + FP + FN + TN).

Recall

Recall is the proportion of actually positive examples that were classified as positive: TP / (TP + FN).

Precision

Precision is the proportion of examples classified as positive that are actually positive: TP / (TP + FP).
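
All three measures follow directly from the counts in the confusion matrix above; a minimal sketch:

    def accuracy(tp, fp, fn, tn):
        # proportion of correctly classified examples
        return (tp + tn) / (tp + fp + fn + tn)

    def recall(tp, fn):
        # proportion of actually positive examples classified as positive
        return tp / (tp + fn)

    def precision(tp, fp):
        # proportion of positively classified examples that are actually positive
        return tp / (tp + fp)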

Additional features (feature engineering)

Approaches to extending the feature set of a classifier can be broken down as follows:

  • Text-dependent features:
      • n-grams: exploit the context by using sequences of length n instead of individual words, thereby distinguishing, for example, "mining" in "web mining" from "coal mining" (a short sketch follows this list)
      • Position information
      • Stemming: reduce inflected words to their root form
      • Noun phrases: restrict n-grams to "real" phrases (for example, only bigrams such as noun-noun or adverb-noun)
      • Linguistic phrases: find all occurrences of a syntactic template (e.g., the template "noun - auxiliary verb" matches "I am")
  • Structural markup
  • Hypertext
  • Feature subset selection (FSS):
      • Frequency-based: TF-IDF (unsupervised FSS)
      • Machine learning: filter and wrapper approaches
  • Latent Semantic Indexing: addresses the problems of ambiguity and synonymy by projecting the examples onto a lower-dimensional space obtained through singular value decomposition
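
To make two of these feature types concrete, the following sketch extracts word n-grams and computes TF-IDF weights. The whitespace tokenization and the particular TF-IDF variant (relative term frequency times logarithmic inverse document frequency) are assumptions for the example; many variants exist.

    import math
    from collections import Counter

    def ngrams(tokens, n=2):
        """Word n-grams: n = 2 distinguishes "web mining" from "coal mining"."""
        return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def tf_idf(documents):
        """TF-IDF weights for a list of token lists (one common variant)."""
        N = len(documents)
        # document frequency: in how many documents does each term occur?
        df = Counter(term for doc in documents for term in set(doc))
        weights = []
        for doc in documents:
            tf = Counter(doc)
            weights.append({term: (count / len(doc)) * math.log(N / df[term])
                            for term, count in tf.items()})
        return weights

    docs = [ngrams("a short text about web mining".split()),
            ngrams("a short text about coal mining".split())]
    print(tf_idf(docs))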

Use of document references

Documents usually do not stand alone but are related to one another, in particular through links. Classifiers can exploit the information contained in these links and in the network structure.

Hubs & Authorities

See also the article Hubs and Authorities.

  • Authorities are pages that contain a lot of information on a given topic
  • Hubs are pages that contain many links to good authorities

This results in a mutual reinforcement: good hubs point to good authorities, and good authorities are pointed to by good hubs. Hub and authority scores are therefore computed iteratively, with a normalization step after each update, until the method converges.
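
A minimal sketch of this iteration, assuming the link graph is given as a dictionary mapping each page to the list of pages it links to:

    import math

    def hits(links, iterations=50):
        """HITS: iteratively update hub and authority scores and normalize.
        `links` maps each page to the list of pages it links to."""
        pages = set(links) | {q for targets in links.values() for q in targets}
        hub = {p: 1.0 for p in pages}
        auth = {p: 1.0 for p in pages}
        for _ in range(iterations):
            # a page's authority score is the sum of its predecessors' hub scores
            auth = {p: sum(hub[q] for q in pages if p in links.get(q, ()))
                    for p in pages}
            # a page's hub score is the sum of its successors' authority scores
            hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
            # normalize so the scores do not grow without bound
            for scores in (auth, hub):
                norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
                for p in scores:
                    scores[p] /= norm
        return hub, auth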

Problems:

  • Efficiency
  • Irrelevant links (e.g., advertising links)
  • Mutual reinforcement between hosts
  • Drifting away from the topic

Improvements:

  • Improved link analysis: normalize the scores by the number of links (multiple links from the same host are weighted less)
  • Relevance weighting: documents with insufficient similarity to an average document of the root set are not considered

PageRank

See also the article PageRank.

The idea behind PageRank is the "random surfer" who, with probability d, follows a random link on the current page and, with probability 1 - d, jumps to a random page.

PageRank therefore prefers pages with the following properties (a sketch of the computation follows the list):

  • Many inbound links
  • Predecessors with a high PageRank
  • Predecessors with few outgoing links
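
A minimal power-iteration sketch of this computation, assuming the same adjacency-list representation as above; the damping factor d = 0.85 and the fixed number of iterations are conventional choices, not prescribed by the source:

    def pagerank(links, d=0.85, iterations=50):
        """PageRank by power iteration: `links` maps each page to the
        list of pages it links to."""
        pages = set(links) | {q for targets in links.values() for q in targets}
        N = len(pages)
        pr = {p: 1.0 / N for p in pages}   # start with a uniform distribution
        for _ in range(iterations):
            new = {}
            for p in pages:
                # each predecessor passes on its rank, divided by its out-degree
                incoming = sum(pr[q] / len(targets)
                               for q, targets in links.items() if p in targets)
                # with probability 1 - d the surfer jumps to a random page instead
                new[p] = (1 - d) / N + d * incoming
            pr = new
        return pr

    # example: page "a" links to "b" and "c", both link back to "a"
    print(pagerank({"a": ["b", "c"], "b": ["a"], "c": ["a"]}))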

Hypertext Classification

Websites frequently contain little text and/or many images, are written in a different language, contain misleading terms, or contain no useful information at all. The anchor texts of links, however, are written by several authors, contain a richer vocabulary and redundant information, address different aspects of a page, and focus on its important content. They are therefore well suited for classifying networked documents. The best results are obtained when both the classes of the predecessor pages and the text are considered when classifying a page.

Hyperlink ensembles

  • The text of the page itself is not considered
  • Each link pointing to the page is a separate training example
  • The anchor text of the link, the heading, and the paragraph in which the link occurs are extracted
  • The training examples (one per link) are classified, and from these the classification of the current document is computed, as sketched below
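
A sketch of this ensemble scheme. The `text_classifier.classify` method and the `anchor_text`, `heading`, and `paragraph` attributes are hypothetical interfaces assumed for the example; the combination step here is a simple majority vote.

    from collections import Counter

    def classify_by_links(incoming_links, text_classifier):
        """Classify a page purely from the links pointing to it: each link
        (its anchor text, heading, and surrounding paragraph) is one example,
        and the page's class is the majority vote over all examples."""
        votes = [text_classifier.classify(" ".join(
                     [link.anchor_text, link.heading, link.paragraph]))
                 for link in incoming_links]
        return Counter(votes).most_common(1)[0][0]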

Sources

J. Fürnkranz: Web Mining - Data Mining the Web. http://www.ke.informatik.tu-darmstadt.de/lehre/ss05/web-mining.html
