Classification in machine learning

A classifier (in computer science) is an algorithm that assigns objects (e.g., documents) to predefined categories according to their characteristics. The term "classifier" is usually used specifically for algorithms in which the classification of objects is preceded by a learning phase ("training"). This methodology is often used, for example, in web mining, bioinformatics, and robotics.



In the learning phase, the classifier is provided with a set of training data. The following learning scenarios are distinguished:

  • Supervised learning: the category of every training example is known.
  • Semi-supervised learning: only some of the training examples are categorized. Algorithmic approaches for this scenario include (among many others, some more sophisticated): self-training, in which the classifier labels the unlabeled examples itself.
  • Co-training: two classifiers each label examples for the other.
  • Multi-view learning: two or more classifiers that were trained differently each label examples for the others.
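The self-training scenario above can be sketched with a toy nearest-centroid classifier; the function names and the 2-D point format are illustrative, not from the source:

```python
import math

def centroid(points):
    # Mean of a list of 2-D points.
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def nearest_label(x, centroids):
    # Assign x to the class whose centroid is closest (Euclidean distance).
    return min(centroids, key=lambda c: math.dist(x, centroids[c]))

def self_train(labeled, unlabeled, rounds=3):
    """Self-training: repeatedly fit on the labeled pool, then move an
    unlabeled example, labeled by the classifier itself, into the pool."""
    labeled = {c: list(pts) for c, pts in labeled.items()}
    unlabeled = list(unlabeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        cents = {c: centroid(pts) for c, pts in labeled.items()}
        x = unlabeled.pop()
        labeled[nearest_label(x, cents)].append(x)
    return labeled
```

A real self-training loop would typically only accept the classifier's most confident predictions per round; this sketch accepts every prediction for brevity.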

Learning methods

Classifiers can learn in very different ways and make very different use of their training examples. The choice of algorithm depends on the type of the training data as well as on the available computing resources. The following learning methods are used:

Other approaches to learning include, for example:

  • Learning as Search
  • Generalization with bias (the learner has certain "preferences" and favors some classes over others)
  • Divide and conquer strategies
  • Overfitting avoidance: the model generalizes in order to reduce the risk of errors (e.g., Occam's razor)


For the evaluation of the learned model, the following methods are used:

  • Validation by experts
  • Validation on data
  • Online validation

Validation on data has the disadvantage that training data are "wasted" on validation (out-of-sample testing). This dilemma can be resolved by cross-validation: the data are divided into n parts; then, for each partition p, the other n - 1 partitions are used for learning and partition p for testing.
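The cross-validation scheme just described can be sketched as follows (the round-robin partitioning is one simple choice among several):

```python
def cross_validation_splits(data, n):
    """Yield (train, test) pairs for n-fold cross-validation: each of the
    n partitions serves once as the test set while the other n - 1
    partitions are used for training."""
    folds = [data[i::n] for i in range(n)]   # simple round-robin partitioning
    for p in range(n):
        test = folds[p]
        train = [x for q, fold in enumerate(folds) if q != p for x in fold]
        yield train, test
```

Each example appears in exactly one test fold, so every data point is used for both training and testing across the n rounds.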

To evaluate a classifier, three measures are standardly reported. They are defined in terms of the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN):

Accuracy

The accuracy is the proportion of correctly classified examples: (TP + TN) / (TP + TN + FP + FN).

Recall

The recall is the proportion of positive examples that were classified as positive: TP / (TP + FN).

Precision

The precision is the proportion of examples classified as positive that are actually positive: TP / (TP + FP).
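The three measures can be computed directly from paired predictions and true labels; this sketch uses booleans, with True as the positive class:

```python
def evaluate(predicted, actual):
    """Accuracy, recall, and precision from paired predicted/actual
    labels (True = positive class)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    return {
        "accuracy": (tp + tn) / len(actual),   # share of correct classifications
        "recall": tp / (tp + fn),              # share of positives that were found
        "precision": tp / (tp + fp),           # share of positive calls that are right
    }
```

Note that recall and precision are undefined (division by zero) when there are no actual or no predicted positives; a robust implementation would handle those cases explicitly.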

Additional features (feature engineering)

Approaches to extending the feature set of a classifier can be broken down as follows:

  • Text-dependent features, n-grams: exploit the context by using sequences of n words instead of individual words (thereby distinguishing, for example, "mining" in "web mining" from "mining" in "coal mining")
  • Position information
  • Stemming: reduce inflected word forms to their root word
  • Noun phrases: restrict n-grams to "real" phrases (for example, only bigrams following certain patterns such as noun-noun)
  • Linguistic phrases: find all occurrences of a syntactic template (e.g., the template "noun + auxiliary verb" matches "I am")
  • Structural markup
  • Hypertexts
  • Frequency based
  • TF-IDF (unsupervised feature subset selection)
  • Machine Learning
  • Filter and wrapper
  • Latent Semantic Indexing: addresses the problems of ambiguity and synonymy by mapping the examples into a lower-dimensional concept space by means of singular value decomposition
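The TF-IDF weighting mentioned above can be sketched in a few lines; the log-based IDF variant shown here is one common formulation among several:

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF weights for a list of tokenized documents: term frequency
    times the log of the inverse document frequency, so terms that occur
    in many documents are down-weighted."""
    n = len(docs)
    df = Counter()                 # in how many documents each term occurs
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights
```

A term that occurs in every document (such as a stop word) gets weight log(1) = 0 and is thus effectively discarded as a feature.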

Use of document references

Documents usually do not stand alone but are related to one another, in particular through links. Classifiers can exploit the information contained in these links and in the network structure.

Hubs & Authorities

See also the article Hubs and Authorities

  • Authorities are pages that contain a lot of information on a topic
  • Hubs are pages that contain many links to good authorities

This results in a mutual reinforcement. By iteratively computing a hub score and an authority score for each page and normalizing after each step, the method converges. Problems:

  • Efficiency
  • Irrelevant links (eg advertising links )
  • Mutual reinforcement of hosts
  • Deviations from the topic


Possible improvements:

  • Improved link analysis: normalize the ratings by the number of links (multiple links from the same host are weighted less)
  • Relevance weighting: documents that have insufficient similarity to an average document of the root set are not considered
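The iterative hub/authority computation described above can be sketched as follows; the adjacency-dict representation and the fixed iteration count are illustrative choices:

```python
def hits(links, iterations=50):
    """HITS: a page's authority score is the sum of the hub scores of its
    predecessors; its hub score is the sum of the authority scores of its
    successors. Scores are normalized after every update.
    `links` maps page -> list of successor pages."""
    pages = set(links) | {q for succ in links.values() for q in succ}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update from predecessors' hub scores, then normalize.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, []))
                for p in pages}
        norm = sum(v * v for v in auth.values()) ** 0.5
        auth = {p: v / norm for p, v in auth.items()}
        # Hub update from successors' authority scores, then normalize.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth
```

On a small star graph, a page linked to by many hubs ends up with the highest authority score, while the pages linking to it become the strongest hubs.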

PageRank

See also the article PageRank.

The idea behind PageRank is the "random surfer", who with probability d follows a random link on the current page and otherwise jumps to a random page.

PageRank favors pages with:

  • Many inbound links
  • Predecessors with a high PageRank
  • Predecessors with few outgoing links
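The random-surfer model can be sketched as a power iteration; the damping factor d = 0.85 and the handling of dangling pages are conventional choices, not from the source:

```python
def pagerank(links, d=0.85, iterations=50):
    """PageRank via the random-surfer model: with probability d the surfer
    follows a random outgoing link; otherwise jumps to a random page.
    `links` maps page -> list of successor pages."""
    pages = set(links) | {q for succ in links.values() for q in succ}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - d) / n for p in pages}   # random-jump contribution
        for p in pages:
            out = links.get(p, [])
            if out:
                share = d * rank[p] / len(out)  # rank split among successors
                for q in out:
                    new[q] += share
            else:
                for q in pages:                 # dangling page: spread evenly
                    new[q] += d * rank[p] / n
        rank = new
    return rank
```

The update preserves the total rank mass of 1, and pages with many inbound links, or predecessors that are themselves highly ranked, accumulate the largest scores, matching the list above.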

Hypertext Classification

Websites frequently contain little text and/or many images; they may be written in a different language, contain misleading terms, or contain no useful information at all. Links, by contrast, are written by several authors, contain a richer vocabulary and redundant information, address different aspects of a page, and focus on the important content of pages. They are therefore very well suited for the classification of networked documents. The best results are obtained by considering both the classes of the predecessor pages and the text of the page itself.

Hyperlink ensembles

  • The page's own text is not considered
  • Each link to the page is a separate training example
  • Extract the link's anchor text as well as the heading and the paragraph in which the link occurs
  • Classify the training examples (one per link) and combine their predictions into a classification for the current document
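The final combination step can be sketched as a simple majority vote over the per-link predictions; the vote rule is one plausible combination scheme, not necessarily the one used in the source:

```python
from collections import Counter

def classify_by_links(link_predictions):
    """Hyperlink ensemble combination: each incoming link yields one class
    prediction (e.g., from its anchor text and surrounding paragraph);
    the document's class is the majority vote over those predictions."""
    votes = Counter(link_predictions)
    return votes.most_common(1)[0][0]
```

Weighted votes (e.g., by the confidence of each per-link classifier) would be a natural refinement of this rule.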

