Boosting (machine learning)

Boosting (English " amplifying " ) is an algorithm of automatic classification, which merges several weak classifiers to form a single good classifier.

The basic concepts needed to understand are explained in the article classification.

They were introduced in 1990 by Robert Schapire and of Schapire and Yoav friend developed ( AdaBoost ).

Areas of application

Boosting can be used wherever an automatic classification is needed in two classes, for example, images of faces in "known" and " unknown " or " error" rate products on an assembly line as "OK " or. The applications are therefore almost as varied as the automatic classification itself.


Although there are far more sophisticated methods for designing classifiers, Boosting in many cases forms an acceptable alternative: The technique gives acceptable results and can be implemented in a computer program that is economical in memory consumption and quickly at runtime easily.


The default is a set of objects and a set of weak classifiers. Wanted is a classifier that classifies the objects as accurate as possible in two classes. Boosting combines the existing weak classifiers such that the resulting new classifier makes as few errors.

Weak classifiers, also base classifiers (English " Basisklassifikatoren " ) or weak learners (English " weak learner " ) called, are very simple and usually only consider a single characteristic of the objects. In itself, they provide poor results therefore on the one hand, but on the other hand can be evaluated very quickly. Boosting performs all the weak classifiers so together with a weighting that the stronger among the weak classifiers especially considered that are really weak, however, ignored.


Given a feature space of arbitrary dimension and M is a training sample T of size n, ie a set of pattern vectors x1, ..., xn. Of each of the pattern vectors is well known in which class it belongs to, that is, for each x i is yi ∈ { 1, -1 }, where, indicating in which of the two classes of 1 or -1 is one of the pattern vector. M further primitive classifiers f1, f2, ..., fm are: M → { 1, -1 }, where, the split in each of the feature space into two classes 1 and -1.

We are looking for the m weighting factors w1, ..., wm of the classifier F: M → { 1, -1}, by about the sign function

Is given. The weighting factors are to be optimized so that F makes as few errors.

For the optimization, provides about the exponential function e defined, so-called " exponential loss function " L as an optimization criterion:

L is smaller, the fewer objects F misclassified. The goal is therefore, the weighting factors to be chosen so that L is minimized.

This optimization is carried out stepwise over m, that is, initially, only W1 optimized, then W2, then W3, and so on, until all the weighting factors are optimum. The optimization is discussed in the next section.

Gradual optimization

The gradual optimization requires m cycles to optimize all weighting factors for F. In each run, a classifier Fs generated by the previously generated classifier FS-1, a weak classifier is added. This means that the user can cancel the calculation after each run, if the intermediate result already meets his standards.

Before each run is judged which sample vectors can be well classified with the classifier generated so far and which are not. Those pattern vectors that are not yet well classified are considered particularly strong in the next run. To in each pass sn auxiliary variable TS, 1, ..., ts, N is needed. The higher the value of ts, i, the stronger the pattern vector xi is included in the current passage.

The number of the pass is s:

Update 1 weights.

2 -weighted training error determined.

3 optimize neighbor weighting factor.

4 establish intermediate result.

These steps are repeated in that order until all weak classifiers have been considered, so is s = m, or the user cancels the progress.

Weak classifiers

Typical weak classifiers are called decision stumps (English " decision stumps "). These functions compare the value of a single coordinate j with a threshold l and thus justify their decision to 1 or -1. If x: = ( x1, ..., xd ) ∈ M is a pattern vector in the d -dimensional feature space M, then such a primitive classifier generally has the form f:

More specifically, the feature space f is divided by a hyper-plane into two classes.

The name alludes to the analogy with decision trees: The generated Gesamtklassifikator F can be viewed as a decision tree. Each weak classifier is an internal node of the tree on which a sub-tree ( cf. Engl stump. " ( Tree ) stump " ) setting. The final classification in one of the leaves of the tree is obtained as a sequence of binary decisions (English decision ).

Such decision stumps are the basis for Boosting very popular because they are easy to handle and can be evaluated extremely fast. They must also not be specified from the beginning, but can be created, while the algorithm is running.

Subspecies of Boosting

  • AdaBoost
  • AsymBoost
  • Brown boost
  • DiscreteAB
  • Float boost
  • GentleAB
  • GloBoost
  • KLBoost
  • LogitBoost
  • RealAB
  • Weight boost