Support Vector Machine

A Support Vector Machine (SVM) is a classifier (see classification). A support vector machine divides a set of objects into classes in such a way that the widest possible region around the class boundaries remains free of objects; it is therefore a so-called large margin classifier. Support vector machines can be used for both classification and regression.

Support Vector Machines are not machines in the traditional sense, that is, they do not consist of tangible components. They are a purely mathematical method of pattern recognition that is implemented in computer programs. The word "machine" in the name accordingly does not refer to a physical machine but to the field of origin of support vector machines, machine learning.

Basic operation

The basis for constructing a support vector machine is a set of training objects, for each of which it is known which class it belongs to. Each object is represented by a vector in a vector space. The task of the support vector machine is to fit a hyperplane into this space that acts as a separating surface and divides the training objects into two classes. In doing so, the distance of those vectors that lie closest to the hyperplane is maximized. This wide, empty margin is meant to ensure that later even objects that do not correspond exactly to the training objects are classified as reliably as possible.

When placing the hyperplane, it is not necessary to consider all training vectors. Vectors that lie further away from the hyperplane and are, to some extent, "hidden" behind a front of other vectors do not influence the position and orientation of the separating plane. The hyperplane depends only on the vectors lying closest to it, and only these are needed to describe the plane mathematically exactly. These closest vectors are called support vectors after their function, and they gave support vector machines their name.

A hyperplane cannot be "bent", so a clean separation with a hyperplane is only possible when the objects are linearly separable. In real applications this is generally not the case. For data that are not linearly separable, support vector machines use the kernel trick to introduce a non-linear class boundary.

The idea behind the kernel trick is to map the vector space, and with it the training vectors it contains, into a higher-dimensional space. In a space with a sufficiently high number of dimensions (in case of doubt, infinitely many), even the most intricately nested set of vectors becomes linearly separable. In this higher-dimensional space the separating hyperplane is then determined. When transformed back into the lower-dimensional space, the linear hyperplane becomes a non-linear, possibly even non-contiguous hypersurface that cleanly separates the training vectors into two classes.

Two problems arise in this process: the transformation into the higher-dimensional space is extremely computationally expensive, and the representation of the separating surface in the low-dimensional space is in general enormously complex and therefore practically unusable. This is where the kernel trick comes in. If a suitable kernel function is used to describe the separating surface, one that describes the hyperplane in the high-dimensional space and yet remains "benign" in the low-dimensional space, the forward and backward transformation can be realized without ever having to carry it out computationally. Here again, a subset of the vectors, namely the support vectors, suffices to describe the class boundary completely.

Both linear and non-linear support vector machines can be made more flexible by additional slack variables. The slack variables allow the classifier to misclassify individual objects, but at the same time "punish" each such misclassification. In this way, overfitting is avoided on the one hand, and the required number of support vectors is reduced on the other.
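The following minimal sketch illustrates these ideas with scikit-learn (assumed to be available); the data set, kernels, and parameter values are chosen purely for illustration:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaved half-moons: a data set that is not linearly separable.
X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

# Linear SVM: slack variables (controlled by C) allow some misclassifications.
linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)

# RBF-kernel SVM: the kernel trick yields a non-linear class boundary.
rbf_svm = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)

# Only the support vectors are needed to describe the decision boundary.
print("linear SVM support vectors:", linear_svm.support_vectors_.shape[0])
print("RBF SVM support vectors:   ", rbf_svm.support_vectors_.shape[0])

# A smaller C tolerates more margin violations (more slack); a larger C
# penalizes violations more strongly.
soft_svm = SVC(kernel="rbf", C=0.1, gamma=1.0).fit(X, y)
```

The parameter C in this sketch corresponds to the positive constant discussed in the section on non-linearly separable data below.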

Mathematical implementation

From a set of training examples, the SVM determines a hyperplane that separates the two classes from each other in such a way that the smallest distance to the hyperplane, the so-called margin, is maximized for the examples of both classes, in order to guarantee the best possible generalization of the classifier. The class label of each training example indicates which class it belongs to. The so-called training computes the hyperplane that separates the training examples of the two classes as well as possible; it is then used as a decision function. The hyperplane is given by a normal vector and a so-called bias. A training example is assigned the sign of the decision function as its class.
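With normal vector $w$, bias $b$, and a sample $x$ (standard symbols, used here for concreteness), the decision function can be written as

\[
f(x) = \operatorname{sgn}\bigl(\langle w, x\rangle + b\bigr).
\]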

Depending on which side of the hyperplane a sample lies (above or below), the argument of the sign function takes a positive or negative value, of which only the sign remains after the sign function is applied. For examples that lie exactly on the separating plane, the value becomes 0.

Linearly separable data

Many learning algorithms work with a linear function in the form of a hyperplane. If two classes of examples are separable by a hyperplane, i.e. linearly separable, there are usually infinitely many hyperplanes that separate the two classes from each other. The SVM differs from other learning algorithms in that, of all possible separating hyperplanes, it selects the one whose normal vector has minimum squared norm, subject to the condition that $y_i(\langle w, x_i\rangle + b) \ge 1$ holds for every training example. Under this normalization the margin equals $1/\lVert w\rVert$, so this is equivalent to maximizing the minimum distance to the hyperplane (the margin). According to statistical learning theory, the class of all hyperplanes with a certain margin has lower complexity than the class of all hyperplanes with a smaller margin. From this, upper bounds on the expected generalization error of the SVM can be derived.

The optimization problem can thus be written as follows.
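Using the notation above, with training examples $(x_1, y_1), \dots, (x_m, y_m)$ and labels $y_i \in \{-1, +1\}$ (the standard hard-margin formulation), the problem is

\[
\min_{w,\,b}\ \tfrac{1}{2}\lVert w\rVert^2
\quad\text{subject to}\quad
y_i\bigl(\langle w, x_i\rangle + b\bigr) \ge 1
\quad\text{for all } i = 1, \dots, m.
\]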

Non-linearly separable data

Typically, the training examples are not strictly linearly separable. This may be due, among other things, to measurement errors in the data, or to the fact that the distributions of the two classes naturally overlap. In this case, the optimization problem is modified so that violations of the constraints are possible, but the violations are to be kept as small as possible. For this purpose, a positive slack variable is introduced for each constraint, whose value is exactly the amount by which the constraint is violated; a positive value of the slack variable therefore means that the corresponding constraint is violated. Since the sum of the violations is to be kept as small as possible, this sum of errors is added to the objective function and is thus minimized as well. In addition, the sum is multiplied by a positive constant that controls the trade-off between minimizing the norm of the normal vector and correctly classifying the training examples. The optimization problem then has the following form.
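With slack variables $\xi_i \ge 0$ and a trade-off constant $C > 0$ (standard symbols), the soft-margin problem reads

\[
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w\rVert^2 + C \sum_{i=1}^{m} \xi_i
\quad\text{subject to}\quad
y_i\bigl(\langle w, x_i\rangle + b\bigr) \ge 1 - \xi_i,\ \ \xi_i \ge 0
\quad\text{for all } i.
\]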

Dual problem

Both optimization criteria are convex and can be solved efficiently with modern methods. This simple optimization, together with the property that support vector machines largely avoid overfitting to the training data used to design the classifier, has helped the method to great popularity and a wide range of applications.

The optimization problem described above is usually solved in its dual form. This formulation is equivalent to the primal problem in the sense that all solutions of the dual problem are also solutions of the primal problem. The conversion rests on the fact that the normal vector can be written as a linear combination of the training examples.
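In the usual notation, with coefficients $\alpha_i \ge 0$ (the Lagrange multipliers of the dual problem below), this representation is

\[
w = \sum_{i=1}^{m} \alpha_i\, y_i\, x_i .
\]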

The dual form is derived using Lagrange multipliers and the Karush-Kuhn-Tucker conditions.
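For the soft-margin problem above it takes the standard form

\[
\max_{\alpha}\ \sum_{i=1}^{m} \alpha_i
\;-\; \tfrac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j\, y_i y_j\, \langle x_i, x_j\rangle
\quad\text{subject to}\quad
0 \le \alpha_i \le C, \quad \sum_{i=1}^{m} \alpha_i y_i = 0 .
\]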

From its solution one obtains the classification rule.
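Written in terms of the dual variables (with the bias $b$ as above), the rule is

\[
f(x) = \operatorname{sgn}\!\Bigl( \sum_{i=1}^{m} \alpha_i\, y_i\, \langle x_i, x\rangle + b \Bigr).
\]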

The SVM owes its name to a special subset of the training points, namely those whose Lagrange multipliers are non-zero. These are called support vectors and lie either on the margin (if the corresponding slack variable is zero) or inside the margin (if the slack variable is positive).

Nonlinear extension with kernel functions

The algorithm described above classifies the data using a linear function. However, this is only optimal if the underlying classification problem is itself linear. In many applications this is not the case. One possible way out is to map the data into a space of higher dimension.

Such a mapping takes the input space into a space of higher dimension and thereby increases the number of possible linear separations (Cover's theorem). SVMs are distinguished by the fact that this extension can be incorporated very elegantly: in the formulation of the underlying optimization problem shown last, the data points enter only through dot products. It is therefore possible to replace the dot product in the input space by a dot product in the higher-dimensional space and to compute it there directly. The cost of this computation can be reduced enormously if a positive definite kernel function is used instead.
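With a feature map $\varphi$ into the higher-dimensional space and a kernel $k$ (standard symbols, used here for illustration), the substitution is

\[
\langle x_i, x_j\rangle \;\longrightarrow\; \langle \varphi(x_i), \varphi(x_j)\rangle \;=\; k(x_i, x_j),
\]

so the kernel is evaluated directly in the input space and $\varphi$ never has to be computed explicitly. Widely used examples are the polynomial kernel $k(x, x') = (\langle x, x'\rangle + c)^p$ and the Gaussian (RBF) kernel $k(x, x') = \exp(-\gamma \lVert x - x'\rVert^2)$.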

In this way a hyperplane (i.e. a linear function) in a high-dimensional space is computed implicitly. The resulting classifier has the following form.
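In the notation used above, the kernelized decision function is

\[
f(x) = \operatorname{sgn}\!\Bigl( \sum_{i=1}^{m} \alpha_i\, y_i\, k(x_i, x) + b \Bigr),
\]

where the bias $b$ can be recovered from any support vector $x_j$ with $0 < \alpha_j < C$, for example as $b = y_j - \sum_{i} \alpha_i y_i\, k(x_i, x_j)$.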

By using kernel functions, SVMs can also operate on general structures such as graphs or strings and are therefore very versatile. Although the mapping implicitly uses a possibly infinite-dimensional space, SVMs still generalize very well: it can be shown that for maximum-margin classifiers the expected test error is bounded and does not depend on the dimensionality of the space.
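As a sketch of this generality (assuming scikit-learn is available; the toy word-overlap kernel is purely illustrative), an SVM can be trained on strings by passing a precomputed kernel matrix instead of feature vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Toy "documents": the SVM never sees feature vectors, only kernel values.
train_docs = ["spam spam offer", "cheap offer now", "meeting at noon", "lunch at noon"]
train_labels = np.array([1, 1, -1, -1])
test_docs = ["cheap spam offer", "noon meeting"]

def word_overlap_kernel(a, b):
    """Number of shared words; equals the dot product of binary bag-of-words
    vectors and is therefore positive semi-definite."""
    return float(len(set(a.split()) & set(b.split())))

def gram_matrix(docs_a, docs_b):
    return np.array([[word_overlap_kernel(a, b) for b in docs_b] for a in docs_a])

# Train on the Gram matrix of the training strings.
clf = SVC(kernel="precomputed", C=1.0)
clf.fit(gram_matrix(train_docs, train_docs), train_labels)

# Prediction needs the kernel values between test and training strings.
print(clf.predict(gram_matrix(test_docs, train_docs)))  # expected: [ 1 -1]
```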

History

The idea of separation by a hyperplane was first published in 1958 by Frank Rosenblatt. The idea of support vector machines goes back to the work of Vladimir Vapnik and Alexey Chervonenkis. On the theoretical level, the algorithm is motivated by the principle of structural risk minimization, which states that not only the training error but also the complexity of the model used determines the generalization ability of a classifier. SVMs had their breakthrough in the mid-1990s, and numerous further developments and modifications have been published in recent years.
