K-means clustering

The k-means algorithm is a method for vector quantization that is also used for cluster analysis. It partitions a set of objects into a previously known number k of groups of similar objects. The algorithm is one of the most widely used techniques for grouping objects because it finds the cluster centers quickly. It prefers clusters of low variance and similar size.

The algorithm has strong similarities with the expectation-maximization algorithm and is distinguished by its simplicity. Extensions include the k-median algorithm and the k-means++ algorithm.

  • k-median
  • k-means++
  • k-medoids (PAM)

Historical development

The term " k-means " was first used by MacQueen in 1967, but the idea goes back to Hugo Steinhaus 1957. The most popular class as " k- means algorithm " designated standard algorithm in 1957 proposed by Lloyd for pulse - code modulation, but not published until 1982 in a computer science journal and coincides largely with the method of Forgy the 1965 published. Another variant is the Hartigan and Wong, which avoids unnecessary distance calculations by using also the distance to the second closest center. An exact assignment of the algorithms is often done wrong: in particular, often the algorithm of Lloyd / Forgy is described, however, as the source MacQueen.

Problem

The goal of k-means is to partition the data set into k partitions such that the sum of the squared deviations from the cluster centroids is minimal. Mathematically, this corresponds to optimizing the function

    J = \sum_{i=1}^{k} \sum_{x_j \in S_i} \left\| x_j - \mu_i \right\|^2

with the data points x_j and the centroids \mu_i of the clusters S_i. This objective function is based on the least-squares method, and the procedure is also called clustering by variance minimization, since the sum of the within-cluster variances is minimized. Since \| x_j - \mu_i \|^2 is moreover the squared Euclidean distance, k-means effectively assigns each object to the nearest cluster centroid (according to the Euclidean distance). Conversely, the arithmetic mean is a least-squares estimator, so it also optimizes this criterion.

Algorithms

Since the search for the optimal solution is difficult (NP-hard), an approximate algorithm such as the heuristics of Lloyd or MacQueen is normally used. Since the problem depends on k, this parameter must be set by the user. However, there are also approaches for choosing it by means of a second criterion (see X-means, AIC, BIC, and the silhouette coefficient).

Lloyd algorithm

The most commonly used k-means algorithm is Lloyd's algorithm, which is often referred to as "the" k-means algorithm, although Lloyd did not use this name. Lloyd's algorithm consists of three steps:

  1. Initialization: choose k initial cluster centers, for example at random from the data set.
  2. Assignment: assign each object to the cluster center nearest to it.
  3. Update: recompute each cluster center as the centroid of the objects assigned to it.

Steps 2 and 3 are repeated until the assignments no longer change.
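
The following is a minimal Python/numpy sketch of these three steps; the function name lloyd_kmeans and its parameters are illustrative, not taken from the text.

    import numpy as np

    def lloyd_kmeans(X, k, max_iter=100, seed=0):
        # Step 1: initialization - pick k distinct data points as centers.
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
        labels = None
        for _ in range(max_iter):
            # Step 2: assignment - each point goes to its nearest center
            # (squared Euclidean distance).
            dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            new_labels = dists.argmin(axis=1)
            if labels is not None and np.array_equal(new_labels, labels):
                break  # assignments no longer change: done
            labels = new_labels
            # Step 3: update - move each center to the mean of its points.
            for j in range(k):
                members = X[labels == j]
                if len(members) > 0:  # keep the old center if a cluster empties
                    centers[j] = members.mean(axis=0)
        return labels, centers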

MacQueen's algorithm

MacQueen introduced the term "k-means" together with a different algorithm: the first k elements of the data set serve as initial cluster centers; every remaining object is then assigned to the nearest center, and that center is immediately updated to the running mean of the objects assigned to it so far.

While this was, presumably, not originally intended, the algorithm can also be iterated to obtain a better result.
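
A one-pass numpy sketch of this online update (function name illustrative); iterating it, as mentioned above, would simply mean re-running the pass:

    import numpy as np

    def macqueen_kmeans(X, k):
        centers = X[:k].astype(float)  # the first k objects seed the centers
        counts = np.ones(k)
        for x in X[k:]:
            j = ((centers - x) ** 2).sum(axis=1).argmin()  # nearest center
            counts[j] += 1
            centers[j] += (x - centers[j]) / counts[j]     # incremental mean
        return centers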

Variations

  • k-means++ tries to find better starting points.
  • The filtering algorithm uses a k-d tree as its data structure.
  • The k-means algorithm can be accelerated by exploiting the triangle inequality.
  • Bisecting k-means starts with k = 2 and then repeatedly splits the largest cluster until the desired k is reached.
  • X-means starts with a small k and increases it until a secondary criterion (AIC or BIC) no longer improves.

Requirements

k-means optimizes the squared deviations from the mean. It can therefore only be used with numerical attributes for which a meaningful mean can be calculated. Categorical attributes (e.g., "car", "truck", "bicycle") cannot be used, since no mean can be calculated for them.

The parameter k, the number of clusters, must be known in advance. However, it can also be determined experimentally. The problem is that the different clusterings have to be compared with each other, while the cost function decreases monotonically with increasing k. One solution is the silhouette coefficient, which provides an evaluation of clusterings that is independent of k. It not only examines how far a point is from its own cluster centroid, but also takes the distances to the other cluster centroids into account.
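
As an illustration of this experimental approach, the following sketch uses scikit-learn (an assumed dependency, not mentioned in the text) to pick the k with the highest mean silhouette coefficient:

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    def choose_k_by_silhouette(X, k_values=range(2, 11)):
        best_k, best_score = None, -1.0
        for k in k_values:
            labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
            score = silhouette_score(X, labels)  # in [-1, 1]; higher is better
            if score > best_score:
                best_k, best_score = k, score
        return best_k, best_score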

The clusters in the data set must be of roughly the same size, because the algorithm always partitions the data set midway between two cluster centers.

The data set must not contain much noise or many outliers. Erroneous data objects often shift the computed cluster centers considerably, and the algorithm has no provisions against such effects (compare DBSCAN, which explicitly models "noise" objects).

Problems

k-means is a powerful algorithm, but not one without flaws. A k-means algorithm does not necessarily find the best possible solution. The solution found depends strongly on the chosen starting points. The simplest approach is to start the algorithm several times with different starting values and take the best solution. However, there are many strategies for obtaining a suitable distribution of the initial values. These include k-means++, but also drawing a small sample, from which the cluster centers can be approximated before k-means starts. It also makes a difference whether one chooses arbitrary cluster centers or assigns each point to a random cluster and then determines the cluster centers.

A further disadvantage is that the number of cluster centers k is chosen in advance. With an unsuitable k, completely different, sometimes unintuitive solutions can arise; with a "wrong" k, no good clustering can be achieved. The solution is to try different values for k and then choose a suitable one, for example by means of the silhouette coefficient, or by comparing the costs of the different clusterings.

Groups in the data may, as in the iris example, overlap and blend seamlessly into one another. In such a case, k-means cannot reliably separate these groups, because the data do not follow the cluster model being used.

In addition, k-means always searches for convex clusters (due to the minimization of the distance to the cluster center). Other algorithms, such as DBSCAN, can also find arbitrarily shaped, "density-based" clusters. Likewise, k-means does not support hierarchical clusters (i.e., clusters that in turn have a cluster structure), which can be found, for example, with OPTICS.

Finally, k-means assigns every point to a cluster, so there is no way to recognize outliers. These can distort the results considerably. Prior noise reduction can remedy this, as can other algorithms such as DBSCAN, which detect noise automatically.

Extensions

k-median

In the k-median algorithm, the Manhattan distance is used in the assignment step instead of the Euclidean distance. In the update step, the median is used instead of the mean.
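
A numpy sketch of one such iteration (illustrative names; only the two steps that differ from k-means are shown):

    import numpy as np

    def kmedian_step(X, centers):
        # Assignment step: Manhattan (L1) distance instead of Euclidean.
        dists = np.abs(X[:, None, :] - centers[None, :, :]).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: per-coordinate median instead of the mean.
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members) > 0:
                new_centers[j] = np.median(members, axis=0)
        return labels, new_centers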

k-means++

The k-means++ algorithm does not select the initial cluster centers at random but according to the following rule: the first center is drawn uniformly at random from the data points; each further center is drawn from the remaining data points with a probability proportional to its squared distance to the nearest center chosen so far. A sketch of this seeding rule follows below.

With this seeding, the subsequent k-means algorithm generally converges in a few steps. The results are as good as those of a standard k-means algorithm, but the algorithm is typically almost twice as fast as standard k-means.
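
A numpy sketch of the seeding rule (function name illustrative); the resulting centers would then be handed to Lloyd's algorithm:

    import numpy as np

    def kmeanspp_init(X, k, seed=0):
        rng = np.random.default_rng(seed)
        centers = [X[rng.integers(len(X))]]  # first center: uniform at random
        for _ in range(k - 1):
            # Squared distance of every point to its nearest chosen center.
            d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
            # Draw the next center with probability proportional to d2.
            centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
        return np.array(centers)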

k-medoids (PAM)

The algorithm PAM (Partitioning Around Medoids; Kaufman and Rousseeuw, 1990), also known as k-medoids, can be interpreted as a variant of the k-means algorithm that converges with arbitrary distances.

In the original version of PAM, the first step, the choice of the initial medoids, makes up a large part of the algorithm. Since in each iteration only the best swap is performed, the algorithm is almost deterministic (up to exactly equal distances). As a consequence, the algorithm is usually also very slow.

While k-means minimizes the sum of the variances, k-medoids minimizes the sum of the distances. In particular, this algorithm can be used with arbitrary distance functions, and convergence is still guaranteed.
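
A compact sketch of PAM's swap phase, assuming a precomputed (n, n) distance matrix D; for brevity the initial medoids are drawn at random here, which replaces (and simplifies away) PAM's original greedy BUILD phase:

    import numpy as np

    def pam_swap(D, k, max_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        n = len(D)
        medoids = rng.choice(n, size=k, replace=False)

        def cost(m):
            # Sum of distances from every object to its nearest medoid.
            return D[:, m].min(axis=1).sum()

        best_cost = cost(medoids)
        for _ in range(max_iter):
            best_swap = None
            # Evaluate every (medoid, non-medoid) swap; keep only the best.
            for i in range(k):
                for o in range(n):
                    if o in medoids:
                        continue
                    trial = medoids.copy()
                    trial[i] = o
                    c = cost(trial)
                    if c < best_cost:
                        best_swap, best_cost = (i, o), c
            if best_swap is None:
                return medoids, best_cost  # no improving swap: converged
            medoids[best_swap[0]] = best_swap[1]
        return medoids, best_cost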

Example

The following images show an example run of the k-means algorithm for determining three groups:

Application in image processing

In image processing, the k-means algorithm is often used for segmentation. Since the Euclidean distance is often insufficient as a distance measure, other distance functions based on pixel intensities and pixel coordinates can be used. The results are used to separate foreground and background and for object detection. The algorithm is widely used and implemented in major image processing libraries such as OpenCV and ITK.
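
A short sketch of color-based segmentation with OpenCV's cv2.kmeans (the file names are placeholders):

    import cv2
    import numpy as np

    img = cv2.imread("input.png")                    # placeholder file name
    pixels = img.reshape(-1, 3).astype(np.float32)   # cv2.kmeans needs float32 rows

    # Stop after 10 iterations or once centers move by less than 1.0.
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
    _, labels, centers = cv2.kmeans(pixels, 3, None, criteria, 5,
                                    cv2.KMEANS_PP_CENTERS)

    # Paint every pixel with the color of its cluster center.
    segmented = centers[labels.flatten()].astype(np.uint8).reshape(img.shape)
    cv2.imwrite("segmented.png", segmented)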

Software

K-means and its variants are available in various open-source software packages.

  • ELKI contains the variants of Lloyd and MacQueen, along with different strategies for the initial values such as k-means++, and variants of the algorithm such as k-medians, k-medoids, and PAM.
  • GNU R contains the variants of Hartigan, Lloyd, and MacQueen, and additional variations in the extension package "flexclust".
  • OpenCV provides a version of k-means optimized for image processing.
  • Weka contains k-means (including k-means++ seeding) and the x-means extension.