Silhouette (clustering)

The silhouette indicates, for a single observation, how good its assignment is with respect to the two closest clusters. The silhouette coefficient is a measure of the quality of a clustering that is independent of the number of clusters. The silhouette plot visualizes all silhouettes of a data set together with the silhouette coefficient of each cluster and of the whole data set.

Silhouette

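If the object $o$ belongs to the cluster $A$, the silhouette $S(o)$ of $o$ is defined as:

$$S(o) = \frac{\mathrm{dist}(B,o) - \mathrm{dist}(A,o)}{\max\{\mathrm{dist}(A,o),\ \mathrm{dist}(B,o)\}}$$

Here $\mathrm{dist}(A,o)$ is the distance of the object $o$ to its own cluster $A$, and $\mathrm{dist}(B,o)$ is the distance of $o$ to the nearest other cluster $B$. The difference of the two distances is weighted by the larger of them. It follows that the silhouette $S(o)$ of an object lies between -1 and 1: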

  • If the silhouette $S(o) < 0$, the objects of the nearest cluster are closer to the object than the objects of the cluster to which it belongs. This indicates that the clustering can be improved.
  • If the silhouette $S(o) \approx 0$, the object lies between two clusters.
  • If the silhouette $S(o)$ is close to 1, the object lies well inside its cluster.

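The distance $\mathrm{dist}(A,o)$ of an object $o$ to its own cluster $A$ is calculated as

$$\mathrm{dist}(A,o) = \frac{1}{n_A} \sum_{a \in A} \mathrm{dist}(a,o)$$

that is, as the average distance between the object $o$ and all objects $a$ in the cluster $A$ ($n_A$ is the number of objects in the cluster $A$). Similarly, the distance to the nearest cluster $B$ is calculated as the minimum average distance

$$\mathrm{dist}(B,o) = \min_{C \neq A} \mathrm{dist}(C,o)$$

The distance $\mathrm{dist}(C,o)$ is calculated for every cluster $C$ that does not contain the object $o$. The nearest cluster $B$ is the one with the smallest such distance.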
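The definitions above can be sketched in a few lines of Python. This is only an illustration, not a reference implementation: the function name `silhouettes`, the use of the Euclidean distance, and the sample points are assumptions. The average distance to the object's own cluster (written dist(A, o)) includes the object itself, as in the text; some libraries exclude it instead. Returning 0 when both distances are 0 is likewise an assumption made to avoid a division by zero.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def silhouettes(points, labels):
    """Return the silhouette S(o) of every object, in input order."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouettes need at least two clusters")

    def mean_dist(cluster, o):
        # dist(C, o): average distance from o to all objects of cluster C
        return sum(dist(c, o) for c in cluster) / len(cluster)

    result = []
    for o, lab in zip(points, labels):
        d_own = mean_dist(clusters[lab], o)          # dist(A, o)
        d_near = min(mean_dist(members, o)           # dist(B, o)
                     for other, members in clusters.items() if other != lab)
        denom = max(d_own, d_near)
        # Assumption: S(o) = 0 if both distances vanish
        result.append(0.0 if denom == 0 else (d_near - d_own) / denom)
    return result

# Made-up example: two tight, well separated pairs of points
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels = [0, 0, 1, 1]
sil = silhouettes(points, labels)  # all four values are about 0.90
```

Moving the second point into the far cluster, for example with the labels `[0, 1, 1, 1]`, gives that point a negative silhouette, which matches the first case in the list above.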

Silhouette coefficient

The silhouette coefficient $s_C$ of a cluster $C$ is defined as

$$s_C = \frac{1}{n_C} \sum_{o \in C} S(o)$$

that is, as the arithmetic mean of the silhouettes of all $n_C$ objects in the cluster. The silhouette coefficient can be calculated for each cluster or for the whole data set.
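As a small illustration of this averaging, the following sketch computes the coefficient for each cluster and for the whole data set from silhouettes that are already known. The function name `silhouette_coefficients` and the silhouette values are made up for the example.

```python
from collections import defaultdict

def silhouette_coefficients(sil, labels):
    """Return (mean silhouette per cluster, mean silhouette of all objects)."""
    by_cluster = defaultdict(list)
    for s, lab in zip(sil, labels):
        by_cluster[lab].append(s)
    per_cluster = {lab: sum(v) / len(v) for lab, v in by_cluster.items()}
    overall = sum(sil) / len(sil)
    return per_cluster, overall

# Made-up silhouettes of four objects in two clusters "a" and "b"
sil = [0.8, 0.6, 0.4, -0.2]
labels = ["a", "a", "b", "b"]
per_cluster, overall = silhouette_coefficients(sil, labels)
# per_cluster is roughly {"a": 0.7, "b": 0.1}; overall is roughly 0.4
```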

For the k-means or k-medoids algorithm, the silhouette coefficient can be used to compare the results of several runs of the algorithm in order to obtain better parameters. This is particularly useful for these algorithms because they start from a random initialization and can therefore end in different local optima. The influence of the parameter k can also be reduced: because the silhouette coefficient is independent of the number of clusters, it can compare results obtained with different values of k.
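The comparison of several runs might look like the following sketch. Everything in it is an assumption made for illustration: `kmeans` is a deliberately minimal Lloyd iteration written for this example (not a library routine), the data are three synthetic point clouds, and the number of runs per k is arbitrary. For every k, the run with the highest silhouette coefficient is kept, and the values of k are then compared through those coefficients.

```python
import random
from math import dist

def kmeans(points, k, rng, iters=50):
    """Minimal Lloyd iteration from a random start; returns one label per point."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster ran empty
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def silhouette_coefficient(points, labels):
    """Mean silhouette over the whole data set."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    if len(groups) < 2:
        return float("-inf")  # undefined for a single cluster; rank it last
    total = 0.0
    for o, lab in zip(points, labels):
        avg = {g: sum(dist(c, o) for c in m) / len(m) for g, m in groups.items()}
        d_own = avg.pop(lab)          # average distance to the own cluster
        d_near = min(avg.values())    # average distance to the nearest other one
        denom = max(d_own, d_near)
        total += 0.0 if denom == 0 else (d_near - d_own) / denom
    return total / len(points)

rng = random.Random(0)
# Synthetic data: 20 points around each of three made-up centers
points = [(rng.gauss(cx, 0.3), rng.gauss(cy, 0.3))
          for cx, cy in [(0, 0), (4, 0), (2, 4)] for _ in range(20)]

best = {}
for k in (2, 3, 4, 5):
    runs = [kmeans(points, k, rng) for _ in range(10)]  # several random starts
    best[k] = max(silhouette_coefficient(points, lab) for lab in runs)
best_k = max(best, key=best.get)  # k whose best run has the highest coefficient
```

Taking the maximum over the runs addresses the random starts, while comparing `best[k]` across different k relies on the coefficient not depending on the number of clusters.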

Silhouette plot

The silhouettes of all observations are shown together in a silhouette plot. For every observation that belongs to a cluster, the value of its silhouette is drawn as a horizontal (or vertical) line. Within a cluster, the observations are sorted by the size of their silhouettes.
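The layout of such a plot can be sketched without a plotting library. In this hypothetical example the silhouettes are grouped by cluster, sorted within each cluster (largest first, which is an assumption, since the text only says they are sorted by size), and drawn as horizontal text bars to the left or right of a zero line. The function names and the silhouette values are made up.

```python
def silhouette_plot_rows(sil, labels):
    """Group silhouettes by cluster and sort each group, largest first."""
    rows = []
    for lab in sorted(set(labels)):
        values = sorted((s for s, l in zip(sil, labels) if l == lab), reverse=True)
        rows.extend((lab, s) for s in values)
    return rows

def render(rows, width=20):
    """Draw one horizontal bar per observation; '|' marks the value 0."""
    lines = []
    for lab, s in rows:
        n = round(abs(s) * width)
        neg = ("#" * n if s < 0 else "").rjust(width)  # bar left of the zero line
        pos = "#" * n if s >= 0 else ""                # bar right of the zero line
        lines.append(f"{lab} {neg}|{pos}")
    return "\n".join(lines)

# Made-up silhouettes of four observations in clusters 1 and 2
sil = [0.2, 0.9, -0.3, 0.5]
labels = [1, 1, 2, 2]
rows = silhouette_plot_rows(sil, labels)
print(render(rows))
```

Negative silhouettes show up as bars to the left of the zero line, so poorly assigned observations are visible at a glance.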

The graphs on the right show, for each of four different data sets, from top to bottom: the data, the dendrogram of a hierarchical cluster analysis (Euclidean distance, single linkage) and the silhouette plot for the two-cluster solution. The assignment of the data points by the hierarchical cluster analysis in the two-cluster solution is indicated by color: red (assignment to cluster 1) and blue (assignment to cluster 2).

The more clearly the two clusters are separated in the data (from left to right), the better the hierarchical cluster analysis assigns the data points correctly. The silhouette plot changes as well: while negative silhouettes occur for the leftmost data set, the rightmost data set shows only positive silhouettes. The silhouette coefficients also increase from left to right, both for the individual clusters and for the whole data set.

Example

The Iris flower data set consists of 50 observations each of three species of iris (Iris setosa, Iris virginica and Iris versicolor). For each observation, four attributes of the flower were recorded: the length and the width of the sepal and of the petal. The scatterplot matrix on the right shows the data for the four variables.

For the four variables, a hierarchical cluster analysis was performed using the Euclidean distance and the single-linkage method. The graphs above show:

  • Top left: The dendrogram of the cluster solution. It suggests a two- or four-cluster solution.
  • Top right: The silhouettes of the two-cluster solution. The first cluster contains negative silhouettes, so these observations are probably assigned to the wrong cluster. A solution with more clusters may be more appropriate.
  • Bottom left: The silhouettes of the three-cluster solution. The first cluster is split into two sub-clusters. The negative silhouettes in the first cluster have disappeared, but observations in the second cluster now have negative silhouettes.
  • Bottom right: The silhouettes of the four-cluster solution. The second cluster of the two-cluster solution is now split into two sub-clusters. Almost no negative silhouettes remain.

This yields the following silhouette coefficients:
