Hierarchical clustering

Hierarchical cluster analysis refers to a family of distance-based methods for cluster analysis (structure discovery in data sets). Here, clusters consist of objects that have a smaller distance (or, conversely, a higher similarity) to one another than to the objects of other clusters. The methods in this family can be distinguished by the distance or proximity measures used (between objects, but also between entire clusters) and by their computation rule.

If the methods are subdivided according to their computation rule, there are two important types:

  • The divisive clustering methods, in which all objects are first considered to belong to one cluster, and the clusters already formed are then gradually subdivided into smaller and smaller clusters until each cluster consists of only one object (also called the "top-down" approach).
  • The agglomerative clustering methods, in which each object initially forms its own cluster, and the clusters already formed are then gradually merged into larger and larger ones until all objects belong to one cluster (also called the "bottom-up" approach).

In both families, clusters that have once been formed cannot be changed again. The structure is only ever refined ("divisive") or only ever coarsened ("agglomerative"), so that a strict cluster hierarchy results. From the resulting hierarchy one can no longer tell how it was computed.


Advantages and Disadvantages

The advantages of hierarchical cluster analysis are its flexibility through the use of complex distance measures, the fact that, apart from the distance function and the linkage method, the procedure has no parameters of its own, and that the result is a cluster hierarchy which also allows substructures.

A disadvantage is the effort required to analyze the result. Other methods, such as the k-means algorithm, DBSCAN, or the likewise hierarchical algorithm OPTICS, deliver a single partitioning of the data into clusters. A hierarchical cluster analysis delivers a large number of such partitionings, and the user has to decide how to partition.

Another disadvantage of hierarchical cluster analysis is its run-time complexity. In practice and research, the agglomerative calculation is far more common, since in each step there are O(n²) candidate pairs of clusters to merge, which leads to a naive overall complexity of O(n³). In special cases, however, methods with a total complexity of O(n²) are known. Divisively, there are naively 2^(n−1)−1 ways in each step to split the data set into two parts.

A further disadvantage of hierarchical cluster analysis is that it does not provide cluster models. Depending on the measures used, chaining effects ("single-link effect") arise, for example, and outliers often give rise to tiny clusters consisting of only a few elements. The clusters found therefore usually have to be analyzed afterwards in order to obtain models.

Dendrogram

Figure: data set and dendrogram for single linkage; the objects closest to each other are merged first.

To visualize the tree that results from a hierarchical clustering, the dendrogram (Greek δένδρον (dendron) = tree) can be used. The dendrogram is a tree that represents the hierarchical decomposition of the data set into ever smaller subsets. The root represents a single cluster that contains the entire data set. The leaves of the tree represent clusters that each contain a single object of the data set. An inner node represents the union of all its child nodes. Each edge between a node and one of its child nodes additionally carries as an attribute the distance between the two represented sets of objects.

In statistics, the dendrogram is drawn as a diagram in which the objects are given on one axis, and the distance or (dis)similarity, or a monotone transformation of it, on the other. When two clusters are merged, they have a certain distance or (dis)similarity to each other, and the connecting line is drawn at this height; in the example, the objects RR1 and RR4 were joined at a value of the similarity measure of about 62.

This can also be used to determine the number of clusters, by choosing a suitable height in the dendrogram. Typically, one looks for a place where there is a large jump (or drop) in the distance or (dis)similarity between two successive mergers, for example at a height of 40 in the dendrogram on the right. Four clusters then result, two of which contain only individual objects (RR2, RR5), one cluster contains two objects (RR3 and RR6), and the last cluster contains all remaining objects. If there are hierarchical clusters with substantially different numbers of objects, it may be necessary to partition at different heights: while one cluster at a given height is still connected to its neighbor, another ("thinner") cluster at the same height already breaks up into individual objects.
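To illustrate choosing the number of clusters by cutting the dendrogram at a height, here is a minimal sketch using SciPy; the data array, the linkage method and the threshold are purely illustrative and not taken from the example above:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Illustrative data: 10 observations with 2 features each.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 2))

    # Build the cluster hierarchy (here: single linkage, Euclidean distances).
    Z = linkage(X, method="single", metric="euclidean")

    # "Cut" the dendrogram at a chosen height: every merger whose distance
    # exceeds the threshold is not carried out, and each remaining subtree
    # becomes one cluster.
    threshold = 1.0   # illustrative value; corresponds to a height in the dendrogram
    labels = fcluster(Z, t=threshold, criterion="distance")
    print(labels)          # cluster label for each observation
    print(labels.max())    # number of clusters at this height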

Distance and Similarity Measures

Both in agglomerative and in divisive hierarchical cluster analysis it is necessary to compute distances or (dis)similarities between two objects, between an object and a cluster, or between two clusters. Depending on the measurement level of the underlying variables, different measures are used:

  • For nominal and ordinal variables, similarity measures are used, i.e. a value of zero means that the objects are maximally dissimilar. These can be converted into distance measures.
  • For metric variables, distance measures are used, i.e. a value of zero means that the objects have a distance of zero and hence maximal similarity.

The following table shows some similarity or distance measures for binary and metric variables. Categorical variables with more than two categories can be converted into a number of binary variables. The Gower distance can also be defined for nominally scaled variables.

Examples

  • An online bookseller knows, for two visitors, which book web pages they have viewed; for each page, 0 = not viewed or 1 = viewed is stored. Which similarity measure is sensible for finding out how similar the two visitors are? The number of book pages that neither of the two visitors has viewed, of which there are very many, should not enter the calculation. A possible coefficient is the Jaccard coefficient, i.e. the number of book pages that both visitors have viewed, divided by the number of book pages that at least one of the two visitors has viewed (the number of pages viewed only by the first visitor, plus the number viewed only by the second visitor, plus the number viewed by both); see the sketch after this list.
  • The ALLBUS survey asks, among other things, for an assessment of the current economic situation, with the answer options Very good, Good, Partly good/partly bad, Bad, and Very bad. For each of the possible answers a binary variable is formed, so that the binary similarity measures can be used. Note that, with several variables having different numbers of categories, a weighting with respect to the number of categories should also be applied.
  • In the Iris data set, the four dimensions of the iris petals are considered. To compute the distance between two petals, the Euclidean distance, for example, can be used.
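As a minimal sketch of the measures mentioned in these examples (the Jaccard coefficient for binary viewing profiles and the Euclidean distance for metric petal measurements), with purely illustrative data:

    import numpy as np

    def jaccard_similarity(u, v):
        """Jaccard coefficient of two binary 0/1 vectors: pages viewed by both
        visitors divided by pages viewed by at least one visitor.
        Pages viewed by neither visitor do not enter the calculation."""
        u, v = np.asarray(u, bool), np.asarray(v, bool)
        both = np.sum(u & v)           # viewed by both visitors
        at_least_one = np.sum(u | v)   # viewed by at least one visitor
        return both / at_least_one if at_least_one > 0 else 1.0

    def euclidean_distance(x, y):
        """Euclidean distance between two metric vectors, e.g. two iris petals."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        return np.sqrt(np.sum((x - y) ** 2))

    visitor1 = [1, 0, 1, 1, 0, 0]     # illustrative viewing profiles
    visitor2 = [1, 0, 0, 1, 0, 1]
    print(jaccard_similarity(visitor1, visitor2))   # 0.5

    petal1 = [5.1, 3.5, 1.4, 0.2]     # illustrative petal measurements
    petal2 = [4.9, 3.0, 1.4, 0.2]
    print(euclidean_distance(petal1, petal2))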

Which similarity or distance measure is used ultimately depends on the desired substantive interpretation.

Agglomerative calculation

The agglomerative calculation of a hierarchical cluster analysis is the simplest and most flexible case. At the beginning, each object is regarded as a separate cluster. In each step, the clusters that are closest to each other are then merged into one cluster. If a cluster consists of several objects, it must be specified how the distance between clusters is computed, and this is where the individual agglomerative methods differ; for clusters with only one object, as given at the start, this is trivial. The procedure can be terminated when all clusters exceed (or fall below) a certain distance (or similarity) to each other, or when a sufficiently small number of clusters has been reached.

In order to carry out an agglomerative cluster analysis, two things have to be chosen:

  • a distance or similarity measure for determining the distance between two objects, and
  • a linkage algorithm for determining the distance between two clusters.

The choice of the linkage algorithm is often more important than that of the distance or similarity measure.
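The following is a didactic sketch of the naive agglomerative procedure with single linkage as the cluster distance; it is meant to illustrate the loop described above, not to be efficient, and all names and data are illustrative:

    import numpy as np

    def naive_agglomerative(X, n_clusters=1):
        """Naive agglomerative clustering with single linkage.
        Starts with one cluster per object and repeatedly merges the
        two closest clusters until n_clusters remain."""
        # pairwise Euclidean distances between the objects
        D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
        clusters = [[i] for i in range(len(X))]
        merges = []
        while len(clusters) > n_clusters:
            # find the pair of clusters with the smallest single-linkage distance
            best = (np.inf, None, None)
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    d = min(D[a, b] for a in clusters[i] for b in clusters[j])
                    if d < best[0]:
                        best = (d, i, j)
            d, i, j = best
            merges.append((clusters[i], clusters[j], d))
            clusters[i] = clusters[i] + clusters[j]
            del clusters[j]
        return clusters, merges

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 0.3, (5, 2)), rng.normal(3, 0.3, (5, 2))])
    clusters, merges = naive_agglomerative(X, n_clusters=2)
    print(clusters)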

Linkage algorithms

The following overview shows common linkage algorithms. The distance between two clusters A and B is often computed from the distance or dissimilarity d(a, b) of two objects a ∈ A and b ∈ B:
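The original table is not preserved in this text; the definitions below are the standard textbook forms of the most common linkage rules (single, complete and average linkage and the centroid method):

    \begin{aligned}
    \text{Single linkage:}   \quad & d(A,B) = \min_{a \in A,\; b \in B} d(a,b) \\
    \text{Complete linkage:} \quad & d(A,B) = \max_{a \in A,\; b \in B} d(a,b) \\
    \text{Average linkage:}  \quad & d(A,B) = \frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} d(a,b) \\
    \text{Centroid method:}  \quad & d(A,B) = d(\mu_A, \mu_B)
    \end{aligned}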

where μ_A denotes the centroid of cluster A and μ_B that of cluster B.

Other methods exist as well, for example:
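The further methods are likewise not preserved in this text; one widely used example is Ward's minimum-variance method, for which the cluster distance can be written (up to scaling conventions) as the increase in the within-cluster sum of squares caused by the merger:

    \[
    d(A,B) \;=\; \frac{|A|\,|B|}{|A|+|B|}\;\lVert \mu_A - \mu_B \rVert^{2}
    \]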

where μ_A again denotes the centroid of cluster A and μ_B that of cluster B. This method tends to form clusters of equal size.

Of practical relevance is above all single linkage, since it permits an efficient calculation with the SLINK algorithm.

Examples of linkage algorithms

The differences become particularly clear in the second step of the algorithm. Using a particular distance measure, the two objects closest to each other were merged into one cluster in the first step. This can be represented in a distance matrix as follows:

The smallest distance is found between Object1 and Object2 (marked red in the distance matrix), so Object1 and Object2 are merged into one cluster. Now the matrix has to be set up again ("o" stands for "or"), i.e. the distance between the new cluster {Object1, Object2} and Object3 or Object4 has to be recomputed (marked yellow in the distance matrix):

Which of the two values is relevant for determining the distance depends on the method:
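The distance matrices themselves are not reproduced in this text. As a purely hypothetical illustration, suppose that d(Object1, Object3) = 5 and d(Object2, Object3) = 7. The distance of the new cluster {Object1, Object2} to Object3 then depends on the linkage method:

    \begin{aligned}
    \text{Single linkage:}   \quad & d(\{1,2\},\,3) = \min(5,7) = 5 \\
    \text{Complete linkage:} \quad & d(\{1,2\},\,3) = \max(5,7) = 7 \\
    \text{Average linkage:}  \quad & d(\{1,2\},\,3) = \tfrac{1}{2}(5+7) = 6
    \end{aligned}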

Density Linkage

With density linkage, a density value is estimated for each object. For this calculation, one of the usual distance measures between objects is used, such as the Euclidean distance or the Manhattan distance. A new distance between two objects is then computed from their density values; these also depend on the neighbourhoods of the objects. One of the preceding linkage methods can then be applied for the agglomerative clustering.

A problem of density linkage algorithms is the choice of their parameters.

The algorithms OPTICS and HDBSCAN* (a hierarchical variant of DBSCAN) can also be interpreted as hierarchical density-linkage clustering.
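As a hedged illustration of the density-linkage idea, the following sketch estimates a density for each object via its distance to the k-th nearest neighbour (the "core distance" used by OPTICS and HDBSCAN*), derives the density-adjusted "mutual reachability" distance, and then applies ordinary single linkage; the parameter k and all names are illustrative:

    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from scipy.cluster.hierarchy import linkage

    def mutual_reachability_linkage(X, k=5):
        """Density-adjusted single linkage in the spirit of OPTICS/HDBSCAN*.
        core(x)        = distance to the k-th nearest neighbour (a density estimate)
        d_mreach(a, b) = max(core(a), core(b), d(a, b))"""
        D = squareform(pdist(X))              # Euclidean distance matrix
        core = np.sort(D, axis=1)[:, k]       # k-th nearest neighbour distance
        mreach = np.maximum(D, np.maximum.outer(core, core))
        np.fill_diagonal(mreach, 0.0)
        # single linkage on the condensed mutual reachability matrix
        return linkage(squareform(mreach, checks=False), method="single")

    rng = np.random.default_rng(2)
    X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(4, 0.5, (30, 2))])
    Z = mutual_reachability_linkage(X, k=5)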

Efficient calculation of linkage algorithms

Lance and Williams formula

When clusters are merged, it is not necessary to recompute all distances between the objects. Instead, one starts, as in the example above, with a distance matrix. Once it is clear which clusters are merged, only the distances between the merged cluster and all other clusters have to be recomputed. The new distance between the merged cluster and another cluster can be computed from the old distances using the formula of Lance and Williams:
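Written out (merging clusters C_i and C_j, with C_k any other cluster), the Lance and Williams update formula is:

    \[
    d(C_i \cup C_j,\, C_k) \;=\; \alpha_i\, d(C_i, C_k) \;+\; \alpha_j\, d(C_j, C_k)
    \;+\; \beta\, d(C_i, C_j) \;+\; \gamma\, \bigl|\, d(C_i, C_k) - d(C_j, C_k) \,\bigr|
    \]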

Lance and Williams have also specified their own linkage method based on their formula: Lance-Williams flexible-beta.

For the various linkage methods, different constants α_i, α_j, β and γ apply, where n_i denotes the number of objects in cluster C_i.
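The table of constants is not preserved in this text; the values commonly given in the literature for the most frequent methods are (with n_i = |C_i|):

    Method             alpha_i                    alpha_j                    beta                       gamma
    Single linkage     1/2                        1/2                        0                          -1/2
    Complete linkage   1/2                        1/2                        0                          +1/2
    Average linkage    n_i/(n_i+n_j)              n_j/(n_i+n_j)              0                          0
    Centroid method    n_i/(n_i+n_j)              n_j/(n_i+n_j)              -n_i*n_j/(n_i+n_j)^2       0
    Median method      1/2                        1/2                        -1/4                       0
    Ward's method      (n_i+n_k)/(n_i+n_j+n_k)    (n_j+n_k)/(n_i+n_j+n_k)    -n_k/(n_i+n_j+n_k)         0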

SLINK and CLINK

While the naive calculation of a hierarchical cluster analysis has poor complexity (see the O(n³) bound above), there are more efficient solutions for some cases.

For agglomerative single linkage there is an optimally efficient procedure called SLINK with complexity O(n²), and a generalization of it to complete linkage, CLINK, also with complexity O(n²). For other linkage methods, such as average linkage, no efficient algorithms are known.

Example

The Swiss banknote data set consists of 100 genuine and 100 counterfeit Swiss 1000-franc banknotes. For each banknote, six variables were recorded:

  • the width of the banknote (WIDTH),
  • the height of the banknote on the left side (LEFT),
  • the height of the banknote on the right side (RIGHT),
  • the distance of the colored print from the upper edge of the banknote (UPPER),
  • the distance of the colored print from the lower edge of the banknote (LOWER), and
  • the diagonal (bottom left to top right) of the colored print on the banknote (DIAGONAL).

As a distance measure, the Euclidean distance on these six variables suggests itself.

For the following graphs, several hierarchical clustering methods were then applied. Each graph consists of two parts:

  • The left part shows the first two principal components of the data. This representation is chosen because in this (two-dimensional) representation the distances in the plane correspond well to the distances in the six-dimensional space. So if there are two clearly separated clusters (large distances between the clusters), one hopes to see this in this representation as well. Data points belonging to the same cluster are marked with the same color; only in the case of black data points is each data point its own cluster.
  • The right part shows the corresponding dendrogram. The "height" on the y-axis indicates at which "distance" observations or clusters are merged into a new cluster (according to the linkage algorithm). If the two subclusters of a merger belong to the same cluster, the dendrogram is drawn in the corresponding cluster color; if they belong to different clusters, black is used. The gray points on the left of the dendrogram again indicate at which "distance" a merger took place. To determine a good number of clusters, the largest possible gap in the gray points is sought, because a large gap means that at the next merger there is a large distance between the clusters to be merged. A sketch of how such a pair of graphs can be produced is given after this list.
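The following sketch shows how such a pair of graphs could be produced with standard Python tooling; the file name is a placeholder, it is assumed that the six measurements are available as a 200 × 6 array, and Ward linkage with four clusters is chosen only for illustration:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
    from sklearn.decomposition import PCA

    # X: (200, 6) array with WIDTH, LEFT, RIGHT, UPPER, LOWER, DIAGONAL per banknote
    X = np.loadtxt("banknotes.csv", delimiter=",")   # placeholder file name

    Z = linkage(X, method="ward")                    # hierarchy (Euclidean distances)
    labels = fcluster(Z, t=4, criterion="maxclust")  # cut into e.g. 4 clusters

    fig, (ax_pca, ax_dend) = plt.subplots(1, 2, figsize=(10, 4))

    # Left: first two principal components, colored by cluster
    scores = PCA(n_components=2).fit_transform(X)
    ax_pca.scatter(scores[:, 0], scores[:, 1], c=labels, cmap="tab10", s=10)
    ax_pca.set_xlabel("PC 1")
    ax_pca.set_ylabel("PC 2")

    # Right: the corresponding dendrogram; the y-axis shows the merge distance
    dendrogram(Z, ax=ax_dend, no_labels=True)
    plt.show()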

Data and dendrogram for the Ward method.

Data and dendrogram for the complete-linkage method.

Data and dendrogram for the single-linkage method.

Data and dendrogram for the median method.

Data and dendrogram for the centroid method.

Divisive calculation

As mentioned above, there are in theory 2^(n−1)−1 possibilities to split a data set of n objects into two non-empty parts. Divisive methods therefore usually need a heuristic to generate candidate splits, which can then be evaluated, for example, with the same measures as in the agglomerative calculation.

Kaufman and Rousseeuw (1990) describe a divisive clustering procedure (DIANA) that repeatedly splits off a "splinter group" from the cluster currently having the largest diameter.
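The step-by-step description from the source is not reproduced in this text; the following is a hedged sketch of the splinter-group procedure commonly attributed to DIANA (all names and the simplified reassignment rule are illustrative):

    import numpy as np

    def diana_split(D, members):
        """Split one cluster (given by the index list `members`) into two,
        following the splinter-group idea."""
        rest = list(members)
        # start the splinter group with the object having the largest
        # average dissimilarity to the other objects of the cluster
        avg = [np.mean([D[i, j] for j in rest if j != i]) for i in rest]
        splinter = [rest.pop(int(np.argmax(avg)))]
        moved = True
        while moved and len(rest) > 1:
            moved = False
            for i in list(rest):
                d_rest = np.mean([D[i, j] for j in rest if j != i])
                d_spl = np.mean([D[i, j] for j in splinter])
                if d_spl < d_rest:        # object i is closer to the splinter group
                    rest.remove(i)
                    splinter.append(i)
                    moved = True
        return splinter, rest

    def diana(X, n_clusters=2):
        """Divisive clustering: repeatedly split the cluster with the largest diameter."""
        D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
        clusters = [list(range(len(X)))]
        while len(clusters) < n_clusters:
            diam = [max(D[i, j] for i in c for j in c) for c in clusters]
            c = clusters.pop(int(np.argmax(diam)))
            a, b = diana_split(D, c)
            clusters += [a, b]
        return clusters

    rng = np.random.default_rng(3)
    X = np.vstack([rng.normal(0, 0.4, (8, 2)), rng.normal(3, 0.4, (8, 2))])
    print(diana(X, n_clusters=2))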

Another special algorithm is spectral relaxation.
