Cluster analysis


Cluster analysis (clustering; sometimes also: concentration analysis) refers to procedures for discovering similarity structures in (large) datasets. The groups of "similar" objects found in this way are called clusters, and the group assignment is called a clustering. The similarity groups found can be graph-theoretic, hierarchical, partitioning, or optimizing. Cluster analysis is an important discipline of data mining, the analysis step of the Knowledge Discovery in Databases process.

In cluster analysis, the objective is to identify new groups in the data (as opposed to classification, which assigns data to existing classes). One speaks of an "uninformed" method because it does not rely on prior knowledge of classes. The new groups found can then be used, for example, for automated classification, for the detection of patterns in image processing, or for market segmentation (or in any other method that relies on such prior knowledge).

The various algorithms differ mainly in their notions of similarity and grouping, their cluster model, their algorithmic approach (and therefore their complexity), and their tolerance to errors in the data. Whether the "knowledge" generated by such an algorithm is useful can usually only be judged by an expert. A clustering algorithm may merely reproduce existing knowledge (e.g., divide personal data into the known groups "male" and "female") or generate groups that are not useful for the purpose at hand. The groups found often cannot be described verbally (e.g., "male persons"); common properties are usually identified only by a subsequent analysis. When applying cluster analysis, it is therefore often necessary to try different methods and different parameters, to pre-process the data, and, for example, to select or omit attributes.


Example

Applied to a dataset of vehicles, a clustering algorithm (followed by an analysis of the groups found) could, for example, provide the following structure:

The following should be noted:

  • The algorithm itself does not provide an interpretation ("truck") of the groups found. A separate analysis of the groups is necessary for this.
  • A person would view a pedicab as a subgroup of bicycles. For a clustering algorithm, however, three wheels are often a significant difference that the pedicab shares with a three-wheeled mobility scooter.
  • The groups are often not "pure": for example, small trucks may end up in the group of cars.
  • Additional groups often occur that were not expected ("police cars", "convertibles", "red cars", "cars with xenon headlights").
  • Some groups are not found, such as "motorcycles" or "recumbent bicycles".
  • Which groups are found depends strongly on the algorithm, the parameters used, and the object attributes used.
  • Often nothing (meaningful) is found.
  • In this example, only already known knowledge was (re-)discovered; as a method of "knowledge discovery", the cluster analysis has thus failed here.

History

Historically, the method comes from taxonomy in biology, where an ordering of living beings is determined by clustering related species. However, no automatic calculation methods were originally used there. Nowadays, among other things, the gene sequences of organisms can be compared in order to determine their relatedness. See also: cladistics

Principle / operation

Mathematical modeling

The properties of the objects to be examined are mathematically treated as random variables. They are usually represented in the form of vectors as points in a vector space whose dimensions are the attributes of the objects. Areas in which points accumulate (point clouds) are called clusters. In scatter plots, the distances between the points, or the variance within a cluster, serve as so-called proximity measures, which express the similarity or dissimilarity between the objects.

A cluster can be defined as a group of objects that have a minimal sum of distances with respect to a calculated centroid. This requires the choice of a distance measure. In certain cases, the distances (or, conversely, the similarities) between the objects are known directly, so they do not have to be determined from a representation in a vector space.
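As a minimal illustration of these notions, the following Python sketch (with hypothetical attribute values) computes Euclidean distances as a proximity measure and the sum of distances of a group to its centroid:

    import numpy as np

    # Hypothetical objects, each described by two numeric attributes.
    points = np.array([[1.0, 2.0],
                       [1.5, 1.8],
                       [8.0, 8.2]])

    # Euclidean distance as the proximity measure between two objects.
    def euclidean(a, b):
        return np.linalg.norm(a - b)

    # Centroid (mean vector) of a group of objects.
    centroid = points.mean(axis=0)

    # Sum of distances of all objects to the centroid; a cluster can be
    # characterized as a group that keeps this sum small.
    distance_sum = sum(euclidean(p, centroid) for p in points)
    print(centroid, distance_sum)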

Basic Procedure

Within a total set of different objects, the objects that are similar to one another are combined into groups (clusters).

As an example, consider the following analysis of musical taste. The values are derived from the proportion of music pieces per genre that users buy online per month.

In this example, one would intuitively divide the people into two groups: group 1 consists of persons 1 and 2, and group 2 consists of person 3. Most clustering algorithms would do the same. This example is so clear-cut only because of the chosen values; with values lying closer together and with more variables (here, musical genres) and objects (here, people), it is easy to imagine that the division into groups is not so trivial.

Put a bit more precisely and abstractly: using cluster analysis, the objects of a heterogeneous total set (described by the different values of their variables) are to be combined into subgroups (clusters/segments) that are internally as homogeneous as possible (the differences of the variables as small as possible). In our music example, the group of all music listeners (a very heterogeneous group) can be divided into the groups of jazz listeners, rock listeners, pop listeners, etc. (each relatively homogeneous), or the listeners with similar preferences can be assigned to the appropriate group.

A cluster analysis (e.g., for market segmentation) is carried out in the following steps:

Steps for clustering (a sketch of these steps follows below):

  • Find outliers with the single-linkage method (hierarchical method)
  • Determine the number of clusters with the Ward method (hierarchical method)
  • Determine the cluster composition with an exchange method (partitioning method)
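A possible illustration of these three steps with SciPy, using hypothetical data; k-means serves here only as a simple stand-in for an exchange procedure:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.cluster.vq import kmeans2

    np.random.seed(0)
    X = np.random.rand(20, 2)  # hypothetical data

    # Step 1: single linkage; very small clusters in the result hint at outliers.
    single = linkage(X, method='single')
    outlier_check = fcluster(single, t=2, criterion='maxclust')

    # Step 2: Ward linkage; a large jump in the merge distances
    # (last column of the linkage matrix) suggests the number of clusters.
    ward = linkage(X, method='ward')
    print(ward[-5:, 2])

    # Step 3: a partitioning method determines the final composition.
    centers, labels = kmeans2(X, 3, minit='points')
    print(labels)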

Further steps of the cluster analysis:

Different proximity measures / scales

The scale indicates the range of values that the considered variable(s) of the object can assume. Depending on the type of scale present, a suitable proximity measure must be used. There are three main categories of scales (two of the measures are illustrated in a sketch after the list):

  • Binary scale
      • Jaccard coefficient (similarity measure)
      • Lance-Williams measure (distance measure)
  • Nominal scale
      • Chi-square measure (distance measure)
      • Phi-square measure (distance measure)
  • Metric scale
      • Pearson correlation coefficient (similarity measure)
      • Euclidean metric (distance measure)
      • Minkowski metric (distance measure)
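Two of these measures made concrete in a short Python sketch (hypothetical vectors): the Jaccard coefficient for binary data and the Minkowski distance, which contains the Euclidean metric as the special case p = 2:

    import numpy as np

    # Binary-scaled objects (hypothetical attribute vectors).
    a = np.array([1, 0, 1, 1, 0], dtype=bool)
    b = np.array([1, 1, 1, 0, 0], dtype=bool)

    # Jaccard coefficient: shared 1-attributes divided by the
    # attributes set in either object.
    jaccard = (a & b).sum() / (a | b).sum()

    # Metric-scaled objects.
    x = np.array([1.0, 2.0, 3.0])
    y = np.array([2.0, 0.0, 4.0])

    # Minkowski distance of order p; p=2 is the Euclidean metric,
    # p=1 the Manhattan distance.
    def minkowski(u, v, p):
        return (np.abs(u - v) ** p).sum() ** (1.0 / p)

    print(jaccard, minkowski(x, y, 2), minkowski(x, y, 1))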

Forms of grouping ( group membership )

Three different forms of grouping (group membership) are possible. In non-overlapping groups, each object is assigned to exactly one group (segment, cluster); in overlapping groups, an object may be assigned to several groups; and in fuzzy groups, each element belongs to each group with a certain degree of membership.

A distinction is made between "hard" and "soft" clustering algorithms. Hard methods (e.g., k-means, spectral clustering, kernel PCA) assign each data point to exactly one cluster, whereas soft methods (e.g., the EM algorithm with a mixture-of-Gaussians model) assign each data point, for each cluster, a degree with which it can be associated with that cluster. Soft methods are especially useful when the data points are distributed relatively homogeneously in space and the clusters appear only as regions of increased data-point density, i.e., when there are, for example, smooth transitions between the clusters or background noise (hard methods are useless in this case).
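The difference can be sketched in a few lines of Python (hypothetical data; the soft degrees here are simple inverse-distance weights, not a fitted probabilistic model):

    import numpy as np

    points = np.array([[0.0, 0.0], [1.0, 0.5], [5.0, 5.0]])  # hypothetical
    centers = np.array([[0.0, 0.0], [5.0, 5.0]])

    # Distance of every point to every cluster center.
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)

    # Hard assignment: each point belongs to exactly one cluster.
    hard = d.argmin(axis=1)

    # Soft assignment: a membership degree per cluster, summing to 1
    # (EM would use model probabilities instead of inverse distances).
    w = 1.0 / (d + 1e-9)
    soft = w / w.sum(axis=1, keepdims=True)
    print(hard, soft)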

Distinguishing the clustering methods

Clustering methods can be divided into graph-theoretic, hierarchical, partitioning, and optimizing methods, as well as further sub-methods.

  • Partitioning methods
  • Hierarchical methods

Note that various further distinctions between methods and algorithms can be made, including between supervised and unsupervised algorithms, and model-based algorithms, in which an assumption about the underlying distribution of the data is made (for example, the mixture-of-Gaussians model).

Algorithms / methods

Graph-theoretic clustering

DBSCAN - density-connected clusters

Basics

In density-based clustering, clusters are regarded as objects in a d-dimensional space that lie close to one another, separated by regions of lower density.

Basic concepts

  • Maximality: ∀ p, q ∈ O: if p ∈ C and q is density-reachable from p, then q ∈ C.
  • Connectivity: ∀ p, q ∈ C: p is density-connected to q.

An important extension of DBSCAN is the OPTICS algorithm, which can also recognize clusters of differing density, provides a hierarchical result, and permits a visual evaluation.
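A compact sketch of the density-based idea: a simplified DBSCAN in Python that omits the border-point subtleties of the full algorithm; eps and min_pts are hypothetical choices:

    import numpy as np

    def dbscan(X, eps, min_pts):
        # Simplified DBSCAN: -1 marks noise, 0, 1, ... are cluster labels.
        n = len(X)
        labels = np.full(n, -1)
        dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)
        neighbors = [np.where(dist[i] <= eps)[0] for i in range(n)]
        cluster_id = 0
        for i in range(n):
            # Only unassigned core points (dense neighborhoods) start a cluster.
            if labels[i] != -1 or len(neighbors[i]) < min_pts:
                continue
            labels[i] = cluster_id
            stack = [i]
            while stack:  # expand: everything density-reachable joins (maximality)
                j = stack.pop()
                if len(neighbors[j]) >= min_pts:  # only core points expand further
                    for k in neighbors[j]:
                        if labels[k] == -1:
                            labels[k] = cluster_id
                            stack.append(k)
            cluster_id += 1
        return labels

    labels = dbscan(np.random.rand(30, 2), eps=0.15, min_pts=3)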

Cliques and connected components

Two extremes of clustering in networks are the division into connected components (single link) and the division into cliques.
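This contrast is easy to see with a graph library; a sketch using NetworkX, with hypothetical edges standing for sufficient similarity between objects:

    import networkx as nx

    # Hypothetical similarity graph: an edge links sufficiently similar objects.
    G = nx.Graph([(1, 2), (2, 3), (3, 1), (3, 4), (5, 6)])

    # Connected components: the loosest notion of a cluster (single link);
    # a single chain of similarities suffices for membership.
    print(list(nx.connected_components(G)))  # [{1, 2, 3, 4}, {5, 6}]

    # Cliques: the strictest notion; every pair inside a cluster must be linked.
    print(list(nx.find_cliques(G)))  # e.g. [[1, 2, 3], [3, 4], [5, 6]]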

Hierarchical clustering methods

Hierarchical cluster analysis refers to a particular family of distance-based methods for cluster analysis (structure discovery in datasets). Here, clusters consist of objects that have a shorter distance to one another (or, conversely, a higher similarity) than to the objects of other clusters. The methods of this family can be distinguished by the distance or proximity measures used (between objects, but also between entire clusters) and by their calculation rule.

Subdividing by the calculation rule, there are two important types of methods:

  • The divisive clustering methods, in which initially all objects are considered as belonging to one cluster, and the clusters are then progressively divided into smaller and smaller clusters until each cluster consists of only one object (also referred to as "top-down" methods).
  • The agglomerative clustering methods, in which initially each object forms a cluster, and the clusters are then progressively merged into larger and larger clusters until all objects belong to one cluster (also referred to as "bottom-up" methods).

For both methods: once clusters have been formed, they can no longer be changed, and individual objects cannot be exchanged. The structure is only ever either refined ("divisive") or generalized ("agglomerative"), so that a strict cluster hierarchy results. From the resulting hierarchy, one can no longer see how it was calculated.
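A minimal agglomerative (bottom-up) sketch in Python with single linkage, merging the two closest clusters step by step; the data are hypothetical:

    import numpy as np

    def agglomerative_single_link(X):
        # Clusters start as singletons; the merge history is returned.
        clusters = [{i} for i in range(len(X))]
        dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)
        history = []
        while len(clusters) > 1:
            # Single-link distance: minimum over all cross-cluster pairs.
            best = None
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    d = min(dist[i, j] for i in clusters[a] for j in clusters[b])
                    if best is None or d < best[0]:
                        best = (d, a, b)
            d, a, b = best
            # Once merged, clusters are never split again (strict hierarchy).
            history.append((clusters[a], clusters[b], d))
            merged = clusters[a] | clusters[b]
            clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
            clusters.append(merged)
        return history

    print(agglomerative_single_link(np.array([[0.0], [0.1], [5.0]])))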

Partitioning clustering methods

k-means algorithm

In the k-means algorithm, the number of clusters k is fixed before the start. A function for calculating the distance between two objects must be given, and it must be possible to compute means of objects that are compatible with this distance (to serve as cluster centers).

The algorithm proceeds roughly as follows: choose k initial cluster centers (e.g., at random); assign each object to the nearest center; recompute each center as the mean of the objects assigned to it; repeat the last two steps until the assignment no longer changes.
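A minimal sketch of these steps in Python (hypothetical data; real implementations add better initialization, such as k-means++, and more careful empty-cluster handling):

    import numpy as np

    def k_means(X, k, iterations=100, seed=0):
        rng = np.random.default_rng(seed)
        # 1. Choose k initial centers, here as random data points.
        centers = X[rng.choice(len(X), size=k, replace=False)]
        labels = np.zeros(len(X), dtype=int)
        for _ in range(iterations):
            # 2. Assign every object to its nearest center.
            d = np.linalg.norm(X[:, None] - centers[None, :], axis=2)
            labels = d.argmin(axis=1)
            # 3. Recompute each center as the mean of its assigned objects;
            #    an empty cluster simply keeps its old center here.
            new_centers = np.array([X[labels == j].mean(axis=0)
                                    if np.any(labels == j) else centers[j]
                                    for j in range(k)])
            # 4. Stop when the centers (and thus the assignment) are stable.
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return centers, labels

    centers, labels = k_means(np.random.rand(50, 2), k=3)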

Peculiarities of the k-means algorithm
  • For different starting positions of the cluster centers, the k-means algorithm may deliver different results.
  • k must be chosen appropriately, and the quality of the result depends strongly on the starting positions.
  • It may happen that a cluster remains empty in one step and thus (for lack of a computable cluster center) can no longer be filled.
  • Finding an optimal clustering is NP-hard. The k-means algorithm does not necessarily find the optimal solution, but it is very fast.

To avoid these problems, one simply restarts the k-means algorithm, in the hope that a different result is delivered in the next run by other random cluster centers. Because of the above shortcomings, the k-means algorithm is theoretically considered a "quick'n'dirty" heuristic; in practice, however, it often provides useful results.

Extensions and special cases

Extensions of the k-means algorithm are, for example, the k-median algorithm and fuzzy c-means. The ISODATA algorithm can be viewed as a special case of k-means.

EM algorithm

The EM algorithm models clusters with the help of multivariate normal distributions which, similarly to k-means, are iteratively refined. As with k-means, the number of normal distributions to be used for modeling must be set in advance.

For the calculation of the mean values and covariance matrices, the data objects must additionally be interpretable as vectors of a fixed dimension.

Unlike k-means, a "soft" cluster assignment is thus achieved: every object belongs to each cluster with a certain probability, and each object influences the parameters of each cluster according to this probability. The success of the algorithm depends strongly on the assumed probability distribution.
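The soft assignment can be sketched with scikit-learn's standard GaussianMixture class; the data below are hypothetical:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Hypothetical data: two overlapping Gaussian point clouds.
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])

    # The number of normal distributions is fixed in advance, as with k-means.
    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

    # Soft assignment: each object belongs to each cluster with a probability.
    probabilities = gmm.predict_proba(X)  # shape (100, 2), rows sum to 1
    hard_labels = gmm.predict(X)          # most probable cluster per object
    print(probabilities[:3], hard_labels[:3])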

Spectral clustering

This algorithm is widely used in image processing, but it can also be used, for example, to cluster web search results.

Maximum Margin Clustering

Problem: in clustering there are no labels on the examples. The task is to find a labeling of the instances that allows the greatest distance (margin) between the clusters.

Naive approach to solving the problem:

Multiview clustering

Usual clustering algorithms can only find clusters in a single vector space. The multiview approach enables parallel clustering in different vector spaces. Web pages, for example, can be represented in the TF-IDF space, where each entry in the feature vector corresponds to the frequency of a word in the given document. On the other hand, they can also be interpreted through their incoming links: then an entry in the feature vector contains a 1 exactly when there is a link from the respective source page to the current page. Combining these two views using multiview clustering yields results of demonstrably better quality than a simple concatenation of the feature vectors.
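The two views can be sketched as two feature matrices over the same pages. A genuine multiview algorithm alternates between the views; as a simplified stand-in, the sketch below clusters each view separately (all page texts and links are hypothetical):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    pages = ["cluster analysis data mining",       # hypothetical page texts
             "data mining knowledge discovery",
             "football world cup results"]

    # View 1: TF-IDF space; each entry weights a word's frequency in the page.
    tfidf_view = TfidfVectorizer().fit_transform(pages)

    # View 2: incoming links; entry (i, j) is 1 if page j links to page i.
    link_view = np.array([[0, 1, 0],
                          [1, 0, 0],
                          [0, 0, 0]])

    # Cluster each view on its own; a multiview method would let the
    # views inform each other instead.
    labels_text = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf_view)
    labels_link = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(link_view)
    print(labels_text, labels_link)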

Execution of the algorithm on the example web pages:

Self-Organizing Maps ( SOMs )

Fuzzy Clustering

Due to ambiguous data, object sets can contain elements that are not prototypical for any of the clusters obtained and could be assigned to several clusters. In fuzzy clustering, objects are therefore distributed over the clusters in a fuzzy manner, i.e., with a degree of membership from the real-valued interval [0,1]. In the special case of membership 1 or 0, an element belongs to a cluster completely or not at all. The best-known fuzzy clustering algorithm is the fuzzy c-means algorithm.
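A compact fuzzy c-means sketch with the standard membership and center updates; the data are hypothetical, and the fuzzifier m controls how blurred the memberships are:

    import numpy as np

    def fuzzy_c_means(X, c, m=2.0, iterations=100, seed=0):
        rng = np.random.default_rng(seed)
        # Random initial memberships in [0, 1], each row summing to 1.
        U = rng.random((len(X), c))
        U /= U.sum(axis=1, keepdims=True)
        for _ in range(iterations):
            # Cluster centers: membership-weighted means of the objects.
            W = U ** m
            centers = (W.T @ X) / W.sum(axis=0)[:, None]
            # Memberships: the closer an object is to a center, the higher
            # its degree of belonging to that cluster.
            d = np.linalg.norm(X[:, None] - centers[None, :], axis=2) + 1e-9
            U = d ** (-2.0 / (m - 1.0))
            U /= U.sum(axis=1, keepdims=True)
        return centers, U  # membership 1 or 0 recovers the crisp special case

    centers, U = fuzzy_c_means(np.random.rand(30, 2), c=3)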
