DBSCAN

DBSCAN ( Density -Based Spatial Clustering of Applications with Noise, such as: density -based spatial clustering analysis with noise) is a standard developed by Martin Ester, Hans -Peter Kriegel, Jörg Sander and Xiaowei Xu data mining algorithm for cluster analysis. He is one of the most cited algorithms in this area. The algorithm operates based density and is able to detect multiple clusters. Noise points are ignored and returned separately.

The basic idea of the algorithm is the notion of " density connectedness ". Two objects are considered to be density- connected if there is a chain of dense objects ( "Core Objects ", with more than MinPts neighbors) that connect these points together. The interconnected by the same core objects objects form a 'cluster'. Objects that are not part of a dense - related cluster are referred to as "noise" (Noise ).

In DBSCAN there are three types of points:

  • Core objects, which are self-sealing.
  • Density - reachable objects. These are properties that can be achieved by a core object of the cluster, but itself does not leak. Clearly this form the edge of a cluster.
  • Noise points that are neither tight nor density-reachable.

The algorithm has two parameters: and " MinPts ". This defines the length neighborhood of a point: accessible from a point, a second point is if and only if its distance is smaller. " MinPts " defined on the other hand, when an object tightly (ie, a core object) is: if it has at least MinPts - reachable neighbors.

Density - reachable points may be density-reachable by more than one cluster. These points are associated with one of the non- deterministic algorithm, one of the possible clusters. This also implies that density connectedness is not transitive; Density - reachability is not symmetrical.

Key Features

DBSCAN is exactly connected density- with respect to the definition of and noise. This means two density- connected objects are guaranteed in the same cluster, while noise objects are safe to Noise. Not exactly the algorithm is only for clusters density- reachable, they are only matched to a cluster, not all possible.

In contrast, for example, the K -means algorithm need not be known in advance how many clusters exist.

The algorithm can (not only spherical, for example ) detect clusters of arbitrary shape.

DBSCAN is largely deterministic and sequence- independent: Regardless of the order in which objects are stored or processed in the database, created the same cluster ( with the exception of only density- reachable non-core properties and the cluster numbering).

The algorithm can be used with any distance functions and similarity measures. In contrast to the K- means algorithm is no geometrical space is necessary because there is no midpoint needs to be calculated.

DBSCAN itself is of linear complexity. Each object is essentially just a visited times. However, the calculation of the neighborhood is usually not possible in constant time ( without corresponding precomputation ). Without the use of pre-calculated data, or a suitable index structure of the algorithm is of quadratic complexity.

DBSCAN algorithm

The original version of DBSCAN can be described by the following pseudocode:

DBSCAN (D, eps, MinPts )     C = 0     for each unvisited point P in dataset D        mark P as visited        N = D.regionQuery (P, eps)        if sizeof ( N) < MinPts           mark P as NOISE        else           C = next cluster           expand cluster (P, N, C, eps, MinPts ) expand cluster (P, N, C, eps, MinPts )     add P to cluster C     for each point P ' in N        if P 'is not visited           mark P 'as visited           N '= D.regionQuery (P', eps)           if sizeof (N ') > = MinPts              N = N joined with N '        if P 'is not yet member of any cluster           add P ' to cluster C Alternatively DBSCAN also be implemented recursively ( instead of the "join" is carried out a recursive call of N ), but this does not offer any significant advantages.

DBSCAN ( Recursive formulation)

The recursive implementation is more vivid as DBSCAN works. Since the recursion can be very high but the quantity -based normal formulation is the preferred implementation.

DBSCAN (D, eps, MinPts )     C = 0     for each unvisited point P in dataset D        mark P as visited        N = getNeighbors (P, eps)        if sizeof ( N) < MinPts           mark P as NOISE        else           C = next cluster           add P to cluster C           for P ' in N             if P 'is not yet member of any cluster               recursiveExpandCluster (P ', C, eps, MinPts ) recursiveExpandCluster (P, C, eps, MinPts )     add P to cluster C     if P is not visited       mark P as visited       N = getNeighbors (P, eps)       if sizeof ( N) > = MinPts         for P ' in N           if P 'is not yet member of any cluster             recursiveExpandCluster (P ', C, eps, MinPts ) Generalized DBSCAN

The generalized version of DBSCAN, GDBSCAN abstracted here from the neighborhood and the MinPts - density criterion. These will be replaced by a predicate " getNeighbors " and a predicate " isCorePoint ".

GDBSCAN (D, getNeighbors, isCorePoint )     C = 0     for each unvisited point P in dataset D        mark P as visited        N = getNeighbors (P)        if isCorePoint ( P, N)           C = next cluster           Expand cluster (P, N, C)        else           mark P as NOISE Expand cluster (P, N, C)     add P to cluster C     for each point P ' in N        if P 'is not visited           mark P 'as visited           N ' = getNeighbors (P')           if isCorePoint (P ', N' )              N = N joined with N '        if P 'is not yet member of any cluster           add P ' to cluster C If one uses a range request as getNeighbors and MinPts test as isCorePoint predicate, the result is obviously the original DBSCAN algorithm.

Extensions of DBSCAN

In this algorithm, based on, among other things,

  • OPTICS - Ordering Points To Identify the Clustering Structure
  • Shared Nearest Neighbor Clustering - Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High Dimensional Data
  • PreDeCon - Density Connected Clustering with Local Subspace Preferences
  • SubClu - Density connected Subspace Clustering for High Dimensional Data
  • 4C - Computing Clusters of Correlation Connected Objects

Use of DBSCAN

The algorithm DBSCAN is included in

  • Environment for Developing KDD - Applications Supported by Index -Structures ( ELKI )
  • Waikato Environment for Knowledge Analysis ( without index support, and very inefficient )
223512
de