OPTICS algorithm

OPTICS ( Ordering Points To Identify the Clustering Structure, approximately, order points to identify the clustering structure ') is a density- based algorithm for cluster analysis. It was developed by Mihael Ankerst, Markus M. Breunig, Hans -Peter Kriegel and Jörg Sander. The basic principle of the algorithm comes from DBSCAN, however, the algorithm solves a major weakness of the DBSCAN algorithm: in contrast to this it can detect clusters of different density. At the same time it eliminates (largely) the parameters of the DBSCAN algorithm. To this end, OPTICS arranges the points of the data set linearly so that spatially adjacent points follow close to each other in this order. At the same time the so-called " reachability distance" is noted. If one draws these reachability distances in a chart, thus forming clusters "valleys" and can thus be identified.

The core idea

OPTICS used as DBSCAN two parameters, and. however, plays the role of a maximum distance and mainly serves to limit the complexity of the algorithm. Is one, then the complexity of the algorithm, otherwise, it may be reduced by means of suitable spatial indexing structures such as R *-tree.

In DBSCAN a point is a " core point " when its environment contains at least points. This is inverted in OPTICS, and the " core distance " is defined as the distance to the - next neighbor. The " core distance " is thus one value from a point in DBSCAN a "core point " would be. Has a point in his environment no neighbors, so its core distance is infinity, or "undefined".

The " reachability distance" of a point from a second point is defined as, ie as the maximum of the real distance and the core distance of the referencing point. Again, that in DBSCAN this distance would cause the two objects are in the same cluster.

OPTICS now arranges the objects in the database, by beginning at any unprocessed point determines the neighbors of the environment, and remembers in a priority queue according to their best-ever accessibility distance. The time now is always the one point taken next in order, which has the smallest reachability distance. By processing a new point, the reachability distances of unprocessed points can improve. By sorting the priority queue OPTICS examines the center of a cluster and processed fully, before he makes the next cluster.

Visualization

OPTICS can be visualized as a reachability graph (bottom). Here, the reachability distance is plotted on the y- axis points along the x- axis sorted by the computed by OPTICS order. " Valleys " in this diagram correspond to recognized clusters in the dataset, the depth of the valley shows the density of the cluster. As an additional visualization (top right ), each point connected to its reachability predecessor here. The resulting spanning tree visualizes the calculated density of OPTICS - connectedness of the points in the data set. When parameters have been here and used. This visualization was created using the OPTICS implementation in ELKI.

Pseudocode

The basic approach of OPTICS is similar to the habit of the " known but not yet processed " by DBSCAN, but instead a set of objects, these in a priority queue ( for example, an indexed heap) to be managed.

OPTICS (DB, eps, MinPts )      for each point p of DB         p.reachability - distance = UNDEFINED      for each unprocessed point p of DB         N = getNeighbors (p, eps)         mark p as processed         output p to the ordered list         Seeds = empty priority queue         if ( core -distance (p, eps, MinPts )! = UNDEFINED )            update ( N, p, Seeds, eps, MinPts )            for each next q in Seeds               N '= getNeighbors (q, eps)               mark q as processed               output q to the ordered list               if ( core -distance (q, eps, MinPts )! = UNDEFINED )                  update ( N ', q, Seeds, eps, MinPts ) In update () the priority queue is updated with the environment or from:

Update ( N, p, Seeds, eps, MinPts )      coredist = core -distance (p, eps, MinPts )      for each o in N         if ( o is not processed )            new- reach -dist = max ( coredist, dist (p, o))            if ( o.reachability -distance == UNDEFINED ) / / o is not in Seeds                o.reachability -distance = new -reach -dist                Seeds.insert (o, new -reach -dist )            else / / o in Seeds, check for improvement                if ( new- reach -dist < o.reachability -distance )                   o.reachability -distance = new -reach -dist                   Seeds.move -up (o, new -reach -dist ) OPTICS are the points so in a specific order from annotated with its smallest reachability distance ( the published algorithm also stores the core distance, but it is no longer needed ).

Extensions

OPTICS -OF is an anabolic on OPTICS method for outlier detection. An important advantage here is that clusters can be identified during the course of a normal OPTICS - run, without having to carry a separate outlier detection.

DeLiClu, density - link clustering combined ideas of single linkage clustering and OPTICS, thus eliminating parameters and achieves improved performance over OPTICS by using an R - tree as index.

HISC is a hierarchical ( axis- parallel ) subspace clustering method.

HiCO is a hierarchical clustering method for arbitrarily oriented subspaces.

DISH is an improvement of HISC for more complex hierarchies ( with sections of subspaces ).

Availability

A reference implementation is available ELKI of the Department, including implementations of DBSCAN and other comparison method in the software package.

622439
de