Local Outlier Factor

The Local Outlier Factor ( LOF, such as "Local outlier factor ") is an algorithm for detecting density-based outliers by Markus M. Breunig, Hans -Peter Kriegel, Raymond T. Ng, Jörg Sander and. The core idea of LOF is to compare the density of a point, with the densities of its neighbors. One point that is " closer " than its neighbors, is located in a cluster. A point with a much lower density than its neighbors, however, is an outlier.

LOF has many concepts together with the cluster analysis algorithms DBSCAN and OPTICS.

Basic principle

LOF defines the " local environment " of a point above its nearest neighbors. This distance is used to estimate a local density. In a second step, the ratio of the local densities of its neighbors, and the local density of the dot itself is formed. This value moves close to when a point in a region of uniform density. For objects, but they are separated by such a surface, the value is significantly larger and thus identifies outliers.

Formally

That's the distance of the object to its k- nearest neighbors. This quantity can optionally contain more than k objects, if there are multiple objects at the same distance. We call this "k - neighborhood " here.

Based on this distance, the " reachability distance" is defined:

The reachability distance of an object is thus either of the true distance, but at least by the. Objects belonging to the k-nearest neighbors and thus were considered equidistant. The motivation for this distance definition is to get stable results. ( These are no longer of a distance function in a mathematical sense, since it is not symmetric. )

The local reachability density ( "local reachability density", " lrd " ) of an object is defined as

This density is therefore the inverse of the average reachability distance of the object from its neighbors, not the other way around, the average reachability distance of the neighbors, which would by definition. When duplicates in a data reachability density of these objects is infinite.

The local reachability density is now compared with those of its neighbors:

The "Local Outlier Factor" ( LOF ) is thus the " Average reachability density of neighbors " divided by the reachability density of the object itself, a value of approximately means that the object has a similar density with its neighbors (ie, no outliers ). A value smaller than means even a denser region (which is a so-called " inliers " would be ), while significantly higher values ​​marked as an outlier.

Benefits

In contrast to many global method for outlier detection LOF can deal with regions of different density in the same record. Points with a "medium" density in an environment of " high " can be classified as an outlier of LOF, while a point of "average " density in a " thin " environment is not explicitly identified as such.

While the geometric intuition of LOF makes sense only in low-dimensional vector spaces, the method can be applied to any data, on which a " contrast " can be defined. It must not be a distance function in the strict mathematical sense, the triangle inequality, for example, is not required. The algorithm was successfully used on different data sets, for example, to detect attacks on computer networks, where he gave a better recognition rate than the comparison method.

Disadvantages

An important disadvantage of LOF is that the resulting values ​​are difficult to interpret. Values ​​microns and less are certainly not outliers, but there is no clear rule, a value above which a point is a significant outlier. In a very even record values ​​are conspicuous by, in a data set with strong density fluctuations, a value can be like a very normal data point. In the worst case, such differences occur even in different parts of the same data set.

Extensions

  • Feature Bagging for Outlier Detection apply LOF in multiple projections and combines the results to obtain better results in high-dimensional data.
  • Local Outlier Probability (LOOP) is a derivative of LOF method that statistically estimates the density to be less dependent on the exact value of k. In addition, the result is statistically normal in the range of values ​​to provide better interpretable results.
  • Interpreting and Unifying Outlier Scores introduces a normalization for LOF and other methods that statistically normalized in the interval to improve the scores for the user-friendliness of the result.

Availability

A reference implementation is available in the software package ELKI available, including implementations of settlement procedures.

527264
de