Outlier

In statistics, a measured value or finding is called an outlier if it does not fit into an expected series of measurements or generally does not meet expectations. The "expectation" is usually defined as a range of variation around the expected value in which most measured values come to lie, e.g. the interquartile range Q75 − Q25. Values lying more than 1.5 times the interquartile range outside this interval are (usually arbitrarily) designated as outliers; a minimal sketch of this rule follows below. In the boxplot, particularly high outliers are shown separately. Robust statistics deals with the problems caused by outliers. In data mining, one is occupied with the detection of outliers.
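
A minimal sketch of the 1.5×IQR rule just described, assuming numeric data in a NumPy array; the helper name, the factor, and the sample values are illustrative:

```python
import numpy as np

def iqr_outliers(values, factor=1.5):
    """Flag values lying more than `factor` times the IQR outside [Q25, Q75]."""
    values = np.asarray(values, dtype=float)
    q25, q75 = np.percentile(values, [25, 75])
    iqr = q75 - q25
    lower, upper = q25 - factor * iqr, q75 + factor * iqr
    return values[(values < lower) | (values > upper)]

print(iqr_outliers([1.0, 1.2, 0.9, 1.1, 1.0, 4.8]))  # -> [4.8]
```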

Checking for measurement error

The key step is then to check whether the outlier is in fact a reliable and genuine result, or whether a measurement error is present.

Outlier Tests

Another approach was proposed by Ferguson et al. in 1961. Here it is assumed that the observations come from a hypothetical distribution; outliers are then observations that do not stem from this hypothetical distribution. The following outlier tests all assume that the hypothetical distribution is a normal distribution and check whether one or more of the extreme values do not come from that normal distribution (a sketch of the Grubbs test follows the list):

  • Grubbs outlier test
  • Nalimov outlier test
  • Dixon outlier test
  • Hampel outlier test
  • Baarda outlier test
  • Pope outlier test
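
As an illustration of the first test in the list, here is a minimal sketch of the two-sided Grubbs test under the normality assumption; the significance level and the sample values are placeholders:

```python
import numpy as np
from scipy import stats

def grubbs_test(x, alpha=0.05):
    """Two-sided Grubbs test: is the most extreme value an outlier,
    assuming the data are normally distributed?"""
    x = np.asarray(x, dtype=float)
    n = len(x)
    g = np.max(np.abs(x - x.mean())) / x.std(ddof=1)   # test statistic
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)        # t quantile
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return g, g_crit, g > g_crit   # True -> reject: extreme value is an outlier

print(grubbs_test([9.8, 10.1, 10.0, 9.9, 10.2, 13.5]))
```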

The outlier test according to Walsh, in contrast, is not based on the assumption of a particular distribution of the data. In the context of time series analysis, time series in which an outlier is suspected can be tested for it and then modeled with an outlier model.

Difference from extreme values

A popular approach is to use the boxplot to identify "outliers". Observations outside the whiskers are there arbitrarily designated as outliers. For the normal distribution one can easily calculate that just under 0.7 % of the mass of the distribution lies outside the whiskers. From a sample size of about 143 one would therefore (on average) expect at least one observation outside the whiskers (or about 7 observations outside the whiskers in 1000 observations). It therefore makes more sense to speak first of extreme values rather than of outliers.
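
These figures can be reproduced with a short computation; the sketch below uses SciPy's normal quantile and survival functions (only the 0.7 % claim and the derived sample sizes come from the text above):

```python
from scipy import stats

# Quartiles and upper whisker limit of the standard normal distribution
q25, q75 = stats.norm.ppf([0.25, 0.75])
iqr = q75 - q25
upper = q75 + 1.5 * iqr            # about 2.698 standard deviations

# Probability mass outside both whiskers
p_out = 2 * stats.norm.sf(upper)   # about 0.007, i.e. just under 0.7 %
print(p_out, 1 / p_out, 1000 * p_out)  # -> ~0.007, ~143, ~7
```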

Multivariate outliers

In multiple dimensions, the situation is even more complicated. In a scatterplot of two variables, an outlier in the lower right corner cannot be detected by inspecting each variable separately; in particular, it is not visible in the boxplots of the individual variables. Nevertheless, it clearly affects a linear regression.

Andrews curves

Andrews (1972) suggested representing each multivariate observation $x = (x_1, \dots, x_p)$ by a curve:

    f_x(t) = \frac{x_1}{\sqrt{2}} + x_2 \sin t + x_3 \cos t + x_4 \sin 2t + x_5 \cos 2t + \dots, \quad t \in [-\pi, \pi]

In this way each multivariate observation is mapped onto a two-dimensional curve on the interval $[-\pi, \pi]$. Because of the sine and cosine terms, the function $f_x(t)$ repeats itself outside this interval.

For every two observations $x_i$ and $x_j$ the following applies:

    \int_{-\pi}^{\pi} \bigl( f_{x_i}(t) - f_{x_j}(t) \bigr)^2 \, dt = \pi \sum_{k=1}^{p} (x_{ik} - x_{jk})^2

The left-hand side of the equation corresponds (at least approximately) to the area between the two curves, and the right-hand side is (up to the factor $\pi$) the squared Euclidean distance between the two data points.

If the distance between two data points is small, then the area between the curves must also be small; that is, the curves $f_{x_i}$ and $f_{x_j}$ must run close to each other. If, however, the distance between two data points is large, then the area between the curves must also be large, i.e. the curves $f_{x_i}$ and $f_{x_j}$ run very differently. A multivariate outlier would thus be visible as a curve whose course differs distinctly from that of all other curves.

Andrews curves have two disadvantages:

  • If the outlier is visible in exactly one variable, the deviating curve is perceived the better, the earlier this variable appears; ideally it should be the variable $x_1$. It therefore makes sense to sort the variables, e.g. taking as $x_1$ the variable with the largest variance, or to use the first principal component (see the sketch after this list).
  • If there are many observations, many curves have to be drawn, so that the course of a single curve is no longer visible.
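
A minimal sketch of Andrews curves under the formula above, assuming standard-normal toy data with one planted outlier; the helper andrews_curve and all parameter choices are illustrative, and the script also checks the integral identity numerically:

```python
import numpy as np
import matplotlib.pyplot as plt

def andrews_curve(x, t):
    """f_x(t) = x1/sqrt(2) + x2 sin t + x3 cos t + x4 sin 2t + ..."""
    f = np.full_like(t, x[0] / np.sqrt(2))
    for k, xk in enumerate(x[1:], start=1):
        harmonic = (k + 1) // 2                      # 1, 1, 2, 2, 3, ...
        f += xk * (np.sin(harmonic * t) if k % 2 else np.cos(harmonic * t))
    return f

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))                         # 30 observations, p = 4
X[0] += 5                                            # plant a multivariate outlier

t = np.linspace(-np.pi, np.pi, 2001)
# Numerical check of the integral identity (equal up to grid error)
dt = t[1] - t[0]
lhs = np.sum((andrews_curve(X[1], t) - andrews_curve(X[2], t)) ** 2) * dt
rhs = np.pi * np.sum((X[1] - X[2]) ** 2)
print(lhs, rhs)

for row in X:
    plt.plot(t, andrews_curve(row, t), color="gray", alpha=0.5)
plt.plot(t, andrews_curve(X[0], t), color="red")     # the outlier stands out
plt.show()
```

For DataFrames, pandas ships a ready-made variant as pandas.plotting.andrews_curves.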

Stahel-Donoho Outlyingness

Stahel (1981) and David Leigh Donoho (1982) defined the so-called outlyingness, a measure that indicates how far an observation lies from the bulk of the data. By considering all possible linear combinations $a^\top x_i$, i.e. the projections of the data point onto a unit vector $a$, the outlyingness results as

    \mathrm{out}(x_i) = \sup_{\|a\| = 1} \frac{\bigl| a^\top x_i - \mathrm{med}_j(a^\top x_j) \bigr|}{\mathrm{MAD}_j(a^\top x_j)}

with $\mathrm{med}_j(a^\top x_j)$, the median of the projected points, as a robust measure of location, and $\mathrm{MAD}_j(a^\top x_j)$, the median absolute deviation of the projected points, as a robust measure of dispersion; the MAD serves as standardization.

In practice, the outlyingness is computed by taking the maximum over several hundred or thousand randomly selected projection directions.
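
A minimal sketch of this random-projection approximation; the function name, the number of directions, the seed, and the toy data are illustrative:

```python
import numpy as np

def stahel_donoho_outlyingness(X, n_dirs=1000, seed=0):
    """Approximate each row's outlyingness by maximizing over
    randomly chosen unit projection directions."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    out = np.zeros(len(X))
    for _ in range(n_dirs):
        a = rng.normal(size=X.shape[1])
        a /= np.linalg.norm(a)              # random unit direction
        z = X @ a                           # projected points
        med = np.median(z)
        mad = np.median(np.abs(z - med))    # median absolute deviation
        if mad > 0:                         # skip degenerate directions
            out = np.maximum(out, np.abs(z - med) / mad)
    return out

X = np.vstack([np.random.default_rng(1).normal(size=(100, 3)), [[8.0, 8.0, 8.0]]])
print(stahel_donoho_outlyingness(X).argmax())  # -> 100, the planted outlier
```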

Outlier detection in data mining

The English term outlier detection refers to the segment of data mining whose objective is to identify unusual and conspicuous records. One application is the detection of (potentially) fraudulent credit card transactions within the large mass of valid transactions. The first algorithms for outlier detection were closely oriented towards the statistical models described above, but for computational and in particular runtime reasons the algorithms have since moved away from them. An important method among these is the density-based Local Outlier Factor; a short usage sketch follows.
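
A minimal usage sketch with the LocalOutlierFactor implementation from scikit-learn; the toy data and the parameter choice are illustrative:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X[:3] += 6                                  # plant three outliers

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                 # -1 = outlier, 1 = inlier
print(np.where(labels == -1)[0])            # indices of the flagged points
```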
