Projection Pursuit

Projection Pursuit (literally "pursuing the projection") is a statistical method for simplifying a set of high-dimensional data in such a way that "interesting" structures in it are revealed. To this end, a hyperplane (e.g., a plane) is determined in the space spanned by the data, and the data are projected onto it.

Projection Pursuit was first published in 1974 by John W. Tukey and Jerome H. Friedman and became more widely known through the work of Peter J. Huber (around 1985).

Multivariate data are usually analysed by means of a suitable mapping into fewer dimensions. The best-known example is the scatter plot, in which two dimensions form the axes of a coordinate system. Any such mapping always obscures existing structures to a greater or lesser degree; it can never make them more visible.

The idea of Projection Pursuit has been applied to a wide variety of statistical problems:

  • Exploratory Projection Pursuit for the detection of interesting structures in data
  • Projection Pursuit Regression
  • Projection Pursuit Density Estimation
  • Projection Pursuit Classification
  • Projection Pursuit Discriminant Analysis

Exploratory Projection Pursuit

In Exploratory Projection Pursuit, each hyperplane is assigned a measure (or index) that indicates how interesting the structure it contains is. Work by P. Diaconis and D. Freedman has shown that most structures in the hyperplanes resemble normally distributed data (see Figure 1). Therefore, many indices measure the distance between the structure in the hyperplane and a normal distribution.

The indices are then computed automatically, one after another, for all possible projections of the data onto a hyperplane whose dimension is lower by one or more than that of the original data. If data points are identified as part of an interesting structure, they are removed from the analysis. The process is repeated with the reduced data set until no structure is visible any more.

Indices

The multivariate data are first transformed so that the mean of every variable is zero and the variance-covariance matrix is the identity matrix. The indices are then defined in terms of the projection vectors for the hyperplane, the data projected into the hyperplane, the density function of the standard normal distribution (or of the corresponding normal distribution, if that is used instead), and the density function of the projected data in the hyperplane. A number of indices of this kind have been proposed, each of which is then maximized.
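The transformation described above is often called sphering or whitening. The following is a minimal sketch using NumPy, assuming the sample covariance matrix has full rank; the function name `sphere` and the example data are illustrative and not part of the original description.

```python
import numpy as np

def sphere(data):
    """Centre the data and rescale it so the sample covariance is the identity."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    # Eigendecomposition of the symmetric covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Build cov^(-1/2); assumes all eigenvalues are strictly positive.
    whitening = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T
    return centered @ whitening

# Illustrative data: correlated variables with unequal variances.
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.5]])
raw = rng.normal(size=(500, 3)) @ mixing
sphered = sphere(raw)
```

After this step every projection onto a unit vector has mean zero and unit variance, so an index can focus on the shape of the projected distribution rather than on its location or scale.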

In principle, any test statistic that belongs to a test for normality can be used as an index. Maximizing it leads to the hyperplanes in which the data are not normally distributed. Special versions of the indices can also be maximized by particular structures, such as a central hole or a central mass.
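As a sketch of how a normality statistic can serve as an index, the example below uses the absolute excess kurtosis of a one-dimensional projection and searches over random unit vectors. Both the choice of index and the random search are assumptions made for illustration; the names `kurtosis_index` and `random_search` are hypothetical, and a practical implementation would typically use a proper optimizer.

```python
import numpy as np

def kurtosis_index(projected):
    """Absolute excess kurtosis; close to 0 for normally distributed data."""
    z = (projected - projected.mean()) / projected.std()
    return abs(np.mean(z ** 4) - 3.0)

def random_search(data, index, n_candidates=2000, seed=0):
    """Try random unit vectors and keep the one with the largest index."""
    rng = np.random.default_rng(seed)
    best_direction, best_value = None, -np.inf
    for _ in range(n_candidates):
        direction = rng.normal(size=data.shape[1])
        direction /= np.linalg.norm(direction)
        value = index(data @ direction)
        if value > best_value:
            best_direction, best_value = direction, value
    return best_direction, best_value

# Illustrative data: two clusters along the first axis, pure noise elsewhere.
rng = np.random.default_rng(1)
n = 1000
clusters = rng.choice([-1.5, 1.5], size=n) + 0.5 * rng.normal(size=n)
data = np.column_stack([clusters, rng.normal(size=n), rng.normal(size=n)])
direction, value = random_search(data, kurtosis_index)
```

In this constructed example the selected direction should point mostly along the first axis, because that is the only coordinate whose distribution departs from normality.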

The unknown density function of the projected data is estimated either by using a kernel density estimator or by an orthonormal function expansion.
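A kernel-based estimate of this kind can be sketched as follows. The example builds a Gaussian kernel density estimate by hand and measures the integrated squared difference from the standard normal density on a grid. The index, the grid, and the bandwidth rule are assumptions chosen for illustration, and the function names are hypothetical.

```python
import numpy as np

def gaussian_kde(sample, grid, bandwidth):
    """Kernel density estimate with a Gaussian kernel, evaluated on grid."""
    diffs = (grid[:, None] - sample[None, :]) / bandwidth
    kernel = np.exp(-0.5 * diffs ** 2) / np.sqrt(2.0 * np.pi)
    return kernel.mean(axis=1) / bandwidth

def l2_distance_index(projected, n_grid=401):
    """Integrated squared difference between the KDE and the standard normal density."""
    z = (projected - projected.mean()) / projected.std()
    grid = np.linspace(-5.0, 5.0, n_grid)
    # Normal reference bandwidth for standardized data (an assumed choice).
    bandwidth = 1.06 * z.size ** (-0.2)
    estimate = gaussian_kde(z, grid, bandwidth)
    normal = np.exp(-0.5 * grid ** 2) / np.sqrt(2.0 * np.pi)
    step = grid[1] - grid[0]
    # Riemann sum approximating the integral of the squared difference.
    return np.sum((estimate - normal) ** 2) * step

# Illustrative comparison: a normal sample versus a two-cluster sample.
rng = np.random.default_rng(2)
normal_sample = rng.normal(size=1000)
bimodal_sample = rng.choice([-1.5, 1.5], size=1000) + 0.5 * rng.normal(size=1000)
normal_index = l2_distance_index(normal_sample)
bimodal_index = l2_distance_index(bimodal_sample)
```

The estimate depends on the bandwidth, so the value of such an index is only comparable between projections when the same bandwidth rule is used throughout.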

Related methods

The following can be regarded as special cases of Exploratory Projection Pursuit:

  • the Grand Tour, in which the structures are discovered by the viewer directly in the graphics, and
  • principal component analysis, in which the index is the variance of the projected data.
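For principal component analysis, the connection can be written out explicitly. Assuming centred data $X$ with covariance matrix $\Sigma$ and a unit projection vector $\alpha$ (symbols chosen here for illustration), the index is the variance of the projection, and its maximum is the largest eigenvalue of $\Sigma$:

```latex
I(\alpha) = \operatorname{Var}\!\left(\alpha^{\top} X\right) = \alpha^{\top} \Sigma \, \alpha,
\qquad
\max_{\|\alpha\| = 1} I(\alpha) = \lambda_{\max}(\Sigma).
```

This index is only informative for data that have not been transformed to identity covariance, since after that transformation every unit projection has variance one.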

Projection Pursuit Regression

In the regression case, the unknown regression function is iteratively represented by regression functions on the projected data:
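In commonly used notation (the symbols are assumptions introduced here), the regression function $m$ is approximated by a sum of $K$ univariate functions $g_k$, each applied to a one-dimensional projection of the predictors $x$ with projection vector $\alpha_k$. At step $k$, the new term is fitted to the residuals left by the previous terms:

```latex
\hat{m}(x) = \sum_{k=1}^{K} g_k\!\left(\alpha_k^{\top} x\right),
\qquad
r_i^{(k)} = y_i - \sum_{j=1}^{k} g_j\!\left(\alpha_j^{\top} x_i\right).
```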

Projection Pursuit Density Estimation

An iterative method is also used for density estimation. The unknown density function is approximated as a product of density functions of the projected data.

The starting point is the density function of the multivariate normal distribution, with its parameters estimated from the data. This normal density is then corrected step by step. In contrast to the regression case, however, the algorithm is considerably more complicated, since no observations are available to which the fit could be adjusted.
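The product form can be written as follows, in notation assumed here for illustration: $f_0$ is the multivariate normal density with estimated mean $\hat{\mu}$ and covariance $\hat{\Sigma}$, and each $g_k$ is a univariate correction factor applied along the projection vector $\alpha_k$.

```latex
\hat{f}_K(x) = f_0(x) \prod_{k=1}^{K} g_k\!\left(\alpha_k^{\top} x\right),
\qquad
\hat{f}_k(x) = \hat{f}_{k-1}(x)\, g_k\!\left(\alpha_k^{\top} x\right),
\qquad
f_0 = \mathcal{N}\!\left(\hat{\mu}, \hat{\Sigma}\right).
```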
