Principal component analysis

Principal component analysis (PCA; the underlying mathematical procedure is also known as the principal axis transformation or singular value decomposition) is a method of multivariate statistics. It is used to structure, simplify and illustrate large data sets by approximating a large number of statistical variables with a smaller number of maximally informative linear combinations (the "principal components"). In image processing in particular, principal component analysis is also called the Karhunen-Loève transform. It must be distinguished from factor analysis, with which it has formal similarity and in which it can be used as an approximation method for extracting factors. (The difference between the two methods is explained in the article on factor analysis.)

There are various generalizations of PCA, for example principal curves, principal surfaces, and kernel PCA.


Principal component analysis was introduced by Karl Pearson in 1901 and further developed in the 1930s by Harold Hotelling. Like other statistical methods, it became widespread only with the increasing availability of computers in the third quarter of the 20th century. The first applications came from biology.

Concept of principal component analysis

The underlying data typically have the structure of a matrix: for n subjects or objects, p features each were measured. Such a data set can be pictured as a set of n points in p-dimensional space. The aim of principal component analysis is to project these data points into a q-dimensional subspace (q < p) in such a way that as little information as possible is lost and the redundancy present in the data points in the form of correlation is summarized.

Mathematically, a principal axis transformation is performed: the correlation of multidimensional features is minimized by transforming them into a vector space with a new basis. The principal axis transformation can be specified by an orthogonal matrix that is formed from the eigenvectors of the covariance matrix. Principal component analysis is therefore problem-dependent, since a separate transformation matrix has to be calculated for each data set. The rotation of the coordinate system is carried out so that the covariance matrix is diagonalized, i.e. the data are decorrelated (the correlations are the off-diagonal entries of the covariance matrix). For normally distributed data, this means that after the PCA the individual components of each data record are statistically independent of each other, since the normal distribution is fully characterized by the zeroth (normalization), first (mean) and second moments (covariances). If the data are not normally distributed, they will still be statistically dependent even after the PCA, although now decorrelated. The PCA is therefore an "optimal" method only for normally distributed data.
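The principal axis transformation described here can be sketched in a few lines of NumPy. This is an illustrative sketch, not part of the original presentation; the example data and variable names are our own:

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated two-dimensional example data: the second feature is
# a noisy multiple of the first.
x = rng.normal(size=500)
data = np.column_stack([x, 2.0 * x + rng.normal(scale=0.5, size=500)])

# Center the data and compute the covariance matrix.
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# The eigenvectors of the covariance matrix form the orthogonal
# matrix of the principal axis transformation.
eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]         # re-sort in descending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Rotating into the eigenbasis diagonalizes the covariance matrix,
# i.e. the transformed data are decorrelated.
scores = centered @ eigvecs
rotated_cov = np.cov(scores, rowvar=False)
print(np.round(rotated_cov, 4))           # off-diagonal entries ≈ 0
```

For data that are not normally distributed, the rotated coordinates are merely uncorrelated, not independent, exactly as noted above.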

Example of use

Consider the artillery ships of the Second World War, divided into the classes battleships, heavy cruisers, light cruisers and destroyers. Data are available for 200 ships, with the features length, beam, displacement, draft, engine power, speed (the highest sustainable top speed), radius of action and crew strength. The features length, beam, displacement and draft all measure something similar; one could speak of a factor "size" here. The question is whether other factors also determine the data. There is indeed a second significant factor, determined primarily by engine power and top speed. It could be summarized as a factor "speed".

Other examples of applications of the principal component analysis are:

  • If principal component analysis is applied to the purchasing behavior of consumers, there may be latent factors such as social status, age or marital status that motivate certain purchases. Targeted advertising could then channel the desire to buy accordingly.
  • If a statistical model has a large number of features, the number of variables in the model can, if necessary, be reduced by means of principal component analysis, which usually improves the quality of the model.
  • Principal component analysis is commonly applied in image processing, especially in remote sensing, where satellite images are analyzed and conclusions drawn from them.
  • Another area is artificial intelligence, together with neural networks, where the PCA is used for feature separation in the context of automatic classification or pattern recognition.



The data lie in a two-dimensional Cartesian coordinate system as a point cloud.

Best linear approximation to the data set

The calculation of the principal components can be viewed as an iterative process. In the graph on the right, for the data points (open circles), the straight line is sought that best approximates the data. The error is the sum of the squared Euclidean distances between the line and the data points. For the data point at the top right, this distance is the red line, which is perpendicular to the black line.

This error is calculated for all straight lines that run through the center (mean) of the data, the thick black dot. The line with the smallest error is the first principal component.

Then a further straight line is sought that passes through the center of the data and is orthogonal to the first line: the second principal component. In the case of two-dimensional data, this is simply the line perpendicular to the first principal component. Otherwise, the accumulated distance between the data points and the plane spanned by the two lines must again be minimal. Thereafter a third, fourth, up to p-th line is sought, forming the third, fourth, up to p-th principal component.
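This iterative search for one best-fitting direction after another can be mimicked in code. The following sketch (our own illustration; the function name and parameters are hypothetical) finds each principal component by power iteration on the covariance matrix and then removes ("deflates") the found direction before searching for the next, orthogonal one:

```python
import numpy as np

def principal_components(data, k, iters=200):
    """Find the first k principal components one after another.

    Each pass finds the direction through the data center with the
    smallest approximation error for the remaining data; its
    contribution is then removed so that the next, orthogonal
    direction can be sought."""
    centered = data - data.mean(axis=0)
    residual = centered.copy()
    components = []
    for _ in range(k):
        cov = residual.T @ residual / (len(residual) - 1)
        v = np.ones(cov.shape[0]) / np.sqrt(cov.shape[0])
        for _ in range(iters):                 # power iteration
            v = cov @ v
            v /= np.linalg.norm(v)
        components.append(v)
        # Deflation: remove the variance along the found direction.
        residual = residual - np.outer(residual @ v, v)
    return np.array(components)

# The directions found this way agree (up to sign) with the
# eigenvectors of the covariance matrix.
rng = np.random.default_rng(0)
data = rng.normal(size=(300, 3)) * np.array([5.0, 2.0, 0.5])
comps = principal_components(data, 2)
evals, evecs = np.linalg.eigh(np.cov(data - data.mean(axis=0), rowvar=False))
order = np.argsort(evals)[::-1]
print(abs(comps[0] @ evecs[:, order[0]]))      # ≈ 1.0
```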

Maximizing the variance

The distance between the center of the data and a data point is independent of which straight line through the center is considered as the "reference" (see the red line from the center of the data to the data point at top right). Using the Pythagorean theorem, however, this distance can be decomposed into a component in the direction of the black line and a component at right angles to it. Minimizing the distances perpendicular to the line (while the distance to the data center, the length of the red line, stays fixed) therefore means maximizing the distances in the direction of the black line. The summed squares of the distances in the direction of the black line form the variance of the data in this direction.

Total variance: the squared distances of the data points from the center decompose as

  ∑ᵢ ‖xᵢ − x̄‖² = ∑ᵢ (vᵀ(xᵢ − x̄))² + ∑ᵢ dᵢ²,

where v is a unit vector along the straight line through the center x̄ and dᵢ is the perpendicular distance of the point xᵢ from that line.
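This Pythagorean decomposition can be checked numerically for any direction through the data center (a small self-contained check, not part of the original article):

```python
import numpy as np

rng = np.random.default_rng(1)
centered = rng.normal(size=(100, 2))
centered -= centered.mean(axis=0)          # data relative to their center

v = np.array([1.0, 1.0]) / np.sqrt(2)      # any unit direction through the center
along = centered @ v                        # components along the line
perp = centered - np.outer(along, v)        # components perpendicular to it

# Squared distances to the center split exactly into the part along
# the line and the part perpendicular to it (Pythagorean theorem).
total = np.sum(centered ** 2)
print(np.isclose(total, np.sum(along ** 2) + np.sum(perp ** 2)))   # True
```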

This leads to the following algorithm: the first axis is placed through the point cloud so that the variance of the data in this direction is maximal. The second axis is perpendicular to the first axis; in its direction the variance is the second largest, and so on.

For p-dimensional data there are thus in principle p axes that are perpendicular to each other (orthogonal). The total variance of the data is the sum of these "axis variances". With these p axes, a new coordinate system is placed in the point cloud. The new coordinate system can be represented as a rotation of the original axes.

If the first q (q < p) axes now cover a sufficiently large proportion of the total variance, the principal components represented by these new axes appear sufficient for the information content of the data. The total variance of the data is thus a measure of its information content.

Often the principal components cannot be interpreted in terms of content. In statistics, one says that no intelligible hypothesis can be attributed to them (see factor analysis).

Statistical model

Consider p random variables X₁, …, X_p that are centered with respect to their expected values, i.e. their expected values have been subtracted. These random variables are combined into a p-dimensional random vector X. Its expected value vector is the zero vector, and its covariance matrix Σ is symmetric and positive definite. The eigenvalues λ₁ ≥ λ₂ ≥ … ≥ λ_p of the matrix Σ are sorted in descending order of size and listed as the diagonal elements of the diagonal matrix Λ. The eigenvectors belonging to them form the orthogonal matrix Γ. It then holds that

  Σ = Γ Λ Γᵀ.

If the random vector is linearly transformed to Y = Γᵀ X, then its covariance matrix is just the diagonal matrix Λ.

For clarification, consider a three-dimensional random vector X = (X₁, X₂, X₃)ᵀ. The eigenvalues λ₁ ≥ λ₂ ≥ λ₃ of its covariance matrix form the diagonal matrix Λ, and the associated eigenvectors can be summarized as the columns of the matrix Γ. The matrix-vector multiplication Y = Γᵀ X then yields the equations

  Y_j = γ₁ⱼ X₁ + γ₂ⱼ X₂ + γ₃ⱼ X₃  for j = 1, 2, 3,

and the variance of Y_j is λ_j.

Thus the first principal component has the largest share of the total variance of the data, the second principal component the second largest share, and so on. The elements γᵢⱼ could be interpreted as the contribution of the variable Xᵢ to the factor Yⱼ. In this context the matrix Γ is called the loading matrix; it indicates "how highly a variable loads on a factor".

Estimation of model parameters

If concrete data with p features have been collected (that is, each data point is a p-dimensional vector), the sample covariance or correlation matrix is calculated from the feature values. From this matrix, the eigenvalues and eigenvectors are then determined for the principal component analysis. Since the covariance matrix is symmetric, a total of p(p+1)/2 parameters must be estimated for its calculation. This is only meaningful if the number n of data points in the data set is considerably larger, that is, if n ≫ p. Otherwise, the determination of the covariance matrix is highly error-prone, and this method should not be applied.
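The estimation step can be sketched as follows (an illustration only; the sample size, feature count and names are our own choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 1000, 5                       # n much larger than p, as required
data = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))

# A symmetric p x p covariance matrix has p*(p+1)/2 free parameters
# that must be estimated from the sample.
free_params = p * (p + 1) // 2
print(free_params)                   # 15

# Eigenvalues of the sample covariance matrix, largest first,
# and each one's share of the total variance.
cov = np.cov(data, rowvar=False)
eigvals = np.linalg.eigvalsh(cov)[::-1]
explained = eigvals / eigvals.sum()
print(np.round(explained, 3))
```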

Example with three variables

The application example above is now illustrated with figures:

Consider the variables length, beam and speed. The scatter plots give an impression of the joint distribution of these variables.

A principal component analysis of these three variables was carried out with the statistical software package SPSS. In the resulting loading matrix, the contributions of length and beam to the first factor are particularly large; for the second factor, mainly the contribution of the speed is large. The third factor is unclear and probably irrelevant.

The total variance of the data is distributed among the principal components in such a way that the first two principal components already cover 97.64% of the total variance of the data. The third factor contributes nothing worth mentioning to the information content.

Example with eight variables

Now all eight features of the artillery ships were subjected to a principal component analysis. The table of the loading matrix, here called "Component Matrix", shows that above all the variables length, beam, draft, displacement and crew strength load highly on the first principal component. This component could be described as "size". The second component is explained in large part by engine power and speed in knots; it could be called "speed". A third component still loads highly on radius of action.

The first two factors already cover about 84% of the information in the vessel data; the third factor captures about another 10%. The additional contribution of the remaining components is irrelevant.

Application in cluster analysis and dimensionality reduction

Principal component analysis (PCA) is also frequently used in cluster analysis and to reduce the dimension of the parameter space, in particular when no model of the structure of the data is available. PCA exploits the fact that the (orthogonal) coordinate system is rotated so that the covariance matrix is diagonalized. In addition, PCA sorts the coordinate axes (the principal components) so that the first principal component contains the largest share of the total scatter (total variance) in the data set, the second principal component the second largest share, and so on. As illustrated by the examples in the previous sections, one can usually delete the rearmost principal components (i.e. those which contain only a small proportion of the total scatter) without replacement and without a significant loss of information.
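Deleting the rearmost principal components amounts to projecting the data onto the first q axes. A minimal sketch (the function name `reduce_dimension` and the example data are our own):

```python
import numpy as np

def reduce_dimension(data, q):
    """Project centered data onto its first q principal components
    and report the share of the total variance that is retained."""
    centered = data - data.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = centered @ eigvecs[:, :q]       # reduced representation
    retained = eigvals[:q].sum() / eigvals.sum()
    return scores, retained

# Example: three features, of which the third carries almost no scatter.
rng = np.random.default_rng(5)
data = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.1])
scores, retained = reduce_dimension(data, 2)
print(scores.shape)                          # (200, 2)
print(retained > 0.99)                       # True: little information lost
```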

The basic assumption for using PCA for cluster analysis and dimension reduction is: the directions with the largest scatter (variance) contain the most information.

In this context it is very important that this basic assumption is merely a working hypothesis, which does not always hold. To illustrate this, two examples follow:

  • Signal variance: The graph on the right titled "PCA signal variance" shows an example in which the assumption holds. The data set consists of two clusters (red and green) that are clearly separated. The scatter of the data points within each cluster is very small compared to the "distance" between the two clusters. Accordingly, the first principal component will be x_1. It is also clear that the first principal component x_1 alone is sufficient to separate the two clusters from each other, while the second principal component x_2 contains no useful information about them. The number of dimensions can thus be reduced from 2 to 1 (by neglecting x_2) without losing essential information about the two clusters. The total variance of the data set is thus dominated by the signal (the two separate clusters).
  • Noise variance: The graph on the right titled "PCA noise variance" shows an example in which the assumption does not hold and PCA cannot be used for dimension reduction. The scatter within the two clusters is now significantly larger and accounts for the major part of the total scatter. Assuming that this scatter within the clusters is caused by noise, the case is called noise variance. The first principal component will be x_2, which contains no information whatsoever about the separability of the two clusters.
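The two cases can be reproduced with synthetic data. The cluster positions and scatters below are our own hypothetical choices, made to mimic the two graphs:

```python
import numpy as np

rng = np.random.default_rng(3)

def first_pc(data):
    """Direction of largest variance of a data set."""
    centered = data - data.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    return eigvecs[:, np.argmax(eigvals)]

# Signal variance: two tight clusters far apart along x_1.
tight = np.vstack([
    rng.normal([-5.0, 0.0], 0.3, size=(100, 2)),
    rng.normal([5.0, 0.0], 0.3, size=(100, 2)),
])
print(first_pc(tight))    # points (almost) along x_1: separation is kept

# Noise variance: the within-cluster scatter along x_2 dominates.
noisy = np.vstack([
    rng.normal([-1.0, 0.0], [0.3, 5.0], size=(100, 2)),
    rng.normal([1.0, 0.0], [0.3, 5.0], size=(100, 2)),
])
print(first_pc(noisy))    # points (almost) along x_2: separation is lost
```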

These two examples show how PCA can be used for dimension reduction and cluster analysis, and that this is not always possible. Whether the basic assumption that the directions of largest scatter are really also the most interesting ones holds or not depends on the given data set and is often difficult to check, especially when the number of dimensions is very high and the data therefore cannot be fully visualized any more.

Connection with multidimensional scaling

Both multidimensional scaling and principal component analysis compress the data. If Euclidean distances are used in (metric) multidimensional scaling and the dimension of the configuration equals the number of principal components, both methods give the same solution. This is because the diagonalization of the covariance matrix (or the correlation matrix, if working with standardized data) in principal component analysis corresponds to a rotation of the coordinate system. The distances between the observations, which form the starting point of multidimensional scaling, therefore remain the same.

In multidimensional scaling, however, other distances may also be used; in this respect, principal component analysis can be considered a special case of multidimensional scaling.
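The equivalence for Euclidean distances can be verified numerically: classical (metric) MDS on the Euclidean distance matrix reproduces the PCA scores up to the sign of each axis (an illustrative check with synthetic data, not from the original article):

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(size=(50, 3))
centered = data - data.mean(axis=0)

# PCA scores: project the centered data onto all principal components.
eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
order = np.argsort(eigvals)[::-1]
pca_scores = centered @ eigvecs[:, order]

# Classical (metric) MDS from the Euclidean distance matrix.
sq_dist = np.sum((centered[:, None, :] - centered[None, :, :]) ** 2, axis=-1)
n = len(data)
J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
B = -0.5 * J @ sq_dist @ J                    # double-centered Gram matrix
vals, vecs = np.linalg.eigh(B)
order = np.argsort(vals)[::-1]
mds_coords = vecs[:, order[:3]] * np.sqrt(vals[order[:3]])

# The two solutions agree up to the sign of each axis.
print(np.allclose(np.abs(pca_scores), np.abs(mds_coords), atol=1e-6))  # True
```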