Discriminant function analysis

Discriminant analysis is a multivariate statistical method used to distinguish between two or more groups that are described by several features (variables). It can examine groups for significant differences in their characteristics and identify which features are suitable or unsuitable for separating them. It was first described by R. A. Fisher in 1936 in "The use of multiple measurements in taxonomic problems".

Discriminant analysis is used in statistics and in machine learning to find a transformation of the feature space that gives a good representation of the features; it serves as a classifier (discriminant) or as a method of dimensionality reduction. It is related to principal component analysis (PCA), which also seeks a good low-dimensional representation of the observations but, unlike discriminant analysis, does not take the class membership of the data into account.


Problem

We consider objects that each belong to exactly one of several similar classes. For a set of objects it is known which class each one belongs to, and the values of several features are observed on each object. From this information, linear boundaries between the classes are to be determined, so that objects whose class membership is unknown can later be assigned to one of the classes. Linear discriminant analysis is thus a classification method.

Examples:

  • Borrowers can be divided, for example, into creditworthy and not creditworthy. When a bank customer applies for a loan, the bank tries to infer the customer's future ability and willingness to pay from characteristics such as level of income, number of credit cards, length of employment at the current job, and so on.
  • Customers of a supermarket chain can be classified as buyers of branded products or of no-name products. Suitable characteristics would be, for example, the total annual expenditure in these stores, the share of branded products in that expenditure, and so on.

At least one metrically scaled random feature X is observed on each object. In the model of discriminant analysis this feature is interpreted as a random variable X. There are at least two different groups (populations), and the object comes from one of these populations. By means of an assignment rule, the classification rule, the object is assigned to one of these populations. The classification rule can often be expressed by a discriminant function.

Classification with known distribution parameters

For better understanding, the procedure is explained with the help of examples.

Maximum likelihood method

One assignment method is the maximum likelihood method: it assigns the object to the group for which its likelihood is greatest.

One feature - two groups - equal variances

Example

A nursery has the opportunity to acquire a larger quantity of carnation seed of a given variety at a low price. To rule out the suspicion that this is old, stale seed, a germination test is carried out: 1 g of seed is sown and the number of seeds that germinate is counted. From experience it is known that the number of germinating seeds per 1 g of seed is approximately normally distributed. For fresh seed (population I) on average 80 seeds germinate, for old seed (population II) only 40.

  • Population I: the number of fresh seeds that germinate is distributed as N(80, σ²).
  • Population II: the number of old seeds that germinate is distributed as N(40, σ²), with the same variance σ² in both populations.

The seed sample now yields a certain number x of germinating seeds. The graph of the two densities shows that for this observed value the likelihood of population I is greatest, so the seed sample is classified as fresh.

From the graph it can also be seen that the classification rule (decision rule) may be stated as: assign the sample to population I if the observed number of germinating seeds is at least 60, otherwise to population II. The intersection of the two density curves (at x = 60) serves as the decision boundary.
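
As a small illustration, the maximum-likelihood assignment for this one-dimensional case can be sketched in Python. Only the means 80 and 40 and the resulting boundary at 60 come from the example; the common standard deviation is not stated in the text, so sigma = 10 below is a purely illustrative assumption.

```python
from scipy.stats import norm

# Group means from the example; the common standard deviation is not stated
# in the text, so sigma = 10 is a purely illustrative assumption.
mu_fresh, mu_old, sigma = 80.0, 40.0, 10.0

def classify_ml(x):
    """Assign the observed germination count x to the population with the
    larger likelihood (density value)."""
    lik_fresh = norm.pdf(x, loc=mu_fresh, scale=sigma)
    lik_old = norm.pdf(x, loc=mu_old, scale=sigma)
    return "population I (fresh)" if lik_fresh > lik_old else "population II (old)"

# With equal variances the two densities intersect halfway between the means,
# i.e. at x = (80 + 40) / 2 = 60, which is exactly the decision boundary above.
print(classify_ml(70))  # population I (fresh)
print(classify_ml(55))  # population II (old)
```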

Desirable properties of the feature distributions

Equal variances

The feature should have the same variance in both groups. With unequal variances, several assignment regions can arise for one and the same group.

The graph above shows two groups with different variances. The flat normal density has a larger variance than the narrow, high one. One can see how the widely spread density of group I encloses that of group II. If the sample yields, for example, x = 10, the seed would have to be classified as fresh, because at this point the probability density of group I is greater than that of group II.

In the "standard model" of the discriminant is assumed equal variances and covariances.

Large inter-group variance

The variance between the group means, the inter-group variance, should be large, because then the distributions overlap less and the separation of the groups is sharper.

Small intra-group variance

The variance within a group, the intra-group variance, should be as small as possible; then the distributions overlap less and the separation is better.

Several features - two groups - equal covariance matrices

The object of interest may have several observable features. As a model one then obtains a random vector X as the distributional structure. This vector is distributed with mean vector μ and covariance matrix Σ. The concrete realization is the feature vector x, whose components xj are the individual feature values.

With two groups the observed object is assigned, analogously to the above, to the group for which the distance of the feature vector x to the expectation vector is minimal. As distance measure one uses here, in a somewhat transformed form, the Mahalanobis distance.
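
For reference, the squared Mahalanobis distance of a feature vector x from the expectation vector μ of a group with common covariance matrix Σ is

$$d_M^2(x,\mu) = (x-\mu)^\top \Sigma^{-1} (x-\mu).$$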

Example

In a large amusement park the spending behaviour of visitors is recorded. In particular, one is interested in whether visitors will spend the night in the park's hotel. For each family, the total expenditure incurred up to 4 p.m. (first feature) and the expenditure on souvenirs (second feature) are recorded. The marketing department knows from long experience that the corresponding random variables are approximately jointly normally distributed, with variances of about 25 [€²] each and covariance Cov12 = 20 [€²]. With regard to hotel reservations, the consumers can be divided by their spending behaviour into two groups I and II, so that the known distribution parameters can be listed in the following table:

For group I, the random vector therefore follows a multivariate normal distribution with the mean vector μI given in the table and the covariance matrix

$$\Sigma = \begin{pmatrix} 25 & 20 \\ 20 & 25 \end{pmatrix}.$$

The same covariance matrix applies for group II, with its own mean vector μII.

The populations of the two groups are indicated in the following diagram as dense point clouds. The expenditure on souvenirs can be regarded as luxury spending. The pink dot marks the expectation vector of group I, the light blue dot that of group II.

Another family has visited the amusement park. By 4 p.m. it has spent a total of €65 and €35 on souvenirs (green dot in the graph). Should a hotel room be kept ready for this family?

A look at the graph suggests that the distance of the green dot to the expectation vector of group I is the smaller one. The hotel management therefore expects that the family will take a room.

For the Mahalanobis distance one computes the distance of the feature vector x = (65, 35)ᵀ to the centre of group I and to the centre of group II, using the common covariance matrix Σ (the numerical values follow from the group means in the table above); the distance to group I turns out to be the smaller one.
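
A minimal sketch of this distance-based assignment in Python, assuming placeholder group means (the actual means from the article's table are not reproduced here); only the covariance matrix and the new family's observation (65, 35) are taken from the text:

```python
import numpy as np

# Common covariance matrix from the example: variances 25 [EUR^2], covariance 20 [EUR^2].
Sigma = np.array([[25.0, 20.0],
                  [20.0, 25.0]])
Sigma_inv = np.linalg.inv(Sigma)

# The group mean vectors stand in the (omitted) table of the article; the values
# below are hypothetical placeholders, chosen only to make the sketch runnable.
mu_I = np.array([70.0, 30.0])    # assumed mean of group I (hotel guests)
mu_II = np.array([50.0, 15.0])   # assumed mean of group II

def mahalanobis_sq(x, mu):
    """Squared Mahalanobis distance of x from mu for the common Sigma."""
    d = x - mu
    return float(d @ Sigma_inv @ d)

x = np.array([65.0, 35.0])       # new family: total spending, souvenir spending
d_I, d_II = mahalanobis_sq(x, mu_I), mahalanobis_sq(x, mu_II)
print("distances:", d_I, d_II, "-> assign to group", "I" if d_I <= d_II else "II")
```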

Several features - several groups - equal covariance matrices

More than two populations may underlie the analysis. Here too, analogously to the above, the object is assigned to the population for which the Mahalanobis distance of the feature vector x to the expectation vector is minimal.

Fisher's discriminant function

In practice it is laborious to determine the Mahalanobis distance for every feature vector to be classified. The assignment is made more simply by means of a linear discriminant function. Starting from the decision rule

$$(x-\mu_I)^\top\Sigma^{-1}(x-\mu_I) \le (x-\mu_{II})^\top\Sigma^{-1}(x-\mu_{II}) \;\Rightarrow\; \text{assign } x \text{ to group I},$$

one obtains, by rearranging this inequality, a decision rule in terms of the discriminant function f(x): assign x to group I if f(x) ≥ 0, otherwise to group II.

In the case of two groups and equal covariance matrices the discriminant function is

$$f(x) = (\mu_I - \mu_{II})^\top \Sigma^{-1}\left(x - \tfrac{1}{2}(\mu_I + \mu_{II})\right).$$

The same discriminant function also results from an empirical approach, in which the variance between the groups is maximized and the variance within the groups is minimized. This approach is called Fisher's discriminant analysis, because it was presented by R. A. Fisher in 1936.
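
The empirical construction can be sketched as follows; the toy data, random seed and variable names are illustrative assumptions and not part of the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two labelled toy samples (one per group); the data are purely illustrative.
X1 = rng.multivariate_normal([70, 30], [[25, 20], [20, 25]], size=50)
X2 = rng.multivariate_normal([50, 15], [[25, 20], [20, 25]], size=50)
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

# Pooled within-group scatter matrix S_W.
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Fisher's direction maximizes the between-group variance relative to the
# within-group variance; it is proportional to S_W^{-1} (m1 - m2).
w = np.linalg.solve(S_W, m1 - m2)

def f(x):
    """Linear discriminant function; f(x) >= 0 -> group I, else group II."""
    return float(w @ (x - 0.5 * (m1 + m2)))

print(f(np.array([65.0, 35.0])))  # positive value -> assign to group I
```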

Bayesian discriminant

So far it was assumed that the groups are of equal size in the population, but this is usually not the case. Membership of a group can instead be viewed as random; the probability that an object belongs to a group is called its a-priori probability. The linear discriminant rule for several groups is based on the assumption that the feature vector is multivariate normally distributed in each group, with group-specific mean vector and a covariance matrix that is the same in all groups. Ignoring misclassification costs, Bayes' rule for linear discriminant analysis (LDA) then assigns x to the group g for which

$$\delta_g(x) = x^\top \Sigma^{-1}\mu_g - \tfrac{1}{2}\,\mu_g^\top \Sigma^{-1}\mu_g + \ln \pi_g$$

is maximal, where πg is the a-priori probability of group g. If misclassification costs c(i|j) are taken into account, where c(i|j) denotes the cost that arises when an object belonging to group i is erroneously assigned to group j, the rule minimizes the expected misclassification cost instead.

If in the above model one no longer assumes that the covariance matrices are identical across the groups but allows them to differ, i.e. Σg depends on the group, one obtains Bayes' rule for quadratic discriminant analysis (QDA): assign x to the group g for which

$$\delta_g(x) = -\tfrac{1}{2}\ln\lvert\Sigma_g\rvert - \tfrac{1}{2}(x-\mu_g)^\top \Sigma_g^{-1}(x-\mu_g) + \ln \pi_g$$

is maximal. The resulting decision boundaries are linear in x for linear discriminant analysis and quadratic in x for quadratic discriminant analysis.
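
A hedged sketch of the two variants using scikit-learn; the synthetic data, the random seed and the chosen prior probabilities are assumptions made only for illustration:

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

rng = np.random.default_rng(1)

# Toy data: the second group gets a different covariance structure, so the
# Bayes-optimal boundary between the groups is quadratic rather than linear.
X = np.vstack([
    rng.multivariate_normal([70, 30], [[25, 20], [20, 25]], size=200),
    rng.multivariate_normal([50, 15], [[60, -10], [-10, 15]], size=200),
])
y = np.array([0] * 200 + [1] * 200)

# Unequal a-priori probabilities are passed via the priors argument.
lda = LinearDiscriminantAnalysis(priors=[0.3, 0.7]).fit(X, y)
qda = QuadraticDiscriminantAnalysis(priors=[0.3, 0.7]).fit(X, y)

x_new = np.array([[65.0, 35.0]])
print("LDA:", lda.predict(x_new), lda.predict_proba(x_new))
print("QDA:", qda.predict(x_new), qda.predict_proba(x_new))
```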

See also: Bayesian classifier

Classification with unknown distribution parameters

In most cases the distributions of the underlying features are unknown, so the parameters have to be estimated. For two groups one takes a so-called learning (training) sample consisting of nI and nII observations, respectively. From these data the mean vectors μk and the covariance matrices Σk (k = I, II) can be estimated. As above, the Mahalanobis distance or the discriminant function is then used with the estimated parameters in place of the true ones.

If one starts from the standard model with equal covariance matrices in the groups, the equality of the covariance matrices must first be confirmed, for example with Box's M test.

Example

Continuation of the amusement park example from above:

The population parameters are now unknown. In each group 16 families were examined, and the sample yielded the following values:

From these, the mean vector of each group, the overall mean, the group covariance matrices and the pooled (combined) covariance matrix are calculated as follows:

According to the formula above, this yields the discriminant function

The classification rule is now:

To check the quality of the model, the values of the learning sample can be classified with this rule. The result here is the following classification (confusion) matrix:

Now back to the family with the observations (65, 35) that is to be classified.

The following graph shows the scatter plot of the learning sample together with the group means. The green dot marks the location of the object (65, 35).

Already from the graph one can see that this object belongs to group I. Evaluating the discriminant function at x = (65, 35)ᵀ and comparing the result with the classification threshold confirms this, so the object is assigned to group I.
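
The complete empirical workflow of this section can be sketched in Python. Since the 16 families per group from the article's table are not reproduced in the text, the training data below are placeholders generated only to make the estimation steps runnable; the observation (65, 35) is the one from the example:

```python
import numpy as np

rng = np.random.default_rng(2)

# Placeholder learning sample: 16 families per group, since the actual table
# of observed values is not reproduced in the text.
X_I = rng.multivariate_normal([70, 30], [[25, 20], [20, 25]], size=16)
X_II = rng.multivariate_normal([50, 15], [[25, 20], [20, 25]], size=16)
n_I, n_II = len(X_I), len(X_II)

# Group means and pooled (combined) covariance matrix.
m_I, m_II = X_I.mean(axis=0), X_II.mean(axis=0)
S_pooled = ((n_I - 1) * np.cov(X_I, rowvar=False)
            + (n_II - 1) * np.cov(X_II, rowvar=False)) / (n_I + n_II - 2)

# Estimated discriminant function (same formula as above, with estimates).
w = np.linalg.solve(S_pooled, m_I - m_II)
def f(x):
    return float(w @ (x - 0.5 * (m_I + m_II)))

# Classification matrix: reclassify the learning sample with the fitted rule.
to_I_from_I = sum(f(x) >= 0 for x in X_I)
to_I_from_II = sum(f(x) >= 0 for x in X_II)
print("group I :", to_I_from_I, "classified as I,", n_I - to_I_from_I, "as II")
print("group II:", to_I_from_II, "classified as I,", n_II - to_I_from_II, "as II")

# Classify the new family (65, 35).
fx = f(np.array([65.0, 35.0]))
print("f(65, 35) =", round(fx, 2), "-> group", "I" if fx >= 0 else "II")
```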

Related topics

  • Wilks' lambda
  • Flexible discriminant analysis
  • Kernel density estimator
  • Support vector machine