Spearman's rank correlation coefficient

A rank correlation coefficient is a parameter-free measure of correlation: it measures how well an arbitrary monotonic function can describe the relationship between two variables, without making any assumptions about the probability distribution of the variables.

Unlike Pearson's correlation coefficient, it does not require the assumption that the relationship between the variables is linear. Rank correlation coefficients are also robust against outliers.

There are two well-known rank correlation coefficients: Spearman's rank correlation coefficient (Spearman's rho) and Kendall's tau. To determine the agreement between several observers (interrater reliability) on an ordinal scale, however, one uses Kendall's coefficient of concordance W, which is related to the rank correlation coefficients.

Concept

We start with $n$ pairs of measurements $(x_i, y_i)$. The idea of non-parametric correlation is to replace the value of each $x_i$ by its rank relative to all other $x_j$ in the measurement, i.e. by one of the numbers $1, 2, \dots, n$. After this operation, the values come from a well-known distribution, namely a uniform distribution of the integers from 1 to $n$. If the $x_i$ are all different, each number occurs exactly once. If some $x_i$ have identical values, each of them is assigned the average of the ranks they would have received had they been slightly different. In this case one speaks of ties. This averaged rank is sometimes an integer and sometimes a "half" rank. In all cases, the sum of all assigned ranks equals the sum of all integers from 1 to $n$, namely $n(n+1)/2$.

Exactly the same procedure is then carried out with the $y_i$: each value is replaced by its rank among all $y_j$.
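The ranking step with averaged ranks for ties can be sketched as follows. This is a minimal illustration in plain Python; the function name `average_ranks` and the sample values are assumptions made for this sketch and do not come from the text.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    # Indices of the values in ascending order of value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # mean of positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


sample = [10, 30, 20, 30]  # hypothetical values with one tie
ranks = average_ranks(sample)
print(ranks)       # [1.0, 3.5, 2.0, 3.5]
print(sum(ranks))  # 10.0, i.e. n(n+1)/2 for n = 4
```

The two tied values share the "half" rank 3.5, and the rank sum is unchanged by the tie.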

Replacing interval-scaled values by their ranks loses information. Applying the method to interval-scaled data can nevertheless be useful, because a non-parametric correlation is more robust than the linear correlation: it is more resistant to unplanned errors and outliers in the data, just as the median is more robust than the mean. If the data are available only as rankings, i.e. at the ordinal level, there is in any case no alternative to rank correlations.

Spearman's rank correlation coefficient

Spearman's rank correlation coefficient is named after Charles Spearman and is often denoted by the Greek letter ρ (rho) or, to distinguish it from the Pearson product-moment correlation coefficient, by $r_s$.

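ρ is essentially a special case of Pearson's product-moment correlation coefficient in which the data are converted into ranks before the correlation coefficient is calculated:

$$\rho = \frac{\sum_i \left(\operatorname{rg}(x_i) - \overline{\operatorname{rg}}_x\right)\left(\operatorname{rg}(y_i) - \overline{\operatorname{rg}}_y\right)}{\sqrt{\sum_i \left(\operatorname{rg}(x_i) - \overline{\operatorname{rg}}_x\right)^2}\,\sqrt{\sum_i \left(\operatorname{rg}(y_i) - \overline{\operatorname{rg}}_y\right)^2}} = \frac{\operatorname{Cov}(\operatorname{rg}_x, \operatorname{rg}_y)}{s_{\operatorname{rg}_x}\, s_{\operatorname{rg}_y}}$$

Here

  • $\operatorname{rg}(x_i)$ is the rank of $x_i$,
  • $\overline{\operatorname{rg}}_x$ is the mean of the ranks of $x$,
  • $s_{\operatorname{rg}_x}$ is the standard deviation of the ranks of $x$, and
  • $\operatorname{Cov}(\operatorname{rg}_x, \operatorname{rg}_y)$ is the covariance of the ranks of $x$ and $y$.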

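In practice, a simpler formula is usually used to calculate ρ, but it is correct only if every rank occurs exactly once.

The raw data are converted to ranks, and the difference $d_i$ between the ranks of the two variables is calculated for each observation. ρ is then given by:

$$\rho = 1 - \frac{6 \sum_i d_i^2}{n\,(n^2 - 1)}$$

with $d_i = \operatorname{rg}(x_i) - \operatorname{rg}(y_i)$, the difference between the ranks of $x_i$ and $y_i$, and $n$ the number of pairs of values.

If all ranks are different, this simple formula gives exactly the same result as the Pearson correlation of the ranks.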
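That the two ways of computing ρ agree for tie-free data can be checked with a short sketch. The helper names and the data below are hypothetical and chosen only for illustration; the ranking helper assumes that all values are distinct.

```python
from math import sqrt


def ranks_no_ties(values):
    """Ranks 1..n, valid only if all values are distinct (assumption)."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def pearson(u, v):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    ss_u = sum((a - mean_u) ** 2 for a in u)
    ss_v = sum((b - mean_v) ** 2 for b in v)
    return cov / sqrt(ss_u * ss_v)


def spearman_simple(x, y):
    """rho = 1 - 6 * sum(d_i^2) / (n (n^2 - 1)), for data without ties."""
    rx, ry = ranks_no_ties(x), ranks_no_ties(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


x = [3.1, 1.2, 4.8, 2.5, 5.0]  # hypothetical data without ties
y = [2.0, 1.0, 3.5, 4.0, 5.5]
rho_pearson = pearson(ranks_no_ties(x), ranks_no_ties(y))
rho_simple = spearman_simple(x, y)
print(round(rho_pearson, 6), round(rho_simple, 6))  # 0.7 0.7
```

For this data the rank differences are 1, 0, 1, -2 and 0, so the sum of squared differences is 6 and both routes give 0.7.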

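With ties

The formula becomes a little more complicated when identical values of $x$ or $y$ (i.e. ties) exist, but as long as not too many values are identical, the deviations from the simple formula are small:

$$\rho = \frac{n^3 - n - \tfrac{1}{2} T_x - \tfrac{1}{2} T_y - 6 \sum_i d_i^2}{\sqrt{(n^3 - n - T_x)\,(n^3 - n - T_y)}}$$

with $T_{x/y} = \sum_k \left(t_{x/y,k}^3 - t_{x/y,k}\right)$. Here $d_i$ is again the rank difference of observation $i$, $n$ is the number of observations, and $t_{x/y,k}$ is the number of observations sharing the same rank, where the subscript $x/y$ stands for either $x$ or $y$.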
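A tie-corrected calculation can be sketched as below. It implements the standard correction with the tie terms T = Σ (t³ − t) summed over groups of equal values; the function names and the small data set are hypothetical and only serve as an illustration.

```python
from collections import Counter
from math import sqrt


def average_ranks(values):
    """1-based ranks; equal values receive the mean of their positions."""
    counts = Counter(values)
    first_pos = {}
    for pos, v in enumerate(sorted(values), start=1):
        first_pos.setdefault(v, pos)  # first sorted position of each value
    return [first_pos[v] + (counts[v] - 1) / 2 for v in values]


def tie_term(values):
    """T = sum of t^3 - t over all groups of t equal values."""
    return sum(t ** 3 - t for t in Counter(values).values())


def spearman_tie_corrected(x, y):
    n = len(x)
    rx, ry = average_ranks(x), average_ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    t_x, t_y = tie_term(x), tie_term(y)
    m = n ** 3 - n
    return (m - t_x / 2 - t_y / 2 - 6 * d2) / sqrt((m - t_x) * (m - t_y))


x = [1, 2, 2, 4]  # hypothetical data, one tie in each series
y = [1, 3, 2, 3]
print(round(spearman_tie_corrected(x, y), 4))  # 0.8333
```

Here n³ − n = 60, both tie terms equal 6 and the sum of squared rank differences is 1.5, so ρ = 45/54. Without ties both tie terms vanish and the expression reduces to the simple formula.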

Examples

Example 1

As an example, the height and weight of different people are to be investigated. The pairs of measured values are 175 cm, 178 cm and 190 cm, and 65 kg, 70 kg and 98 kg.

In this example the rank correlation is maximal: the series of heights is ordered by rank, and the rank order of the heights is also the rank order of the weights. A low rank correlation exists when, for example, the height increases along the data series while the weight decreases. In that case one cannot say "the heaviest person is the tallest". The rank correlation coefficient is the numerical expression of the relationship between two rank orders.
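The example can be recomputed with the simple formula for tie-free data, as a minimal sketch. The reversed weight series is an assumption added here to illustrate the opposite extreme; it is not part of the example.

```python
def ranks_no_ties(values):
    """Ranks 1..n, valid only if all values are distinct."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman_simple(x, y):
    """rho = 1 - 6 * sum(d_i^2) / (n (n^2 - 1))."""
    rx, ry = ranks_no_ties(x), ranks_no_ties(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


heights_cm = [175, 178, 190]
weights_kg = [65, 70, 98]

print(spearman_simple(heights_cm, weights_kg))        # 1.0
# Hypothetical counter-case: the tallest person is the lightest.
print(spearman_simple(heights_cm, weights_kg[::-1]))  # -1.0
```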

Example 2

Given are eight observations of two variables a and b:

To determine the ranks of the observations of b, proceed as follows: first sort the values, then assign the ranks (i.e. number the values consecutively) and normalize them, i.e. give equal values the average of their ranks. Finally, restore the input order so that the rank differences can be formed.

From the two data series a and b, the following intermediate calculation results:

The table is sorted by the variable a. Note that several values can share a rank. The series a contains the value "3" twice, and each occurrence receives the "average" rank (2 + 3) / 2 = 2.5. The same happens in the series b.

The smaller the sum of squared rank differences, the greater Spearman's rank correlation coefficient. Significance is determined by comparing the result with tabulated critical values.

Example 2 with Horn's correction

And there is

Determining the significance

The modern approach to testing whether the observed value of ρ differs significantly from zero is a permutation test. One calculates the probability that, under the null hypothesis, ρ is greater than or equal to the observed ρ.

This approach is superior to traditional methods provided that the data set is small enough for all necessary permutations to be generated, and provided that it is clear how to generate meaningful permutations under the null hypothesis for the given application (which is usually quite simple).
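An exact permutation test for small samples can be sketched as follows. The sketch assumes tie-free ranks, a one-sided alternative and full enumeration of all n! permutations; the function names are hypothetical.

```python
from itertools import permutations


def spearman_from_ranks(rx, ry):
    """Simple formula, valid for tie-free ranks 1..n."""
    n = len(rx)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


def exact_p_value(rx, ry):
    """P(rho >= observed rho) when every ordering of ry is equally likely."""
    observed = spearman_from_ranks(rx, ry)
    hits = total = 0
    for perm in permutations(ry):
        total += 1
        # Small tolerance guards against floating-point rounding.
        if spearman_from_ranks(rx, perm) >= observed - 1e-12:
            hits += 1
    return hits / total


# Perfect agreement among 4 ranks: only 1 of the 24 permutations reaches rho = 1.
print(exact_p_value([1, 2, 3, 4], [1, 2, 3, 4]))  # 0.041666666666666664
```

Full enumeration grows factorially with n, which is why the approach is limited to small data sets unless the permutations are sampled at random instead.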

Kendall's Tau

In contrast to Spearman's ρ, Kendall's τ uses only the direction of the difference between ranks (which of two ranks is larger) and not the size of the difference. In general, the value of Kendall's τ is somewhat smaller than the value of Spearman's ρ. τ also proves helpful for interval-scaled data when the data are not normally distributed, when the scales have unequal divisions, or when the sample sizes are very small.

Calculation

To calculate τ, we consider pairs of observations $(x_i, y_i)$ and $(x_j, y_j)$, sorted by $x$, with $i = 1, \dots, n$ and $j = 1, \dots, n$. Thus:

$$x_1 \le x_2 \le \dots \le x_n$$

Pair 1 is then compared with all following pairs ($2, \dots, n$), pair 2 with all following pairs ($3, \dots, n$), and so on. In total, $n(n-1)/2$ pairwise comparisons are performed. If for a pair

  • $x_i < x_j$ and $y_i < y_j$, it is called concordant,
  • $x_i < x_j$ and $y_i > y_j$, it is called discordant,
  • $x_i \ne x_j$ and $y_i = y_j$, it is a tie in $Y$,
  • $x_i = x_j$ and $y_i \ne y_j$, it is a tie in $X$, and
  • $x_i = x_j$ and $y_i = y_j$, it is a tie in $X$ and $Y$.

The number of pairs that

  • are concordant is denoted by $C$,
  • are discordant is denoted by $D$,
  • are ties in $Y$ is denoted by $T_Y$,
  • are ties in $X$ is denoted by $T_X$, and
  • are ties in $X$ and $Y$ is denoted by $T_{XY}$.

Kendall's τ compares the number of concordant and discordant pairs:

$$\tau = \frac{C - D}{\sqrt{(C + D + T_X)\,(C + D + T_Y)}}$$

If Kendall's τ is positive, there are more concordant than discordant pairs, i.e. it is likely that if $x_i \le x_j$, then $y_i \le y_j$ also holds. If Kendall's τ is negative, there are more discordant than concordant pairs, i.e. it is likely that if $x_i \le x_j$, then $y_i \ge y_j$ holds. The denominator normalizes Kendall's τ so that:

$$-1 \le \tau \le 1$$
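The pairwise comparison can be sketched directly in code. The sketch counts concordant pairs, discordant pairs and the three kinds of ties, and then forms τ as (C − D) / sqrt((C + D + T_X)(C + D + T_Y)). The variable names `c`, `d`, `t_x`, `t_y`, `t_xy` and the data are chosen for this illustration. Sorting by x is not needed here because the sign of the product of differences is used.

```python
from math import sqrt


def kendall_counts(x, y):
    """Classify all n(n-1)/2 pairs of observations."""
    c = d = t_x = t_y = t_xy = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                t_xy += 1      # tie in X and Y
            elif dx == 0:
                t_x += 1       # tie in X only
            elif dy == 0:
                t_y += 1       # tie in Y only
            elif dx * dy > 0:
                c += 1         # concordant: same ordering in x and y
            else:
                d += 1         # discordant: opposite ordering
    return c, d, t_x, t_y, t_xy


def kendall_tau(x, y):
    c, d, t_x, t_y, _ = kendall_counts(x, y)
    return (c - d) / sqrt((c + d + t_x) * (c + d + t_y))


x = [1, 2, 2, 3]  # hypothetical data with one tie in each series
y = [1, 2, 3, 3]
print(kendall_counts(x, y))  # (4, 0, 1, 1, 0)
print(kendall_tau(x, y))     # 0.8
```

The double loop is quadratic in n, which is acceptable for a sketch; faster algorithms exist for large samples.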

Test of Kendall's tau

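Considering τ as a random variable, Kendall found that for the test

$$H_0\colon \tau = 0 \quad \text{vs.} \quad H_1\colon \tau \ne 0$$

it is approximately normally distributed under the null hypothesis:

$$\tau \sim N\!\left(0;\ \frac{4n + 10}{9n^2 - 9n}\right)$$

In addition to this approximate test, an exact permutation test can also be performed.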
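The approximate test can be sketched with the standard normal approximation, in which τ has mean 0 and variance (4n + 10) / (9n² − 9n) under the null hypothesis. The function name and the numbers in the usage line are hypothetical.

```python
from math import erfc, sqrt


def kendall_z_test(tau, n):
    """Two-sided normal-approximation test of H0: tau = 0."""
    variance = (4 * n + 10) / (9 * n * (n - 1))
    z = tau / sqrt(variance)
    # Two-sided p-value of a standard normal: P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    p = erfc(abs(z) / sqrt(2))
    return z, p


z, p = kendall_z_test(tau=0.6, n=10)  # hypothetical observed tau and sample size
print(round(z, 2), round(p, 3))
```

For very small n, or when there are many ties, the approximation becomes unreliable and an exact permutation test is the safer choice.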

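Further τ coefficients

With the above definitions ($C$ and $D$ the numbers of concordant and discordant pairs, $T_X$ and $T_Y$ the numbers of pairs tied only in $X$ or only in $Y$), Kendall defined a total of three τ coefficients:

$$\tau_a = \frac{C - D}{n(n-1)/2}$$

$$\tau_b = \frac{C - D}{\sqrt{(C + D + T_X)\,(C + D + T_Y)}}$$

$$\tau_c = \frac{2m\,(C - D)}{(m - 1)\,n^2}$$

where $m$ is the smaller of the number of rows and the number of columns of the contingency table.

Kendall's $\tau_a$ can be applied only to data without ties. On non-square contingency tables, Kendall's $\tau_b$ does not reach the extreme values $+1$ and $-1$, and since $T_{XY}$ does not enter the formula, it takes no account of ties in both $X$ and $Y$. For fourfold (2×2) tables, $\tau_b$ is identical to the fourfold coefficient $\Phi$ (phi) and, if the values of the two dichotomous variables are each coded as 0 and 1, also to Pearson's correlation coefficient.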
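The difference between the variants shows up as soon as ties occur. The sketch below evaluates the standard definitions τ_a = (C − D) / (n(n − 1)/2) and τ_c = 2m(C − D) / ((m − 1)n²), where m is the smaller table dimension. The function names and the example table are assumptions made for illustration.

```python
def tau_a(c, d, n):
    """(C - D) divided by the total number of pairs n(n-1)/2."""
    return (c - d) / (n * (n - 1) / 2)


def tau_c(c, d, n, m):
    """Variant for contingency tables; m = min(number of rows, number of columns)."""
    return 2 * m * (c - d) / ((m - 1) * n * n)


# Hypothetical 2x2 table [[2, 0], [0, 2]]: n = 4 observations, all on the diagonal.
# Concordant pairs: 2 * 2 = 4, discordant pairs: 0, and 2 pairs tied in X and Y.
print(tau_c(c=4, d=0, n=4, m=2))  # 1.0
print(tau_a(c=4, d=0, n=4))       # 0.666..., the ties keep tau_a below 1
```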

Tetrachoric and polychoric correlation

In connection with Likert scales, the tetrachoric correlation (for two binary variables) or the polychoric correlation is often calculated. The assumption is that, for example, for a question with the response format (does not apply at all, ..., applies completely), the respondents would actually have answered on a metric scale but had to choose one of the alternatives offered by the response format.

That is, behind the observed ordinal variables there are unobserved interval-scaled variables. The correlation between the unobserved variables is called the tetrachoric or polychoric correlation.

The use of the tetrachoric or polychoric correlation with Likert scales is recommended when the number of categories of the observed variables is less than seven. In practice, the Bravais-Pearson correlation coefficient is often used instead, but it can be shown that this underestimates the true correlation.

Estimation methods for the tetrachoric or polychoric correlation

Assuming that the unobserved variables are pairwise bivariate normally distributed, the correlation between the unobserved variables can be estimated with the maximum likelihood method. There are two methods:

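Approximation formula for the tetrachoric correlation

For two binary variables, an approximation formula for the tetrachoric correlation can be given using the cell frequencies of the 2×2 cross table, with $a$ and $d$ on the main diagonal and $b$ and $c$ off the diagonal:

$$r_{tet} = \cos\!\left(\frac{\pi}{1 + \sqrt{\dfrac{a\,d}{b\,c}}}\right)$$

A correlation of $r_{tet} = -1$ exists if and only if $a \cdot d = 0$. Correspondingly, a correlation of $r_{tet} = +1$ exists if and only if $b \cdot c = 0$.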
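The cosine approximation r = cos(π / (1 + sqrt(ad / (bc)))) can be sketched as follows. The cell layout is an assumption of this sketch: `a` and `d` are the counts on the main diagonal of the 2×2 table, `b` and `c` the counts off the diagonal. The function name and the example counts are hypothetical.

```python
from math import cos, pi, sqrt


def tetrachoric_approx(a, b, c, d):
    """Cosine approximation of the tetrachoric correlation for a 2x2 table."""
    if a * d == 0 and b * c == 0:
        raise ValueError("approximation undefined when a*d and b*c are both zero")
    if b * c == 0:
        return 1.0  # limit of the formula as a*d / (b*c) grows without bound
    return cos(pi / (1 + sqrt(a * d / (b * c))))


print(tetrachoric_approx(40, 10, 10, 40))  # positive association
print(tetrachoric_approx(0, 10, 10, 5))    # -1.0, because a * d = 0
print(tetrachoric_approx(10, 0, 5, 10))    # 1.0, because b * c = 0
```

With equal products ad = bc, the argument of the cosine is π/2 and the approximation gives zero up to rounding.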
