Cohen's kappa

Cohen's kappa is a statistical measure of the inter-rater reliability of the ratings of (usually) two raters, proposed by Jacob Cohen in 1960. The measure can also be used to assess intra-rater reliability, i.e. the agreement of the same observer at two different points in time, applying the same method of measurement.

The equation for Cohen's kappa is

\kappa = \frac{p_0 - p_c}{1 - p_c},

where p_0 is the measured agreement of the two raters and p_c is the agreement expected by chance. If the raters agree in all their judgments, \kappa = 1. If only as much agreement between the two raters is found as corresponds mathematically to the degree of chance, \kappa takes on a value of 0. (Negative values, however, indicate agreement that is even smaller than a chance agreement.)
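
As a simple illustration of this formula, here is a minimal Python sketch with made-up values for p_0 and p_c:

    def cohens_kappa(p0: float, pc: float) -> float:
        """Cohen's kappa from observed agreement p0 and chance agreement pc."""
        return (p0 - pc) / (1.0 - pc)

    print(cohens_kappa(0.70, 0.50))   # (0.70 - 0.50) / 0.50 = 0.40
    print(cohens_kappa(1.00, 0.50))   # complete agreement -> 1.0
    print(cohens_kappa(0.50, 0.50))   # agreement only at chance level -> 0.0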

Greve and Wentura (1997, p. 111) report differing assessments of which values of \kappa are acceptable. In summary, values from 0.40 to 0.60 may still be acceptable, but values below 0.40 should be viewed with some skepticism. Inter-rater reliabilities of \kappa \geq 0.75 appear good to excellent.

Landis and Koch (1977), however, suggest the following interpretation: < 0 = "poor agreement", 0.00–0.20 = "slight agreement", 0.21–0.40 = "fair agreement", 0.41–0.60 = "moderate agreement", 0.61–0.80 = "substantial agreement", 0.81–1.00 = "(almost) perfect agreement".

A particular problem with the coefficient is that its maximum value is not automatically 1.00 in every case (see below).

Nominal scales, two raters

If only agreements and disagreements between the two raters are examined, all rating differences that occur carry equal weight. This is particularly suitable for nominal scales. The data (i.e. the judgment frequencies) for an item or characteristic with k (nominal) categories, rated by two raters, are entered into a k × k contingency table (i.e. k rows and k columns).

The proportion of agreeing ratings of the two raters (= main diagonal of the contingency table) is then

p_0 = \frac{1}{N} \sum_{i=1}^{k} f_{ii},

where N corresponds to the total number of rated objects (persons / items / objects) and f_{ii} are the diagonal cell frequencies.

For the expected agreement, the products of the marginal sums (= row sum × column sum) are summed over the categories and finally divided by the square of the total sum:

p_c = \frac{1}{N^2} \sum_{i=1}^{k} f_{i \cdot} \, f_{\cdot i}.
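
A minimal Python sketch of these two quantities for a hypothetical 3 × 3 contingency table (the frequencies are invented for illustration):

    # rows: rater A's category, columns: rater B's category
    table = [
        [20,  5,  0],
        [ 3, 15,  2],
        [ 0,  4, 11],
    ]
    k = len(table)
    N = sum(map(sum, table))                                      # 60 rated objects
    row = [sum(r) for r in table]                                 # marginal sums of rater A
    col = [sum(table[i][j] for i in range(k)) for j in range(k)]  # marginal sums of rater B

    p0 = sum(table[i][i] for i in range(k)) / N                   # observed agreement, 46/60
    pc = sum(row[i] * col[i] for i in range(k)) / N ** 2          # chance agreement, 1250/3600
    kappa = (p0 - pc) / (1 - pc)                                  # approximately 0.64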

Scott (1955) proposed, for his coefficient \pi (which is calculated by the same basic formula as \kappa above), to determine the expected agreement as follows:

p_c = \sum_{i=1}^{k} \left( \frac{f_{i \cdot} + f_{\cdot i}}{2N} \right)^2.

If the marginal distributions of the two raters differ, Scott's expected agreement is always larger than Cohen's, so that \pi is smaller than \kappa.
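
Continuing the illustration, a short self-contained sketch of Scott's \pi for the same hypothetical table as above (marginal proportions copied from that sketch):

    row_p = [25 / 60, 20 / 60, 15 / 60]   # rater A's marginal proportions (table above)
    col_p = [23 / 60, 24 / 60, 13 / 60]   # rater B's marginal proportions
    p0 = 46 / 60                          # observed agreement, as above

    pc_scott = sum(((r + c) / 2) ** 2 for r, c in zip(row_p, col_p))
    pi_scott = (p0 - pc_scott) / (1 - pc_scott)   # slightly smaller than kappa (~0.64)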

As soon as even a single cell outside the diagonal of the contingency table is occupied (i.e. rating differences occur), the maximum value of Cohen's kappa depends on the marginal distributions. It becomes smaller the further the marginal distributions are from a uniform distribution. Brennan and Prediger (1981) therefore propose a corrected kappa value \kappa_n, where k is the number of categories (i.e. of characteristic values) as defined above. It reads:

\kappa_n = \frac{p_0 - \frac{1}{k}}{1 - \frac{1}{k}}
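
A short sketch of this correction, reusing p_0 from the hypothetical table above:

    def kappa_n(p0: float, k: int) -> float:
        """Brennan/Prediger kappa_n: chance agreement is fixed at 1/k."""
        return (p0 - 1.0 / k) / (1.0 - 1.0 / k)

    print(kappa_n(46 / 60, 3))   # 0.65 for the hypothetical table above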

The extension of the formulas to more than two raters is problematic in principle. The extension of the statistic is also called Fleiss' kappa. The proportion of agreements that occurred can then be determined analogously, for example for three raters.

For the coefficient of Brennan and Prediger (1981), von Eye (2006, p. 15) suggests the following extension to m raters:

\kappa_n = \frac{\sum_{i=1}^{k} p_{ii \cdots i} - k \left( \frac{1}{k} \right)^m}{1 - k \left( \frac{1}{k} \right)^m},

where i is an index over the agreement cells (the diagonal of the m-dimensional contingency table).

If, as above, k denotes the number of categories, m the number of raters (= number of ratings per feature / item / person) and N the total number of rated objects (cases / persons / items / objects), then:

  • n_{ij} is the number of raters who assigned the i-th object to category j.
  • \sum_{i=1}^{N} n_{ij} is the sum of all ratings in category j over all cases.
  • p_j = \frac{1}{N m} \sum_{i=1}^{N} n_{ij} is the proportion of all ratings in category j among all N \cdot m ratings.

The extent of inter-rater agreement for the i-th case (= the i-th person / item / object) is then calculated as

P_i = \frac{1}{m(m-1)} \left( \sum_{j=1}^{k} n_{ij}^2 - m \right).

The mean of all P_i and the expected value for chance agreement enter into the formula for \kappa:

\bar{P} = \frac{1}{N} \sum_{i=1}^{N} P_i, \qquad \bar{P}_e = \sum_{j=1}^{k} p_j^2,

so that \kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}.
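
A compact Python sketch of this multi-rater computation; the counts matrix is invented, rows are cases, columns are categories, and each row sums to the number of raters m:

    def fleiss_kappa(counts):
        """Fleiss' kappa for counts[i][j] = number of raters assigning case i to category j."""
        N = len(counts)                     # number of cases
        k = len(counts[0])                  # number of categories
        m = sum(counts[0])                  # ratings per case (assumed equal for all cases)
        p = [sum(row[j] for row in counts) / (N * m) for j in range(k)]        # p_j
        P = [(sum(c * c for c in row) - m) / (m * (m - 1)) for row in counts]  # P_i per case
        P_bar = sum(P) / N                  # mean observed agreement
        P_e = sum(pj * pj for pj in p)      # expected chance agreement
        return (P_bar - P_e) / (1 - P_e)

    # invented data: 3 cases, 4 raters, 3 categories
    print(fleiss_kappa([[4, 0, 0], [2, 2, 0], [1, 1, 2]]))   # approximately 0.12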

Example

In the following calculation example (taken from the English-language article), m raters judge each of N cases on a scale of k categories.

The categories are found in the columns, the cases in the rows; the total number of ratings is N \cdot m.

For example, the first column yields \sum_i n_{i1} and p_1, and the second row yields P_2. From these, \bar{P} and \bar{P}_e, and thus \kappa, are obtained. (That the resulting values are so similar is a coincidence.)

Multiple gradations of the measurement objects, two raters

If, however, the raters are asked to grade the objects on several levels (i.e. instead of k nominal categories there are now graded levels of a characteristic, and at least ordinal scale level may be assumed for these gradations), larger deviations between the raters should carry more weight than smaller deviations. In this case a weighted kappa should be calculated, in which a weighting factor w_{ij} is defined for each cell ij of the contingency table. This factor could be based, for example, on how large the deviation from the main diagonal is (e.g. as squared deviations: main-diagonal cells = 0, deviations by 1 category = 1, deviations by 2 categories = 4, etc.). This (weighted) kappa is then (cf. Bortz 1999):

\kappa_w = 1 - \frac{\sum_{i,j} w_{ij} \, f_{ij}^{\text{observed}}}{\sum_{i,j} w_{ij} \, f_{ij}^{\text{expected}}}, \qquad \text{with } f_{ij}^{\text{expected}} = \frac{f_{i \cdot} \, f_{\cdot j}}{N}.
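
A sketch of this weighted kappa with squared disagreement weights, using an invented 3 × 3 contingency table:

    table = [
        [11, 3, 1],
        [ 2, 9, 4],
        [ 0, 2, 8],
    ]
    k = len(table)
    N = sum(map(sum, table))
    row = [sum(r) for r in table]
    col = [sum(table[i][j] for i in range(k)) for j in range(k)]

    w = [[(i - j) ** 2 for j in range(k)] for i in range(k)]   # squared deviation weights
    obs = sum(w[i][j] * table[i][j] for i in range(k) for j in range(k))          # observed disagreement
    exp = sum(w[i][j] * row[i] * col[j] / N for i in range(k) for j in range(k))  # expected disagreement
    kappa_w = 1 - obs / exp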

Alternatives to this are the Spearman rank correlation coefficient, Kendall's tau, and Kendall's coefficient of concordance W.

Cardinal-scale kappa

This weighting idea can be taken further still: at interval scale level, the extent of the difference (or similarity) between the given ratings is even directly quantifiable (Cohen 1968, 1972). The weighting values for each cell of the contingency table are then based on the maximum and the minimum difference.

For cardinal-scale kappa, identical ratings (or the minimum difference between the observers) receive the standardized weight 0 and the maximum observer difference the weight 1, with the other observed differences weighted in their respective proportion to these extremes; the weighted kappa formula given above is then applied. For the [0,1] standardization of the weights:

w_{ij} = \frac{|x_i - x_j| - d_{\min}}{d_{\max} - d_{\min}},

where x_i and x_j are the scale values assigned by the two raters and d_{\min} and d_{\max} are the minimum and maximum observer differences.
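
A small sketch of this [0,1] standardization of the weights, with invented scale values for the categories:

    x = [1.0, 2.0, 4.0, 7.0]                       # scale values of the categories (invented)
    diffs = [abs(a - b) for a in x for b in x]
    d_min, d_max = min(diffs), max(diffs)          # minimum and maximum rater difference
    w = [[(abs(a - b) - d_min) / (d_max - d_min) for b in x] for a in x]
    # identical ratings get weight 0, the maximal difference gets weight 1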
