Negative predictive value

In a classification task, a classifier assigns objects to different classes on the basis of specific features. The classifier generally makes errors, assigning some objects to an incorrect class. From the relative frequency of these errors, quantitative measures for assessing the quality of a classifier can be derived.

Often the classification is binary in nature, i.e. there are only two possible classes. The quality measures discussed here refer only to this case. Such binary classifications are often formulated as yes/no questions: does a patient have a particular disease or not? Has a fire broken out or not? Is an enemy aircraft approaching or not? For classifications of this kind there are two possible types of error: an object is assigned to the first class although it belongs to the second, or vice versa. The measures described here then offer a way to assess the reliability of the associated classifier (diagnostic procedure, fire alarm, air surveillance radar).

Yes/no classifications are similar to statistical tests, in which a decision is made between a null hypothesis and an alternative hypothesis.

Confusion matrix: Correct and incorrect classifications

To evaluate a classifier, one must apply it to a number of cases for which the "true" class of the objects is known, at least in retrospect. An example is a medical laboratory test that is used to determine whether a person has a certain disease. Later, more elaborate examinations establish whether the person actually has this disease. The test represents a classifier that sorts people into the categories "sick" and "healthy". Since this is a yes/no question, one also says that the test turns out positive (classification "sick") or negative (classification "healthy"). To assess how well the laboratory test is suited to diagnosing the disease, the actual condition of each patient is compared with the test result. Four possible cases can occur:

  • The person is sick and the test is positive.
  • The person is healthy, but the test is positive.
  • The person is sick, but the test is negative.
  • The person is healthy and the test is negative.

In the first and last case the diagnosis is correct; in the other two cases there is an error. The four cases are named differently in different contexts; the English terms true positive (TP), false positive (FP), false negative (FN) and true negative (TN) are also common. In the context of signal detection theory, true positive cases are also referred to as a hit, false negative cases as a miss, and true negative cases as a correct rejection.

It is now counted how often each of the four possible combinations of test result (assigned class) and state of health (actual class) occurred. These frequencies are entered into a so-called confusion matrix (also called truth matrix):

This matrix is a simple special case of a contingency table with two nominal variables: the judgment of the classifier and the actual class. It can also be used for classifications with more than two classes; for N classes the 2 × 2 matrix then becomes an N × N matrix.
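
As an illustration, the following Python sketch tallies the four cells of such a confusion matrix from paired lists of actual and predicted labels; the data and variable names are invented for this example and are not part of the article.

```python
# Minimal sketch: tallying a 2x2 confusion matrix from paired binary labels
# (True = positive class). Illustrative data only.

def confusion_matrix(actual, predicted):
    """Return the counts (TP, FP, FN, TN)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)
    return tp, fp, fn, tn

# Example: a diagnostic test applied to six people
actual    = [True, True, False, False, True, False]   # actually sick?
predicted = [True, False, False, True, True, False]   # test positive?
print(confusion_matrix(actual, predicted))  # (2, 1, 1, 2)
```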

Statistical quality criteria of the classification

By calculating relative frequencies, various parameters for evaluating the classifier can now be computed from the values of the confusion matrix. These can also be interpreted as estimates of the conditional probabilities of the corresponding events. The measures differ in the population to which the relative frequencies refer: for example, one may consider only the cases in which the positive or the negative category is actually present (the sum of the entries in one column of the confusion matrix), or the set of all objects that are classified as positive or negative (the sum of the entries in one row of the matrix). This choice can have a serious impact on the computed values, especially when one of the two classes occurs much more frequently overall than the other. In the accompanying figures, the population considered in each case is marked in red and green, while objects in the grey areas do not enter into the calculation of the respective measure.

Sensitivity and false-negative rate

The sensitivity (also true-positive rate or hit rate; English: sensitivity, true positive rate, recall or hit rate) indicates the proportion of objects correctly classified as positive among all objects that are actually positive. For a medical diagnosis, for example, the sensitivity corresponds to the proportion of actually sick persons in whom the disease was also detected.

The sensitivity corresponds to the estimated conditional probability of a positive classification given that the object is actually positive; in terms of the cell counts it is TP / (TP + FN).

Correspondingly, the false-negative rate (English: false negative rate or miss rate) indicates the proportion of objects falsely classified as negative among those that are actually positive, i.e. in the example persons who are actually sick but diagnosed as healthy.

The false-negative rate corresponds to the estimated conditional probability of a negative classification given that the object is actually positive, i.e. FN / (TP + FN).

Since both measures refer to the case in which the positive category is actually present (first column of the confusion matrix), the sensitivity and the false-negative rate add up to 1, or 100%.

Specificity and false-positive rate

The specificity (also true-negative rate or characteristic property; English: specificity, true negative rate or correct rejection rate) indicates the proportion of objects correctly classified as negative among all objects that are actually negative. For example, the specificity of a medical diagnosis corresponds to the proportion of healthy persons for whom it was also determined that no disease is present.

The specificity corresponds to the estimated conditional probability of a negative classification given that the object is actually negative, i.e. TN / (TN + FP).

Correspondingly, the false-positive rate (also fall-out; English: fallout or false positive rate) indicates the proportion of objects falsely classified as positive among those that are actually negative. In the example, an actually healthy person would be wrongly diagnosed as sick. The false-positive rate thus specifies the probability of a false alarm.

The false-positive rate corresponds to the estimated conditional probability of a positive classification given that the object is actually negative, i.e. FP / (TN + FP).

Since both measures refer to the case in which the negative category is actually present (second column of the confusion matrix), the specificity and the false-positive rate add up to 1, or 100%.
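
As a small, purely illustrative sketch (the counts are arbitrary example values, not from the article), the column-based rates can be computed directly from the four cell counts:

```python
# Column-based rates of the confusion matrix (illustrative counts).
tp, fp, fn, tn = 40, 10, 5, 45

sensitivity = tp / (tp + fn)           # true-positive rate
false_negative_rate = fn / (tp + fn)   # miss rate
specificity = tn / (tn + fp)           # true-negative rate
false_positive_rate = fp / (tn + fp)   # fall-out

# Each pair refers to one column of the matrix and therefore sums to 1.
assert abs(sensitivity + false_negative_rate - 1.0) < 1e-12
assert abs(specificity + false_positive_rate - 1.0) < 1e-12
print(sensitivity, specificity)  # 0.888..., 0.818...
```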

Positive and negative predictive value

The positive predictive value (also relevance, effectiveness or accuracy; English: precision or positive predictive value, abbreviated PPV) is the proportion of results correctly recognized as positive among all results classified as positive (first row of the confusion matrix). For example, the positive predictive value of a medical diagnosis indicates how many of the persons in whom the disease was detected are actually sick.

The positive predictive value corresponds to the estimated conditional probability that an object classified as positive is actually positive, i.e. TP / (TP + FP).

Accordingly, the negative predictive value (also Segreganz or separation ability; English: negative predictive value, abbreviated NPV) is the proportion of results correctly recognized as negative among all results classified as negative (second row of the confusion matrix). In the example, this corresponds to the proportion of persons with a negative diagnosis who are actually healthy.

The negative predictive value corresponds to the estimated conditional probability that an object classified as negative is actually negative, i.e. TN / (TN + FN).

Unlike the other pairs of quality measures, the negative and the positive predictive value do not add up to 1, or 100%, since they refer to different populations (objects classified as positive and objects classified as negative, i.e. different rows of the confusion matrix).
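
Continuing the illustrative counts used above (a sketch, not part of the article), the row-based predictive values are:

```python
# Row-based predictive values (illustrative counts, repeated here so the
# snippet is self-contained).
tp, fp, fn, tn = 40, 10, 5, 45

positive_predictive_value = tp / (tp + fp)  # precision
negative_predictive_value = tn / (tn + fn)

# PPV and NPV refer to different rows of the matrix and therefore
# generally do not add up to 1.
print(positive_predictive_value, negative_predictive_value)  # 0.8, 0.9
```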

It should be noted that the positive and negative predictive value for a given population (e.g. the general population) are only meaningful if the frequency of positive cases, for instance the prevalence of the disease, is the same in the study group as in that population. Example: if 100 HIV patients and 100 healthy control subjects are tested to determine the positive predictive value, the proportion of HIV patients in this group (50%) is far removed from the actual prevalence of HIV in the general population (0.08%) (see also the numerical example below). Reporting predictive values that were obtained from such a selective collective is not permissible and is misleading. In such cases it is useful to report the likelihood ratio (LR) instead (positive LR = sensitivity / [1 - specificity]; negative LR = [1 - sensitivity] / specificity), not to be confused with the likelihood-ratio test.

Correct and incorrect classification rate

The false classification rate (also the size of the classification error) indicates the proportion of all objects that are misclassified. The remaining proportion corresponds to the correct classification rate (also confidence probability). In the diagnosis example, the false classification rate is the proportion of false-positive and false-negative diagnoses in the total number of diagnoses, while the correct classification rate is the proportion of true-positive and true-negative diagnoses.

The correct classification rate corresponds to the estimated probability of a correct classification: (TP + TN) / (TP + FP + FN + TN).

The false classification rate correspondingly is the estimated probability of a misclassification: (FP + FN) / (TP + FP + FN + TN).

The correct and the false classification rate accordingly add up to 1, or 100%.
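
A one-line check of this relationship with the same illustrative counts as above (a sketch, not from the article):

```python
# Overall correct and false classification rate (illustrative counts).
tp, fp, fn, tn = 40, 10, 5, 45
total = tp + fp + fn + tn

correct_classification_rate = (tp + tn) / total  # accuracy
false_classification_rate = (fp + fn) / total

assert abs(correct_classification_rate + false_classification_rate - 1.0) < 1e-12
print(correct_classification_rate)  # 0.85
```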

Combined measures

Since the various quality measures influence one another (see the section on problems), various combined measures have been proposed that allow the quality to be assessed with a single figure. The measures presented in the following were developed in the context of information retrieval (see the section on application in information retrieval).

The F-measure combines precision and hit rate (recall) by means of the weighted harmonic mean: F = 2 · precision · recall / (precision + recall).

Besides this measure, designated F1, in which precision and hit rate are weighted equally, there are also other weightings. The general case is the measure F_α (for positive values of α): F_α = (1 + α²) · precision · recall / (α² · precision + recall).

For example, F_2 weights the hit rate twice as strongly as the precision, and F_0.5 weights the precision twice as strongly as the hit rate.
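
The following sketch computes the F-measure in this general parameterisation (the numeric values are arbitrary illustrations):

```python
# F-measure as the weighted harmonic mean of precision and recall,
# in the general parameterisation F_alpha given above.

def f_measure(precision: float, recall: float, alpha: float = 1.0) -> float:
    """F_alpha = (1 + alpha^2) * P * R / (alpha^2 * P + R)."""
    return (1 + alpha**2) * precision * recall / (alpha**2 * precision + recall)

p, r = 0.5, 0.8  # illustrative values
print(f_measure(p, r))        # balanced F1
print(f_measure(p, r, 2.0))   # hit rate weighted more strongly
print(f_measure(p, r, 0.5))   # precision weighted more strongly
```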

The effectiveness measure E likewise corresponds to a weighted harmonic mean. It was introduced in 1979 by van Rijsbergen. The effectiveness lies between 0 (best effectiveness) and 1 (poor effectiveness). For one extreme value of its parameter, E is determined solely by the hit rate; for the other, solely by the precision.

Problems

Mutual influences

It is not possible to optimize all quality criteria independently of one another. In particular, the sensitivity and the specificity are negatively correlated with each other. To illustrate these relationships, it is helpful to consider the extreme cases:

  • If a diagnostic procedure classifies almost all patients as sick (liberal diagnosis), the sensitivity is maximal, because most sick persons are recognized as such. At the same time, however, the false-positive rate is also maximal, since almost all healthy persons are classified as sick. The diagnosis therefore has a very low specificity.
  • If, conversely, almost no one is classified as sick (conservative diagnosis), the specificity is maximal instead, but at the expense of a low sensitivity.

How conservative or liberal a classifier should optimally be depends on the specific application. Among other things, this determines which of the misclassifications has the more serious consequences. When diagnosing a serious disease or in safety-related applications such as a fire alarm, it is important that no case goes undetected. In a search with a search engine, on the other hand, it may be more important to receive as few results as possible that are irrelevant to the query, i.e. false-positive results. The risks of the various misclassifications can be specified in a cost matrix with which the confusion matrix is weighted when evaluating a classifier. Another possibility is to use combined measures, in which a corresponding weighting can be set.

To illustrate the effect of differently conservative tests for a concrete example, receiver operating characteristic curves (ROC curves) can be created, in which the sensitivity for different tests is plotted against the false-positive rate. In the context of signal detection theory, one also speaks of a criterion that is set more or less conservatively.
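
A minimal sketch of how such a curve can be obtained, assuming a classifier that outputs scores and a decision threshold that is swept from conservative to liberal (scores and labels are invented for illustration):

```python
import numpy as np

# Sweep a decision threshold over classifier scores and record the
# sensitivity (true-positive rate) against the false-positive rate.
scores = np.array([0.9, 0.8, 0.7, 0.55, 0.5, 0.4, 0.3, 0.2])
labels = np.array([1, 1, 0, 1, 0, 1, 0, 0])  # 1 = actually positive

points = []
for t in sorted(set(scores), reverse=True):   # high threshold = conservative
    pred = scores >= t
    tpr = (pred & (labels == 1)).sum() / (labels == 1).sum()
    fpr = (pred & (labels == 0)).sum() / (labels == 0).sum()
    points.append((fpr, tpr))

print(points)  # one (false-positive rate, sensitivity) point per threshold
```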

Rare positive cases

Furthermore, an extreme imbalance between actually positive and actually negative cases can distort the measures, as is the case with rare diseases. If, for example, the number of sick persons taking part in a study is much smaller than the number of healthy persons, this generally leads to a low positive predictive value (see the numerical example below). In this case the likelihood ratio should therefore be reported as an alternative to the predictive values.

This relationship has to be taken into account in various laboratory tests: inexpensive screening tests are adjusted so that as few false-negative results as possible occur. The false-positive test results produced in this way are then identified by a (more expensive) confirmatory test. For serious diseases a confirmatory test should always be carried out. This approach is even required for the detection of HIV.

Incomplete truth matrix

Another problem in assessing a classifier is that the entire truth matrix often cannot be filled in. In particular, the false-negative rate is frequently unknown, for example if no further tests are performed on patients who received a negative diagnosis and a disease thus remains undetected, or if an actually relevant document is not found in a search because it was not classified as relevant. In this case only the results classified as positive can be evaluated, i.e. only the positive predictive value can be calculated (see also the numerical example below). Possible solutions to this problem are discussed in the section on application in information retrieval.

Classification measures and statistical test theory

Classification measures for assessing the quality of statistical tests

The classification measures can be used to assess the quality of a statistical test:

  • If many samples are generated under the null hypothesis, the rate at which the alternative hypothesis is accepted should correspond to the type 1 error. For complicated tests, however, often only an upper bound for the type 1 error can be specified, so the "true" type 1 error can only be estimated with such a simulation; a minimal simulation sketch is given after this list.
  • If many samples are generated under the alternative hypothesis, the rate at which the alternative hypothesis is rejected is an estimate of the type 2 error. This is of interest, for example, when two tests are available for the same question: if the alternative hypothesis is true, the test with the smaller type 2 error is preferable.
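
The following Monte Carlo sketch illustrates the first point, assuming a two-sample t-test and normally distributed data; the sample sizes, significance level and distributions are arbitrary choices for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n_sim, rejections = 0.05, 10_000, 0

# Generate many samples under the null hypothesis (equal means) and count
# how often the test (wrongly) accepts the alternative hypothesis.
for _ in range(n_sim):
    x = rng.normal(0.0, 1.0, size=30)
    y = rng.normal(0.0, 1.0, size=30)
    if stats.ttest_ind(x, y).pvalue < alpha:
        rejections += 1

print(rejections / n_sim)  # should be close to the nominal type 1 error of 0.05
```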

Statistical tests for assessing a classification

Statistical tests can be used to check whether a classification is statistically significant, i.e. whether, with respect to the population, the classifier's judgment is independent of the actual classes (null hypothesis) or significantly correlated with them (alternative hypothesis).

In the case of several classes, the chi-square test of independence can be used for this purpose. It checks whether the classifier's judgment is independent of the actual classes or significantly correlated with them. The strength of the correlation is estimated with the contingency coefficient.

In the case of a binary classification, the four-field test is used, a special case of the chi-square test of independence. If only few observed values are available, Fisher's exact test should be used instead. The strength of the correlation can be estimated with the phi coefficient.
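
As a sketch of how these tests can be applied in practice, the following example uses SciPy on a hypothetical 2 × 2 confusion matrix (the counts are invented):

```python
from math import sqrt
from scipy.stats import chi2_contingency, fisher_exact

# Hypothetical 2x2 confusion matrix: rows = classifier judgment,
# columns = actual class.
table = [[40, 10],
         [5, 45]]

chi2, p_chi2, dof, expected = chi2_contingency(table, correction=False)
odds_ratio, p_fisher = fisher_exact(table)  # preferable for small counts

n = sum(sum(row) for row in table)
phi = sqrt(chi2 / n)  # strength of the association

print(p_chi2, p_fisher, phi)
```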

However, if the test rejects the null hypothesis, this does not mean that the classifier is good. It only means that it is better than (random) guessing. A good classifier should also exhibit a correlation that is as strong as possible.

Dietterich (1998) examines five tests for the direct comparison of the misclassification rates of two different classifiers: a simple two-sample t-test for independent samples, a two-sample t-test for paired samples, a two-sample t-test for paired samples with 10-fold cross-validation, the McNemar test, and a two-sample t-test for paired samples with 5-fold cross-validation and modified variance calculation (5x2cv). The investigation of the power and the type 1 error of the five tests concludes that the 5x2cv test behaves best, but is very computationally intensive. The McNemar test is slightly worse than the 5x2cv test, but considerably less computationally intensive.

Application in Information Retrieval

A special application of the measures described here is the assessment of the quality of the hit set of a search in information retrieval. Here the question is whether a found document, for example in web mining by search engines, is relevant according to a defined criterion. In this context the terms defined above are common under the names hit rate (English 'recall'), precision (English 'precision') and fallout (English 'fallout'). The hit rate is the proportion of relevant documents found in a search and thus describes the completeness of a search result. The precision describes the proportion of relevant documents in the result set, i.e. the accuracy of a search result. The (less common) fallout denotes the proportion of found irrelevant documents among all irrelevant documents; it thus indicates, in a negative way, how well irrelevant documents are avoided in the search result. Hit rate, precision and fallout can also be interpreted as probabilities:

  • The hit rate is the probability that a relevant document is found (sensitivity).
  • The precision is the probability that a found document is relevant (positive predictive value).
  • The fallout is the probability that an irrelevant document is found (false-positive rate).

A good search should find as many relevant documents as possible (true positives) and avoid finding non-relevant documents (true negatives). As described above, however, the various measures depend on each other. In general, the precision decreases as the hit rate increases (more irrelevant results); conversely, the hit rate decreases as the precision increases (fewer irrelevant results, but also more relevant documents that are not found). Depending on the application, different measures are therefore more or less relevant for the assessment. In a patent search, for example, it is important that no relevant patents remain undetected, so the negative predictive value should be as high as possible. In other searches it is more important that the hit list contains few irrelevant documents, i.e. the positive predictive value should be as high as possible.

The combined measures described above, such as the F-measure and the effectiveness, were introduced in the context of information retrieval.

Precision-recall diagram

To evaluate a retrieval procedure, recall and precision are usually considered together. For this purpose, precision is plotted on the ordinate and recall on the abscissa for different result-set sizes between the two extremes in the so-called precision-recall diagram (PR graph). This is particularly easy for procedures whose hit rate can be controlled by a single parameter. This diagram serves a similar purpose as the ROC curve described above, which in this context is also referred to as a recall-fallout diagram.

The (highest) value in the diagram at which the precision equals the recall, i.e. the intersection of the precision-recall curve with the identity line, is called the precision-recall break-even point. Since the two values depend on each other, one of them is often reported with the other fixed at a given value. Interpolating between the points, however, is not permissible: they are discrete points whose intermediate ranges are not defined.

Example

In a database with 36 documents, 20 documents are relevant to a search query and 16 are not relevant. A search returns 12 documents, of which 8 are actually relevant.

Recall and precision for this specific search follow from the values of the confusion matrix: 8 relevant documents are found (true positives), 4 irrelevant documents are found (false positives), 12 relevant documents are not found (false negatives) and 12 irrelevant documents are correctly not returned (true negatives). A short computational check is given after the list.

  • Hit rate (recall): 8 / (8 + 12) = 8/20 = 2/5 = 0.4
  • Precision: 8 / (8 + 4) = 8/12 = 2/3 ≈ 0.67
  • Fallout: 4 / (4 + 12) = 4/16 = 1/4 = 0.25
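
The same figures, recomputed in a short Python snippet for verification:

```python
# Worked example above: 36 documents, 20 of them relevant,
# a result set of 12 documents of which 8 are relevant.
relevant_total, retrieved, relevant_retrieved = 20, 12, 8
irrelevant_total = 36 - relevant_total

recall = relevant_retrieved / relevant_total                   # 8/20 = 0.4
precision = relevant_retrieved / retrieved                     # 8/12 ≈ 0.67
fallout = (retrieved - relevant_retrieved) / irrelevant_total  # 4/16 = 0.25

print(recall, precision, fallout)
```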

Practice and problems

A problem in calculating the hit rate is the fact that one rarely knows how many relevant documents exist in total but were not found (problem of the incomplete truth matrix). For larger databases, where calculating the absolute hit rate is particularly difficult, one therefore works with the relative hit rate: the same search is performed with several search engines, and newly found relevant hits are added to the relevant documents that were not found before. The capture-recapture method can be used to estimate how many relevant documents exist in total.

Another problem is that, in order to determine recall and precision, the relevance of a document must be known as a fact (yes/no). In practice, however, subjective relevance is often what matters. Even for ranked hit lists, stating recall and precision is often not sufficient, since it matters not only whether a relevant document is found but also whether it is ranked sufficiently high compared with non-relevant documents. For result sets of very different sizes, stating average values for recall and precision can be misleading.

Further application examples

HIV in Germany

The goal of an HIV test should be the most reliable possible detection of an infected person. What consequences a false-positive test can have is shown by the example of a man who was tested for HIV and then committed suicide because of a false-positive result.

Assuming an accuracy of 99.9% of a non-combined HIV test for both positive and negative results (sensitivity = specificity = 0.999) and the current spread of HIV (as of 2009) in the German population (82 million inhabitants, of whom 67,000 are HIV-positive), a general HIV test would be devastating. Therefore, a confirmatory test is required in Germany before the patient is informed of a positive result: a positive ELISA screening test is followed by a Western blot confirmatory test, so that a combined sensitivity of 0.999 and a specificity of 0.999996 are reached.

With the non-combined HIV test, only 67 of the 67,000 actually infected persons would go unrecognized, but about 82,000 people would be wrongly diagnosed as HIV-positive. Of the roughly 148,933 positive results, about 55% would be false positives, i.e. more than half of those who tested positive. The probability that someone with a positive ELISA result is really HIV-positive would therefore be only 45%; in other words, the positive predictive value is 45%. Given the very low error rate of 0.1%, this low value is due to the fact that HIV occurs in only about 0.08% of the population.
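
A short arithmetic check of these figures (rounding explains small differences from the numbers quoted above):

```python
# Recomputing the HIV example: screening test alone, no confirmatory test.
population, infected = 82_000_000, 67_000
sensitivity = specificity = 0.999

true_positives = sensitivity * infected                         # about 66,933 detected
false_negatives = infected - true_positives                     # about 67 missed
false_positives = (1 - specificity) * (population - infected)   # about 81,933

ppv = true_positives / (true_positives + false_positives)       # about 0.45
print(round(false_negatives), round(false_positives), round(ppv, 2))
```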

Heart attack in the U.S.

Every year in the United States, about four million men and women are admitted to hospital with chest pain under the tentative diagnosis of heart attack. In the course of the laborious and expensive diagnostic work-up, it then turns out that only about 32% of these patients have actually suffered a heart attack; in 68% the diagnosis of an infarction was not correct (false-positive diagnoses). On the other hand, approximately 34,000 patients are discharged from hospital each year without an actually present myocardial infarction having been detected (about 0.8% false-negative diagnoses).

In this example, too, the sensitivity of the examination is similarly high, namely 99.8%. The specificity cannot be determined, because the false-positive results of the examination are not known. Known are only the false-positive admission diagnoses, which are based on the indication "chest pain". If one considers only this admission diagnosis, then the figure of 34,000 patients who are wrongly discharged is worthless, because they have nothing to do with it. What is needed instead is the number of false negatives of the admission diagnosis, i.e. persons with a heart attack who were not admitted because they had no chest pain.

One should always take care not to use such conflated figures, and should pay close attention to a precise formulation of the question.
