Coefficient of determination

The coefficient of determination (also called the determination coefficient, usually denoted R²) is a statistical measure of the proportion of the variability (variance) of a dependent variable that is explained by a statistical model. Indirectly, it thereby also measures the relationship between the dependent and the independent variables (see proportional reduction of error measures).

A unique definition exists only in the case of a linear regression model, namely the square of the multiple correlation coefficient. Otherwise there are several competing definitions (see pseudo-coefficient of determination).


The coefficient of determination

Interpretation

The measure R² gives the proportion of the variation of y (or, equivalently, of the variance of y) that is explained by the linear regression, and therefore lies between 0 and 1:

  0 ≤ R² ≤ 1

If R² = 0, the "best" linear regression model consists only of the constant ȳ; all other coefficients are zero. If R² = 1, the variable y can be fully explained by the linear regression model.

Construction

The variation of y is decomposed into the variation of the residuals (the variation not explained by the model) and the variation of the fitted values (the variation explained by the model):

  Σ (y_i − ȳ)² = Σ (y_i − ŷ_i)² + Σ (ŷ_i − ȳ)²

Here ȳ is the average of the y_i, and the ŷ_i are the fitted values from the regression model. The decomposition follows in two steps:

1. Expanding the square gives

  Σ (y_i − ȳ)² = Σ (y_i − ŷ_i)² + 2 Σ (y_i − ŷ_i)(ŷ_i − ȳ) + Σ (ŷ_i − ȳ)²

2. If the residuals come from a least-squares fit with an intercept, then Σ (y_i − ŷ_i) = 0 and Σ (y_i − ŷ_i) ŷ_i = 0, so the cross term vanishes.

Thus, the coefficient of determination is defined as:

  R² = Σ (ŷ_i − ȳ)² / Σ (y_i − ȳ)² = 1 − Σ (y_i − ŷ_i)² / Σ (y_i − ȳ)²

With this equation, the bounds for R² can be shown. R² is at most equal to 1, which occurs when Σ (y_i − ŷ_i)² = 0. This is the case if y_i = ŷ_i for each observation pair, i.e. all observation points of the scatter diagram lie on the regression line and all residuals are thus equal to 0. R² takes the value 0 if Σ (ŷ_i − ȳ)² = 0, or equivalently if Σ (y_i − ŷ_i)² = Σ (y_i − ȳ)². In that case the unexplained variation equals the total variation, and the regression equation explains nothing about y. As a result, it follows:

  0 ≤ R² ≤ 1

In the literature, the following notation is also used:

  • Variation of y: SQT (total sum of squares),
  • Variation of the residuals: SQR (sum of squared residuals) and
  • Variation of the fitted values: SQE (explained sum of squares).
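The decomposition above can be sketched numerically. This is a minimal illustration with made-up data (not from the article's example), using NumPy's `polyfit` for the least-squares line:

```python
import numpy as np

# Illustrative data (hypothetical values, not from the article)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.8])

# Ordinary least-squares fit with intercept
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

SQT = np.sum((y - y.mean())**2)      # total sum of squares
SQR = np.sum((y - y_hat)**2)         # sum of squared residuals
SQE = np.sum((y_hat - y.mean())**2)  # explained sum of squares

# For OLS with an intercept, the decomposition SQT = SQE + SQR holds
assert np.isclose(SQT, SQE + SQR)

# Both definitions of R^2 agree
r2_a = SQE / SQT
r2_b = 1 - SQR / SQT
assert np.isclose(r2_a, r2_b)
print(round(r2_a, 4))
```

Note that the cross term vanishes, and hence the two expressions for R² coincide, only because the fit includes an intercept and was obtained by least squares.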

Connection with correlation coefficients

In a simple regression (one independent variable), R² corresponds to the square of the Pearson correlation coefficient r and can be calculated from the covariance s_xy and the individual standard deviations s_x and s_y:

  R² = r² = (s_xy / (s_x · s_y))²

In a multiple regression (more than one independent variable), R² corresponds to the square of the multiple correlation coefficient, i.e. the correlation between y and ŷ.
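For the simple-regression case, the identity R² = r² can be checked directly. This sketch uses hypothetical data and computes r from the covariance and standard deviations, as in the formula above:

```python
import numpy as np

# Hypothetical data for illustration
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
y = np.array([1.1, 2.3, 2.8, 4.2, 5.0])

# Pearson correlation from covariance and standard deviations
s_xy = np.cov(x, y, ddof=1)[0, 1]
r = s_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# R^2 from the regression decomposition
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x
r2 = 1 - np.sum((y - y_hat)**2) / np.sum((y - y.mean())**2)

# In simple regression, R^2 equals the squared Pearson correlation
assert np.isclose(r**2, r2)
```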

Example

The following example illustrates the calculation of the coefficient of determination. Ten warships were randomly selected, and two features, length (m) and width (m), were analyzed. The scatter plot shows that there is obviously a linear relationship between the length and the width of the vessels:

That is, the width of the selected warships corresponds to roughly one sixth of their length.

With the average width ȳ, the total variation SQT of the widths (in m²) and the variation of the residuals SQR (in m²), the coefficient of determination results as

  R² = 1 − SQR / SQT ≈ 0.92

That is, about 92% of the variation in the width of the selected warships can be explained with the aid of their length. Only about 8% of the variation in the width remains unexplained; here one could, for example, search for further factors that influence the width of a warship.

The quality of the regression could also be assessed with the estimated standard deviation of the residuals:

  s = √(SQR / (n − 2))

For comparison, however, this requires knowledge of the variation of the y values. With the normalized coefficient of determination, one can see from the value of 92%, without any knowledge of the variation of the y values, that the linear regression fits very well.

Limits and criticism

  • The coefficient of determination indicates the quality of the linear approximation, but not whether the model has been correctly specified. Models estimated using least squares will therefore attain the highest R².
  • Common misconceptions include: "A high R² permits a good prediction." The pink data in the graph on the right show that the direction of the data changes for higher values of x.
  • "A high R² indicates that the estimated regression line is everywhere a good approximation to the data"; the red data suggest otherwise.
  • "An R² close to zero indicates that there is no relationship between the dependent and independent variables." The blue data in the graph on the right show a clear, but non-linear (here quadratic) relationship, although R² is equal to zero. A more complete picture can be obtained here by calculating nonlinear regressions, which can capture such relationships.
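The last point can be reproduced with a small constructed example. For data lying exactly on a parabola symmetric about x = 0, the best linear fit is a horizontal line, so the linear R² is zero even though the relationship is deterministic; a quadratic fit recovers it perfectly:

```python
import numpy as np

# A deterministic quadratic relation, symmetric around x = 0
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x**2

# Linear fit: by symmetry the slope is (numerically) zero,
# so the line explains none of the variation
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x
r2_linear = 1 - np.sum((y - y_hat)**2) / np.sum((y - y.mean())**2)

# Quadratic fit: captures the relation exactly
coeffs = np.polyfit(x, y, 2)
y_hat_q = np.polyval(coeffs, x)
r2_quadratic = 1 - np.sum((y - y_hat_q)**2) / np.sum((y - y.mean())**2)

print(abs(r2_linear) < 1e-8, abs(r2_quadratic - 1) < 1e-8)
```

This mirrors the blue data in the figure: a vanishing R² only rules out a *linear* relationship, not a relationship per se.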

The corrected coefficient of determination

Definition

The coefficient of determination has the property that it becomes larger as the number of independent variables increases, regardless of whether additional independent variables actually contribute to the explanatory power. It is therefore advisable to consult the corrected coefficient of determination (also called the adjusted coefficient of determination). With n observations and p independent variables, it is calculated as follows:

  R̄² = 1 − (1 − R²) · (n − 1) / (n − p − 1)

Here the explanatory power of the model, represented by R², is balanced against the complexity of the model, represented by the number p of independent variables. The more complex the model, the more each newly added independent variable is "penalized".

The adjusted coefficient of determination increases only when R² increases enough to compensate the opposing effect of the ratio (n − 1)/(n − p − 1), and it may also decrease. In this way it can be used as a decision criterion in choosing between two alternative model specifications (e.g. a restricted and an unrestricted model).

The corrected coefficient of determination can also assume negative values and is smaller than the unadjusted one, except when R² = 1, in which case R̄² = 1 as well.
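These properties can be made concrete with a small helper (the function name and the example numbers are chosen for illustration only):

```python
def adjusted_r2(r2, n, p):
    """Corrected (adjusted) R^2 for n observations and p regressors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# The same fit quality is penalized more as regressors are added
print(adjusted_r2(0.5, n=20, p=1))  # ≈ 0.4722
print(adjusted_r2(0.5, n=20, p=5))  # ≈ 0.3214

# R^2 = 1 implies adjusted R^2 = 1, regardless of p
assert adjusted_r2(1.0, n=20, p=5) == 1.0

# The adjusted value can become negative for weak models
assert adjusted_r2(0.05, n=10, p=4) < 0
```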

Construction

From the definition above it follows that

  R² = 1 − (SQR / n) / (SQT / n)

where SQR is the residual and SQT the total sum of squares.

However, SQR/n and SQT/n are not unbiased estimators of the variances. Replacing them with the unbiased estimators SQR/(n − p − 1) and SQT/(n − 1) yields the corrected coefficient of determination:

  R̄² = 1 − (SQR / (n − p − 1)) / (SQT / (n − 1))

Pseudo - coefficient of determination

In the case of a (linear) regression with a metric dependent variable y, the variance of y is used to describe the quality of the regression model. For a dependent variable on a nominal or ordinal scale level, however, there is no equivalent, since the variance, and thus an R², cannot be calculated. For these cases, various pseudo-coefficients of determination have been proposed.

Prediction coefficient of determination

While the coefficient of determination, the corrected coefficient of determination, and the pseudo-coefficients of determination make a statement about the model quality, the prediction coefficient of determination addresses the prediction quality of the model. In general, the prediction coefficient of determination will be smaller than the coefficient of determination.

First, the PRESS value (English: predicted residual error sum of squares) is calculated:

  PRESS = Σ (y_i − ŷ_(i))²

Here y_i is the observed value and ŷ_(i) the value that results as an estimate of y_i when all observations except the i-th are included in the regression model. That is, to calculate the PRESS value directly, n separate linear regression models would have to be fitted, each with n − 1 observations.

It can be shown, however, that the leave-one-out residual y_i − ŷ_(i) can be calculated from the ordinary regression residuals e_i = y_i − ŷ_i (using all observations) and the leverages h_ii (the diagonal of the hat matrix):

  y_i − ŷ_(i) = e_i / (1 − h_ii)

The prediction coefficient of determination is then given as

  R²_pred = 1 − PRESS / Σ (y_i − ȳ)²

where ȳ is the average of all y values.
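The leverage shortcut can be verified against the brute-force leave-one-out computation. This sketch uses hypothetical data and plain least squares via NumPy:

```python
import numpy as np

# Hypothetical data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8])

# Design matrix with intercept; OLS fit on all observations
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Leverages: diagonal of the hat matrix H = X (X^T X)^{-1} X^T
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# PRESS via the shortcut: sum of (e_i / (1 - h_ii))^2
press = np.sum((resid / (1 - h))**2)

# Brute-force check: refit leaving each observation out in turn
press_loo = 0.0
for i in range(len(y)):
    mask = np.arange(len(y)) != i
    b_i, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    press_loo += (y[i] - X[i] @ b_i)**2
assert np.isclose(press, press_loo)

r2_pred = 1 - press / np.sum((y - y.mean())**2)
r2 = 1 - np.sum(resid**2) / np.sum((y - y.mean())**2)
assert r2_pred < r2  # the prediction R^2 is smaller, as stated above
```

Since each leave-one-out residual e_i/(1 − h_ii) is at least as large in magnitude as e_i, PRESS ≥ SQR, which is why R²_pred cannot exceed R².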
