Linear regression

Linear regression is a special case of the general concept of regression analysis, which attempts to explain a dependent variable by one or more independent variables. The adjective "linear" refers to the fact that the regression coefficients (not necessarily the variables themselves!) enter the regression model in the first power.


Simple Linear Regression

A special case of regression models is the linear model. The simplest such model is called simple linear regression; here the data are of the form $(x_i, y_i)$, $i = 1, \dots, n$. One chooses the model

$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i.$

Thus one assumes a linear relationship between $x$ and $y$. The values $y_i$ are regarded as realizations of random variables $Y_i$; the $x_i$ are not stochastic, but fixed measuring points. The aim of the regression analysis in this case is to determine the unknown parameters $\beta_0$ and $\beta_1$.

Assumptions

So that the regression estimates can be analyzed inferentially, certain assumptions must be satisfied in the linear regression model:

1. With regard to the disturbance: $\mathrm{E}(\varepsilon_i) = 0$, $\operatorname{Var}(\varepsilon_i) = \sigma^2$ for all $i$, and $\operatorname{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$.

2. The data matrix $X$, which is stated explicitly in the section on multiple regression, is fixed (non-stochastic).

3. The data matrix $X$ has full rank $p$, where $p$ denotes the number of regression coefficients.

  • Under the first assumption, all disturbances thus have the same variance (homoskedasticity) and are pairwise uncorrelated. This is interpreted to mean that the disturbance contains no information and merely scatters randomly; therefore $y$ can only be explained by information from $x$.
  • The second assumption holds $X$ constant.
  • The third assumption is required for a unique solution of the regression problem.

Example

Here, simple linear regression is illustrated with an example.

A renowned sparkling-wine producer wants to bring a high-quality Riesling sparkling wine onto the market. To determine the selling price, a price-demand function is first to be estimated. For this purpose, a test sale is conducted in six stores, yielding six pairs of values with the respective retail price of a bottle (in euros) and the number of bottles sold:

A scatterplot of the price and the quantity of sparkling-wine bottles sold results in the following graphic.

Calculation of the regression line

We start with the following statistical model:

One considers two variables $x$ and $y$ that presumably stand approximately in a linear relation

$y \approx \beta_0 + \beta_1 x.$

The assumption of a linear relationship is supported by the scatterplot above: the plotted points lie almost on a straight line. In addition, $x$ is defined as the independent and $y$ as the dependent variable. For $x$ and $y$ there are $n$ observations $x_i$ and $y_i$, where $i$ runs from 1 to $n$. The functional relationship between $x$ and $y$ cannot be determined exactly, because $y$ is overlaid by a disturbance. This disturbance is modeled as a random variable $\varepsilon$ (of the population) that represents influences which cannot be captured (human behavior, measurement inaccuracies, and the like). This results in the model

$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i.$

Since $\beta_0$ and $\beta_1$ are not known, $y$ cannot be decomposed into the components $\beta_0 + \beta_1 x$ and $\varepsilon$. Instead, a mathematical estimate of the parameters $\beta_0$ and $\beta_1$ by $b_0$ and $b_1$ is to be found, such that

$y_i = b_0 + b_1 x_i + r_i$

results, where $r_i$ denotes the residual of the sample. The residual is the difference between the regression line and the measured value. Further, $\hat{y}_i$ denotes the estimated value, and it holds that

$\hat{y}_i = b_0 + b_1 x_i \quad \text{and} \quad r_i = y_i - \hat{y}_i.$

There are several ways to estimate the line. One can lay a line through the point cloud so that the sum of the squared residuals, i.e., the vertical deviations of the points from this best-fit line, is minimized. Plotting the true unknown regression line and the estimated regression line in a common graphic results in the following figure.

This conventional method is the method of least squares. It minimizes the sum of the squared residuals,

$\sum_{i=1}^n r_i^2 = \sum_{i=1}^n (y_i - b_0 - b_1 x_i)^2,$

with respect to $b_0$ and $b_1$. By partially differentiating and setting the first-order derivatives to zero, one obtains a system of normal equations.

The desired regression coefficients are the solutions

$b_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{s_{xy}}{s_x^2}$

and

$b_0 = \bar{y} - b_1 \bar{x},$

with $\bar{x}$ the arithmetic mean of the $x$ values and $\bar{y}$ that of the $y$ values. Here $s_{xy}$ represents the empirical covariance between $x$ and $y$, and $s_x^2$ the empirical variance of $x$. These estimates are also called least-squares estimators or ordinary least squares (OLS) estimators.

For the numerical example, $\bar{x}$ and $\bar{y}$ result directly from the data. The estimates of $b_0$ and $b_1$ are then obtained by simply inserting into the above formulas. The intermediate values required in these formulas are shown in the following table.

Inserting these values into the formulas gives the estimates $b_0$ and $b_1$ for the example.

The estimated regression line is thus

$\hat{y} = b_0 + b_1 x,$

so that one can assume that for each additional euro of price, sales drop by an average of about one bottle.
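This calculation can be reproduced in R. The following is a minimal sketch with assumed test-sale data (the original table of six value pairs is not reproduced here, so the numbers are illustrative only):

    # Assumed test-sale data (illustrative): price per bottle in euros, bottles sold
    price <- c(20, 16, 15, 16, 13, 10)
    sold  <- c(0, 3, 7, 4, 6, 10)

    # Least-squares estimates via the formulas above
    b1 <- cov(price, sold) / var(price)   # slope: empirical covariance / variance of x
    b0 <- mean(sold) - b1 * mean(price)   # intercept
    c(intercept = b0, slope = b1)

    # The same fit with R's built-in function
    lm(sold ~ price)

With data of this kind, the slope comes out close to minus one bottle per euro, matching the interpretation above.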

Visualization and interpretation

As is emphasized again and again in the statistical literature, a high value of the correlation coefficient of two random variables $X$ and $Y$ alone is not sufficient evidence of a causal relationship between $X$ and $Y$, nor of its possible direction.

Contrary to what is commonly described, in the linear regression of two random variables $X$ and $Y$ one therefore always has to deal with not one but two independent regression lines: the first for the assumed linear dependence of $y$ on $x$, the second for the equally possible dependence of $x$ on $y$.

If one refers to the direction of the x-axis as horizontal and that of the y-axis as vertical, the calculation of the regression coefficient amounts in the first case to the usual minimization of the vertical squared deviations, in the second case, by contrast, to the minimization of the horizontal squared deviations.

Viewed outwardly, the two regression lines form a pair of scissors whose intersection and pivot point is the center of gravity of the investigated point cloud. The further these scissors open, the lower the correlation of the two variables, up to the orthogonality of the two regression lines, expressed numerically by a correlation coefficient of 0 or a cutting angle of 90°.

Conversely, the correlation of the two variables increases the more the scissors close; with collinearity of the direction vectors of the two regression lines, i.e., when both figuratively lie on top of each other, the correlation coefficient takes, depending on the sign of the covariance, the maximum value $+1$ or the minimum value $-1$. This means that a strictly linear relationship exists between $X$ and $Y$, and, mind you only in this one case, no second regression line needs to be calculated.

As the following comparison shows, the equations of the two regression lines have great formal similarity, for instance as regards their slopes, which equal the respective regression coefficients and differ only in their denominator: in the first case the variance of $x$, in the second that of $y$:

$\hat{y} = \bar{y} + \frac{s_{xy}}{s_x^2}(x - \bar{x}), \qquad \hat{x} = \bar{x} + \frac{s_{xy}}{s_y^2}(y - \bar{y}).$

In addition, one recognizes the mathematically intermediate position of the correlation coefficient and its square, the so-called coefficient of determination, relative to the two regression coefficients: it arises by placing the geometric mean $s_x s_y$ of the two variances in the denominator instead of $s_x^2$ or $s_y^2$:

$r = \frac{s_{xy}}{s_x s_y}, \qquad r^2 = \frac{s_{xy}}{s_x^2} \cdot \frac{s_{xy}}{s_y^2}.$

Considering the differences $x_i - \bar{x}$ as components of an n-dimensional vector $\mathbf{x}$ and the differences $y_i - \bar{y}$ as components of an n-dimensional vector $\mathbf{y}$, the correlation coefficient can finally be interpreted as the cosine of the angle enclosed by the two vectors:

$r = \cos\angle(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{|\mathbf{x}|\,|\mathbf{y}|}.$
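A short R sketch, reusing the assumed sparkling-wine data from above, illustrates how the two slopes and the correlation coefficient are linked:

    # Slopes of the two regression lines (y on x, and x on y)
    b_yx <- cov(price, sold) / var(price)   # minimizes vertical squared deviations
    b_xy <- cov(price, sold) / var(sold)    # minimizes horizontal squared deviations

    # Correlation: covariance over the geometric mean of both variances
    r <- cov(price, sold) / (sd(price) * sd(sold))
    all.equal(r^2, b_yx * b_xy)   # the squared correlation is the product of both slopes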

Example in Brief

For the previous sparkling-wine example, the following table results:

and from it the following values:

This yields the estimated regression line together with its coefficient of determination.

Multiple Regression

In the following, multiple regression is introduced, starting from the simple linear regression. Here the response $y$ depends linearly on several fixed, predetermined covariates, and we thus obtain a model of the form

$y = \beta_0 + \beta_1 x_1 + \dots + \beta_{p-1} x_{p-1} + \varepsilon,$

where $\varepsilon$ again represents the disturbance. It is a random variable, and $y$, as a linear transformation of a random variable, is therefore one as well. For $y$ and the covariates there are $n$ observations each, so that for the observations $i = 1, \dots, n$ the system of equations

$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_{p-1} x_{i,p-1} + \varepsilon_i$

results. Here $p$ indicates the number of parameters to be estimated, i.e., the dimension of the covariate vector. In simple linear regression only the case $p = 2$ was considered; multiple regression with general $p$ is now presented as a generalization of it. As in simple linear regression, the first covariate is constantly equal to 1 in applications. In the sampling-theoretical approach, each sample element $y_i$ is interpreted as a distinct random variable, and likewise each disturbance $\varepsilon_i$.

Since this is a linear system of equations, its elements can be combined in matrix notation. This gives the column vectors $y$ of the dependent variable and $\varepsilon$ of the disturbance as random vectors, and the column vector $\beta$ of the regression coefficients, where

$y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_0 \\ \vdots \\ \beta_{p-1} \end{pmatrix}.$

The data matrix $X$, written out explicitly, is

$X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1,p-1} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{n,p-1} \end{pmatrix}.$

Furthermore, as mentioned in the section on simple linear regression, one makes the assumptions

$\mathrm{E}(\varepsilon) = 0 \quad \text{and} \quad \operatorname{Cov}(\varepsilon) = \sigma^2 I_n.$

Thus it holds for $y$ that

$\mathrm{E}(y) = X\beta \quad \text{and} \quad \operatorname{Cov}(y) = \sigma^2 I_n.$

Furthermore, the system of equations can now be represented considerably more simply as

$y = X\beta + \varepsilon.$

Estimation of the regression coefficients

In the multiple linear regression model, too, the residual sum of squares is minimized by the least-squares method. As the solution of this minimization problem, the vector of estimated regression coefficients is obtained as

$b = (X^T X)^{-1} X^T y.$

By the Gauss-Markov theorem, this estimator is BLUE (Best Linear Unbiased Estimator), i.e., the best (unbiased with the smallest variance) linear unbiased estimator. For these properties of the estimator, no distributional information about the disturbance needs to be available.
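As a minimal sketch, the matrix formula can be evaluated directly in R; the data below are hypothetical and only for illustration:

    # Hypothetical data for illustration
    set.seed(42)
    n <- 50
    X <- cbind(1, rnorm(n), rnorm(n))      # data matrix with intercept column
    beta <- c(2, 1, -0.5)                  # assumed "true" coefficients
    y <- X %*% beta + rnorm(n, sd = 0.3)   # response with disturbance

    # Least-squares estimator b = (X'X)^{-1} X'y;
    # solve(A, b) avoids explicitly inverting X'X
    b <- solve(t(X) %*% X, t(X) %*% y)
    drop(b)

In practice one would use lm(), which computes the same estimate via a numerically more stable QR decomposition.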

Using the least-squares estimator, one obtains the system of equations

$y = Xb + \hat{\varepsilon},$

where $\hat{\varepsilon}$ is the vector of residuals and $b$ is the estimate for $\beta$. The interest of the analysis often lies in estimating or predicting the dependent variable for a given covariate tuple $x_0$. It is calculated as

$\hat{y}_0 = x_0^T b.$

Selected estimators

The estimated values $\hat{y}_i$ of $y_i$ are calculated as

$\hat{y} = Xb = X(X^T X)^{-1} X^T y,$

which can also be written more briefly as

$\hat{y} = Hy \quad \text{with} \quad H = X(X^T X)^{-1} X^T.$

The matrix $H$ is idempotent and has rank at most $p$. It is also called the hat matrix because it puts the "hat" on $y$.

The residuals are calculated as

$\hat{\varepsilon} = y - \hat{y} = (I - H)y,$

where the matrix $I - H$ has properties comparable to those of $H$.

The forecast is determined as

$\hat{y}_0 = x_0^T b.$

Since $X$ is fixed, all of these quantities can be represented as linear transformations of $y$, and thus of $\varepsilon$; therefore their expectation vectors and covariance matrices can be determined without problems.

The residual sum of squares (SSR, from the English "residual sum of squares") results in matrix notation as

$SSR = \hat{\varepsilon}^T \hat{\varepsilon} = y^T (I - H)\, y.$

This may furthermore also be written as

$SSR = \sum_{i=1}^n (y_i - \hat{y}_i)^2.$

The variance $\sigma^2$ is estimated using the residuals as the mean residual sum of squares,

$s^2 = \frac{SSR}{n - p}.$
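Continuing the hypothetical R sketch from the estimation section, these quantities follow directly:

    # Hat matrix H = X (X'X)^{-1} X'
    H <- X %*% solve(t(X) %*% X) %*% t(X)

    y_hat <- H %*% y      # fitted values
    e_hat <- y - y_hat    # residuals, equivalently (diag(n) - H) %*% y

    SSR <- sum(e_hat^2)   # residual sum of squares
    p <- ncol(X)
    s2 <- SSR / (n - p)   # estimate of the disturbance variance
    s2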

Estimation and Testing

For inferential regression (estimation and testing), information about the distribution of the disturbance is additionally required. In addition to the assumptions already listed earlier, a further assumption is made here:

4. The disturbance is normally distributed.

Together with the first assumption, one obtains for the distribution of the vector of disturbances

$\varepsilon \sim N(0, \sigma^2 I_n),$

where $0$ denotes the zero vector. For normally distributed random variables, uncorrelatedness implies stochastic independence. Since the estimators of interest are for the most part linear transformations of $\varepsilon$, they are also normally distributed with the corresponding parameters. Furthermore, the residual sum of squares, as a (non-linear) quadratic transformation of $\varepsilon$, is $\chi^2$-distributed with $n - p$ degrees of freedom:

$\frac{SSR}{\sigma^2} \sim \chi^2_{n-p}.$

Sketch of proof: By assumption 4, $\varepsilon \sim N(0, \sigma^2 I_n)$, and the residual sum of squares can be written as the quadratic form

$SSR = \varepsilon^T (I - H)\, \varepsilon,$

where the matrix $I - H$ is idempotent with rank $n - p$. For such quadratic forms in normally distributed vectors, it follows that $SSR/\sigma^2 \sim \chi^2_{n-p}$. For details, see the article on the coefficient of determination.

Quality of the regression model

Once a regression has been calculated, one is also interested in its quality. The coefficient of determination $R^2$ is often used as a measure of quality. In general, the closer the value of the coefficient of determination is to 1, the greater the quality of the regression. If the coefficient of determination is small, its significance can be tested via the hypothesis $H_0\colon R^2 = 0$ with the test statistic

$F = \frac{R^2/(p-1)}{(1-R^2)/(n-p)},$

which is F-distributed with $p-1$ and $n-p$ degrees of freedom. If the test statistic exceeds, at a significance level $\alpha$, the critical value $F(1-\alpha;\, p-1,\, n-p)$, the $(1-\alpha)$-quantile of the F distribution with $p-1$ and $n-p$ degrees of freedom, $H_0$ is rejected. $R^2$ is then sufficiently large, so the covariates probably carry enough information to explain $y$.
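Continuing the hypothetical sketch, the overall F test can be carried out by hand in R:

    # Coefficient of determination and overall F test
    R2 <- 1 - SSR / sum((y - mean(y))^2)
    Fstat <- (R2 / (p - 1)) / ((1 - R2) / (n - p))

    # Reject H0 at level alpha if the statistic exceeds the F quantile
    alpha <- 0.05
    Fstat > qf(1 - alpha, df1 = p - 1, df2 = n - p)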

Under the assumptions of the classical linear regression model, the test is a special case of one-factor ANOVA. For each observation value (= each group), the disturbance, and thus the response, is normally distributed (with the true regression value in the population as mean), so the conditions of the ANOVA are met. That all coefficients are equal to zero is then equivalent to the null hypothesis of the ANOVA.

Residual analysis, i.e., plotting the residuals against the independent variables, provides information on

  • the accuracy of the assumed linear relationship,
  • potential outliers,
  • homoscedasticity or heteroscedasticity.

One goal of residual analysis is to check the assumption about the unobserved disturbances. It is important to note that

$\hat{\varepsilon}_i \neq \varepsilon_i$

holds: the residual $\hat{\varepsilon}_i$ can be calculated with the formula $\hat{\varepsilon}_i = y_i - \hat{y}_i$, whereas the disturbance $\varepsilon_i$ is neither predictable nor observable. According to the assumptions made above, the model satisfies

$\operatorname{Var}(\varepsilon_i) = \sigma^2 \quad \text{for all } i.$

Variance homogeneity is therefore present. This phenomenon is also referred to as homoskedasticity and carries over to the residuals. This means that when the independent variables are plotted against the residuals, no systematic patterns should be recognizable.
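A minimal R sketch of such a residual plot, reusing the assumed sparkling-wine data from the simple-regression example:

    fit <- lm(sold ~ price)
    plot(price, resid(fit),
         xlab = "independent variable", ylab = "residuals")
    abline(h = 0, lty = 2)   # no systematic pattern should be visible around this line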

Examples 1 to 3 of residual analysis (figures)

In the above three figures, the independent variables were plotted against the residuals. In Example 1, one sees that there is in fact no discernible pattern in the residuals, i.e., the assumption of variance homogeneity is satisfied. In Examples 2 and 3, however, this assumption does not hold: a pattern is recognizable. Before applying linear regression there, suitable transformations must therefore first be performed. In Example 2 one recognizes a pattern resembling a sine function, so a data transformation of corresponding trigonometric form would be possible, while in Example 3 a pattern reminiscent of a parabola can be seen, in which case a quadratic data transformation might be appropriate.

Contribution of the individual regressors to explain y

One is also interested in whether individual parameters or covariates can be removed from the regression model. This is possible if a parameter equals zero, so one tests the null hypothesis $H_0\colon \beta_j = 0$. That is, one tests whether the $j$-th parameter is equal to zero; if this is the case, the associated $j$-th covariate can be removed from the model. As a linear transformation of $\varepsilon$, the vector $b$ is distributed as

$b \sim N\!\left(\beta,\; \sigma^2 (X^T X)^{-1}\right).$

If one estimates the variance $\sigma^2$ of the disturbance by $s^2$, one obtains for the estimated covariance matrix

$\widehat{\operatorname{Cov}}(b) = s^2 (X^T X)^{-1}.$

The estimated variance $\operatorname{se}(b_j)^2$ of a regression coefficient $b_j$ is the $j$-th diagonal element of the estimated covariance matrix. This gives the test statistic

$t = \frac{b_j}{\operatorname{se}(b_j)},$

which is t-distributed with $n-p$ degrees of freedom. If $|t|$ is greater than the critical value $t(1-\alpha/2;\, n-p)$, the $(1-\alpha/2)$-quantile of the t-distribution with $n-p$ degrees of freedom, the hypothesis is rejected. In that case the covariate $x_j$ is retained in the model, and the contribution of the regressor $x_j$ to the explanation of $y$ is significantly large.
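The coefficient tests can likewise be sketched in R, continuing the hypothetical example:

    # Estimated covariance matrix of b and coefficient-wise t statistics
    cov_b <- s2 * solve(t(X) %*% X)
    se_b  <- sqrt(diag(cov_b))
    tstat <- drop(b) / se_b

    # Two-sided test: reject H0: beta_j = 0 where |t| exceeds the quantile
    abs(tstat) > qt(1 - alpha / 2, df = n - p)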

Forecast

Once a forecast value has been determined, one would like to know in what interval the predicted values move with a specified probability. Thus a confidence interval for the average forecast value $\mathrm{E}(y_0)$ is determined. The variance of the forecast is calculated as

$\operatorname{Var}(\hat{y}_0) = \sigma^2\, x_0^T (X^T X)^{-1} x_0.$

One then obtains as the $(1-\alpha)$ confidence interval for the average forecast value, with estimated variance,

$\hat{y}_0 \pm t(1-\alpha/2;\, n-p)\; s\, \sqrt{x_0^T (X^T X)^{-1} x_0}.$

For the special case of simple linear regression, this yields

$\hat{y}_0 \pm t(1-\alpha/2;\, n-2)\; s\, \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}}.$

From this form of the confidence interval it can be seen at once that the confidence interval becomes wider the further the exogenous forecast variable $x_0$ is removed from the "center" $\bar{x}$ of the data. Estimates of the endogenous variable should therefore lie within the observation range of the data; otherwise they are very unreliable.
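A sketch of the interval calculation in R, continuing the hypothetical example; the covariate tuple x0 is assumed purely for illustration:

    # Confidence interval for the mean forecast at an assumed tuple x0
    x0 <- c(1, 0.5, -1)
    y0_hat <- drop(x0 %*% b)
    half <- qt(1 - alpha / 2, df = n - p) *
            sqrt(s2 * drop(t(x0) %*% solve(t(X) %*% X) %*% x0))
    c(lower = y0_hat - half, upper = y0_hat + half)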

Example

To illustrate multiple regression, the following example examines how the dependent variable Y, gross value added (at constant prices of 1995, adjusted, in billions of euros), depends on the independent variables "gross value added by industry in Germany (in current prices, in billions of EUR)". The data can be found in the statistics portal. Since the calculation of a regression model is usually carried out on a computer, this example demonstrates how a multiple regression can be performed with the statistical software R.

First, one can output a scatterplot matrix. In it, one sees that total value added is obviously positively correlated with the value added of the economic sectors: the data points in the first column of the chart lie approximately on straight lines with positive slopes. It is striking that value added in construction is negatively correlated with the other sectors, recognizable from the data points in the fourth column lying approximately on straight lines with negative slopes.

In a first step, the model with all covariates is specified in R:

    lm(BWSb95 ~ BBLandFF + BBProdG + BBBau + BBHandGV + BBFinVerm + BBDienstÖP)

If one then outputs a summary of the model with all covariates, one obtains the following listing.

    Residuals:
         Min       1Q   Median       3Q      Max
     -1.5465  -0.8342  -0.1684   0.5747   1.5564

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) 145.6533    30.1373    4.833 0.000525 ***
    BBLandFF      0.4952     2.4182    0.205 0.841493
    BBProdG       0.9315     0.1525    6.107 7.67e-05 ***
    BBBau         2.1671     0.2961    7.319 1.51e-05 ***
    BBHandGV      0.9697     0.3889    2.494 0.029840 *
    BBFinVerm     0.1118     0.2186    0.512 0.619045
    BBDienstÖP    0.4053     0.1687    2.402 0.035086 *
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 1.222 on 11 degrees of freedom
    Multiple R-squared: 0.9889,     Adjusted R-squared: 0.9828
    F-statistic: 162.9 on 6 and 11 DF,  p-value: 4.306e-10

The test for the goodness of the overall regression model yields a test statistic of F = 162.9. This test statistic has a very small p-value (4.306e-10), hence the fit is significantly good.

Analysis of the individual contributions of the variables (table Coefficients) of the regression model shows, at a significance level of 0.05, that the variables BBLandFF and BBFinVerm obviously explain the variable BWSb95 insufficiently. This can be recognized from the relatively small t values associated with these two variables, so that the hypothesis that the coefficients of these variables are zero cannot be rejected.

The variables BBHandGV and BBDienstÖP are barely significant. Y (in this example BWSb95) is especially strongly correlated with the variables BBProdG and BBBau, as can be seen from the associated high t values.

In the next step, the insignificant covariates BBLandFF and BBFinVerm are removed from the model.

    lm(BWSb95 ~ BBProdG + BBBau + BBHandGV + BBDienstÖP)

If one again outputs a summary of the model, one obtains the following listing.

    Residuals:
          Min        1Q    Median        3Q       Max
     -1.34447  -0.96533  -0.05579   0.82701   1.42914

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) 158.00900   10.87649  14.528 2.05e-09 ***
    BBProdG       0.93203    0.14115   6.603 1.71e-05 ***
    BBBau         2.03613    0.16513  12.330 1.51e-08 ***
    BBHandGV      1.13213    0.13256   8.540 1.09e-06 ***
    BBDienstÖP    0.36285    0.09543   3.802   0.0022 **
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 1.14 on 13 degrees of freedom
    Multiple R-squared: 0.9886,     Adjusted R-squared: 0.985
    F-statistic: 280.8 on 4 and 13 DF,  p-value: 1.783e-12

This model yields a test statistic of F = 280.8. This test statistic has an even smaller p-value (1.783e-12), hence the fit is better than in the first model. This is mainly due to the fact that all covariates in the present model are significant.

Special Applications of regression analysis

Special applications of regression analysis also concern the analysis of discrete dependent variables and dependent variables with restricted ranges of values. A distinction can be made by the nature of the dependent variable and the type of range restriction. The regression models that can be used in these cases are listed below. More details can be found in Frone (1997) and Long (1997).

Models for different types of dependent variables (Generalized Linear Models):

  • Binary: logistic regression and probit regression
  • Ordinal: ordinal logistic regression and ordinal probit regression
  • Count data: Poisson regression, negative binomial regression
  • Nominal: multinomial logistic regression

Models for different types of restricted value ranges:

  • Censored: Tobit model
  • Truncated: truncated regression
  • Randomly selected: sample-selected regression

Application in econometrics

For quantitative economic analyses in the context of regression analysis, for instance in econometrics, the following are particularly suitable:

  • Growth functions, such as the law of organic growth or the calculation of compound interest,
  • Decay functions, such as the hyperbolic distribution function or the Korach price function,
  • Gooseneck (S-shaped) functions, such as the logistic function used in the context of logistic regression, the Johnson function, or the power-exponential function,
  • Degressive saturation functions, such as the Gompertz function or the Törnquist function.