Regression analysis

Regression analysis is a family of statistical analysis methods. Its most common objective is to determine relationships between a dependent variable and one or more independent variables. It is used in particular when such relationships are to be described quantitatively or values of the dependent variable are to be predicted.


History

The earliest form of regression was the method of least squares (French: méthode des moindres carrés), published in 1805 by Legendre and in 1809 by Gauss. Both used the method to determine the orbits of the planets around the sun from astronomical observations. Gauss published a further development of the theory of least squares in 1821, which contained a version of the Gauss-Markov theorem.

The term regression was coined in the 19th century by Francis Galton, a cousin of Charles Darwin. He used it to describe a biological phenomenon known as regression toward the mean, whereby descendants of tall parents tend to be only of average height. For Galton, regression had only this biological meaning. His work was later placed in a more general statistical context by Udny Yule and Karl Pearson. In their work it was assumed that the joint distribution of the independent and dependent variables is normal. This assumption was later weakened by R. A. Fisher, who required only that the conditional distribution of the dependent variable be normal, but not necessarily the joint distribution. In this respect, Fisher's approach is more similar to Gauss's formulation of 1821.

Regression methods are still an active area of research. In recent decades, estimation methods have been developed in many different areas, for example for robust regression, for nonparametric regression, in Bayesian statistics, for missing data, and for error-prone independent variables.

Mathematical formulation

Mathematically, the relationship between the independent variables x and the dependent variable y can be represented as

y = f(x) + e

where f is the sought (or proposed) function and e is the error, or residual, of the model.

Applications

  • If the goal is forecasting or prediction, the functional relationship determined by the regression method can be used to build a predictive model. If additional values of x then become available without an associated value of y, the fitted model can be used to predict y.
  • When a variable y and a number of variables x1, ..., xp are observed that may be related to y, regression methods can be used to quantify the strength of the relationship. In this way one can identify those xj that have no relationship with y, or those subsets xi, ..., xj that contain redundant information about y.

Scheme of a regression analysis

Data preparation

Every statistical procedure begins with the preparation of the data, in particular:

  • The plausibility check. This involves checking whether the data are plausible and understandable. This can be done manually or automatically on the basis of validation rules.
  • Dealing with missing data. Frequently, incomplete records are omitted; sometimes the missing values are filled in using certain imputation procedures.
  • The transformation of the data. This can be done for various reasons. For example, it can lead to better interpretability or visualizability of the data. It can also be used to bring the data into a form in which the assumptions of the regression method are satisfied. In linear regression, for instance, a linear relationship between the independent and dependent variables and homoscedasticity are assumed. There are mathematical tools for finding a suitable transformation, for example the Box-Cox transformation for linearizing the relationship (see the sketch after this list).
  • The consideration of interactions (in linear regression). In addition to the influence of the individual independent variables, the joint influence of several variables is considered simultaneously.
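The following minimal sketch illustrates such a transformation with the Box-Cox method, assuming Python with numpy and scipy; the wage data are simulated and purely hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical skewed, positive data (e.g. hourly wages); in practice this
# would be the raw dependent variable from the data set being prepared.
rng = np.random.default_rng(0)
wages = rng.lognormal(mean=2.0, sigma=0.5, size=200)

# scipy estimates the Box-Cox parameter lambda by maximum likelihood and
# returns the transformed values; lambda close to 0 corresponds to log(y).
wages_bc, lam = stats.boxcox(wages)
print(f"estimated Box-Cox lambda: {lam:.2f}")
```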

Model fit

Using mathematical methods, a function f is now determined so that the residuals e are minimal. The form of the function may already be largely determined by the method used: linear regression considers only approximately linear functions f, logistic regression only logistic functions. What exactly is to be understood by "minimal" also depends on the method used. In the least squares method, for example, the sum of the squared deviations between f(x) and y is minimized, but there are also so-called robust methods that minimize the sum of the absolute deviations.
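As an illustration of the least squares criterion, here is a small sketch (Python with numpy; the data are simulated for the example) that fits a straight line by minimizing the sum of squared residuals.

```python
import numpy as np

# Hypothetical data: one independent variable x and a dependent variable y.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=100)

# Least squares: choose the coefficients of f(x) = b0 + b1*x so that the
# sum of squared residuals sum((y - f(x))**2) is minimal.
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ coef
print("coefficients:", coef, "sum of squared residuals:", residuals @ residuals)
```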

Model validation

An important step of the regression analysis is model validation. Here it is checked whether the model describes the relationship well. Model validation includes:

  • Residual analysis. Many regression methods make assumptions about the residuals e of the model. For example, a certain distribution, constant variance, and absence of autocorrelation are assumed. Since the residuals are a result of the procedure, the assumptions can only be checked in retrospect. A typical tool for checking the distribution is the quantile-quantile plot.
  • Overfitting. This occurs when too many independent variables are included in the model. One method of testing for overfitting is cross-validation.
  • Examination of the data for outliers and influential data points. This checks which observations are not well fitted by the determined function f (outliers) and which observations strongly influence the determined function. A separate investigation of these observations is recommended. Mathematical tools for detecting outliers and influential points are Cook's distance and the Mahalanobis distance.
  • Multicollinearity of the independent variables (for linear models). A linear relationship between the independent variables can, on the one hand, affect the numerical stability of the procedure and, on the other hand, complicate the interpretation of the model and of the fitted function. Tools for quantifying multicollinearity are the variance inflation factor and the correlation matrix (a small sketch follows this list).
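One possible sketch for the multicollinearity check, written in plain numpy on hypothetical data; the helper function vif is introduced here only for illustration.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (without intercept).

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j
    on all other columns plus an intercept; values well above 10 are a
    common rule of thumb for problematic multicollinearity.
    """
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
        out[j] = 1.0 / (1.0 - r2)
    return out

# Hypothetical example with two strongly correlated regressors.
rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)
print(vif(np.column_stack([x1, x2, x3])))
```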

Forecast

The validated model can be used to predict values of y for given values of x. The predicted value of y is often accompanied by a confidence interval in order to express the uncertainty of the forecast.
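A sketch of such a forecast with an interval, assuming the statsmodels library and simulated example data:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical fitted model: y depends linearly on a single variable x.
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=100)
y = 1.0 + 0.8 * x + rng.normal(scale=0.5, size=100)

X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

# Forecast y for new x values together with 95% intervals.
x_new = sm.add_constant(np.array([2.5, 7.5]))
pred = model.get_prediction(x_new)
print(pred.summary_frame(alpha=0.05))   # mean, mean_ci_lower/upper, obs_ci_lower/upper
```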

Predictions within the range of the data used for model fitting are called interpolation; predictions outside that range are called extrapolation. Before carrying out extrapolations, one should examine the assumptions they imply thoroughly.

Variable selection and model comparison

If the goal of the analysis is to identify those independent variables that are particularly strongly related to the dependent variable y, several models with different independent variables are often built and compared. To compare two models, indicators such as the coefficient of determination or an information criterion are typically used.

There are automated procedures, such as so-called stepwise regression, that try to successively identify the model that best explains the relationship. The use of such procedures is controversial.
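One possible way to compare candidate models by an information criterion, sketched with statsmodels on simulated data:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: y depends on x1, while x2 is irrelevant noise.
rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(scale=1.0, size=n)

candidates = {
    "x1 only": np.column_stack([x1]),
    "x1 and x2": np.column_stack([x1, x2]),
}
for name, X in candidates.items():
    fit = sm.OLS(y, sm.add_constant(X)).fit()
    # Lower AIC and higher adjusted R^2 indicate the preferable model.
    print(f"{name}: AIC={fit.aic:.1f}, adj. R^2={fit.rsquared_adj:.3f}")
```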

Furthermore, in Bayesian statistics there is an approach in which a new model is derived from several models (by so-called model averaging) in an attempt to reduce the uncertainty arising from the choice of model.

Some regression methods

The following example is used to illustrate the different methods. Following Mincer (1974), 534 observations were drawn at random from the Current Population Survey 1985, with the following variables:

  • Lwage: natural logarithm of the hourly wage,
  • Educ: education (vocational training) in years, and
  • Exper: years of professional experience (= age - education - 6)

Mincer examined the relationship between the logarithm of the hourly wage (dependent variable) and education and professional experience (independent variables). The following graphs show, on the left, a spatial representation of the regression surface and, on the right, a contour plot. Positive residuals are drawn reddish, negative residuals bluish; the brighter an observation, the smaller the absolute value of its residual.

[Figure: linear regression, shown as a regression surface (left) and a contour plot (right)]
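Since the original CPS 1985 data are not reproduced here, the following sketch only indicates how a Mincer-type wage regression of this kind could be fitted; it assumes Python with statsmodels, and all numbers are simulated stand-ins.

```python
import numpy as np
import statsmodels.api as sm

# Simulated stand-in for the CPS 1985 variables (the real data set is not
# reproduced here): educ in years, exper in years, lwage = log hourly wage.
rng = np.random.default_rng(5)
n = 534
educ = rng.integers(8, 19, size=n)
exper = rng.integers(0, 41, size=n)
lwage = (0.6 + 0.09 * educ + 0.03 * exper - 0.0005 * exper**2
         + rng.normal(scale=0.4, size=n))

# Mincer-type wage regression: lwage on educ, exper and exper squared.
X = sm.add_constant(np.column_stack([educ, exper, exper**2]))
print(sm.OLS(lwage, X).fit().summary())
```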

Basic methods

Linear Regression

In linear regression, the model is specified such that the dependent variable y is a linear combination of the parameters (the regression coefficients), but not necessarily of the independent variables. For example, simple linear regression models the dependence on one independent variable x:

y = β0 + β1 x + e

In multiple linear regression, several independent variables or functions of the independent variables are taken into account. If, for example, the term β2 x² is added to the previous regression, we obtain:

y = β0 + β1 x + β2 x² + e

Although the expression on the right-hand side is quadratic in the independent variable x, it is linear in the parameters β0, β1 and β2. Thus this, too, is a linear regression.

The method of least squares is used to determine the model parameters.
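A small sketch (numpy, simulated data) showing that a model with a squared term is still estimated by ordinary least squares, because it is linear in the parameters:

```python
import numpy as np

# Hypothetical data with a curved relationship between x and y.
rng = np.random.default_rng(6)
x = rng.uniform(-3, 3, size=150)
y = 1.0 + 0.5 * x - 0.3 * x**2 + rng.normal(scale=0.3, size=150)

# Although x enters squared, the model y = b0 + b1*x + b2*x^2 + e is linear
# in the parameters b0, b1, b2, so ordinary least squares still applies.
X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated coefficients:", beta)
```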

Nonparametric regression

In nonparametric regression procedures, the form of the functional relationship f is not specified but is largely derived from the data. When estimating the unknown regression function at a given point, the data close to that point receive a greater weight than data points far away from it.

Various regression methods have become established for this estimation:

  • Kernel regression:
  • locally constant regression (Nadaraya-Watson estimator; see the sketch after this list),
  • local linear regression (local linear estimator), or
  • local polynomial regression (local polynomial estimator)
  • Multivariate adaptive regression splines (MARS)
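A minimal sketch of the Nadaraya-Watson estimator in plain numpy with a Gaussian kernel; the data are simulated, and the function name nadaraya_watson is chosen here only for illustration.

```python
import numpy as np

def nadaraya_watson(x_train, y_train, x_eval, bandwidth):
    """Locally constant kernel regression with a Gaussian kernel.

    Each prediction is a weighted mean of the observed y values, with
    weights that decay with the distance of x_train from the evaluation
    point (nearby observations count more than distant ones).
    """
    diffs = (x_eval[:, None] - x_train[None, :]) / bandwidth
    weights = np.exp(-0.5 * diffs**2)
    return (weights @ y_train) / weights.sum(axis=1)

# Hypothetical nonlinear relationship estimated without specifying f.
rng = np.random.default_rng(7)
x = np.sort(rng.uniform(0, 2 * np.pi, size=200))
y = np.sin(x) + rng.normal(scale=0.2, size=200)
grid = np.linspace(0, 2 * np.pi, 50)
print(nadaraya_watson(x, y, grid, bandwidth=0.3)[:5])
```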

Semiparametric regression

A disadvantage of nonparametric regression is that it suffers from the curse of dimensionality: the more explanatory variables there are, the more observations are needed to estimate the unknown regression function reliably at any given point. Therefore, a series of semiparametric models has been established that extend or use linear regression:

  • Additive models: here the unknown regression function is represented as a sum of nonparametric univariate regressions of the individual variables.
  • Index models: here the unknown regression function is likewise represented as a sum of nonparametric univariate regressions of indices, i.e. of linear combinations of the independent variables.

Robust regression

Regression methods based on least squares or the maximum likelihood method are not robust against outliers. Robust regression methods have been developed to overcome this weakness of the classical methods. For example, M-estimators can be used as an alternative (see the sketch below).
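A sketch contrasting least squares with a Huber M-estimator, assuming the statsmodels library and simulated, deliberately contaminated data:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data with a few gross outliers in y.
rng = np.random.default_rng(8)
x = rng.uniform(0, 10, size=100)
y = 1.0 + 0.5 * x + rng.normal(scale=0.5, size=100)
y[:5] += 20.0   # contaminate a few observations

X = sm.add_constant(x)
ols_fit = sm.OLS(y, X).fit()                             # least squares: pulled toward the outliers
m_fit = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()   # Huber M-estimator: downweights outliers
print("OLS coefficients:", ols_fit.params)
print("M-estimator coefficients:", m_fit.params)
```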

Generalized methods

Generalized Linear Models

In classical linear regression it is assumed that the residuals e are normally distributed. This model assumption is weakened in generalized linear models, where the residuals e may have a distribution from the exponential family of distributions. This is made possible by the use of

  • a known link function g, which depends on the distribution class of the residuals, and
  • the maximum likelihood method for determining the model parameters.

The link function connects the expected value of the dependent variable with the linear predictor: g(E(y | x)) = β0 + β1 x1 + ... + βp xp.

A special case of generalized linear models is logistic regression. It is often used when the dependent variable is an ordinal variable that can take only two or finitely many values.

Here the logistic function exp(t) / (1 + exp(t)) serves as the inverse link function (the choice depends on the distribution class of the residuals). An alternative would be the probit model.
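A sketch of logistic regression as a generalized linear model with the logit link, assuming statsmodels and simulated binary data:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical binary outcome generated from a logistic model.
rng = np.random.default_rng(9)
x = rng.normal(size=300)
p = 1.0 / (1.0 + np.exp(-(-0.5 + 1.5 * x)))   # logistic function applied to the linear predictor
y = rng.binomial(1, p)

# Generalized linear model: binomial response with the (canonical) logit link,
# fitted by maximum likelihood.
X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(fit.params)
```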

Generalized semiparametric models

This idea has also been adopted for the semiparametric models:

  • Generalized additive models (GAM)
  • Generalized partially linear models (GPLM)
  • Generalized additive partially linear models (GAPLM)

Specific methods

Autoregressive models

If the data points are ordered (for example, if the data form a time series), it is possible, for instance in the AR model and the ARCH model, to use previous data values as "independent" variables.
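A sketch of an AR(1) model in which the previous value serves as the "independent" variable, assuming statsmodels and a simulated series:

```python
import numpy as np
from statsmodels.tsa.ar_model import AutoReg

# Hypothetical time series following an AR(1) process y_t = 0.7*y_{t-1} + e_t.
rng = np.random.default_rng(10)
y = np.zeros(300)
for t in range(1, 300):
    y[t] = 0.7 * y[t - 1] + rng.normal(scale=1.0)

# The lagged value y_{t-1} plays the role of an "independent" variable.
fit = AutoReg(y, lags=1).fit()
print(fit.params)   # intercept and AR(1) coefficient
```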
