Logistic regression

Logistic regression, or the logit model, refers to regression analyses for the (usually multivariate) modeling of the distribution of discrete dependent variables. When a logistic regression is not specified more precisely as a multinomial or ordered logistic regression, binomial logistic regression for dichotomous dependent variables is usually meant. The independent variables may have any scale level; discrete variables with more than two levels are broken down into a series of binary dummy variables.

In the binomial case, the data are of the form (y_i, x_i), i = 1, \dots, n, where y_i \in \{0, 1\} denotes a binary dependent variable (the so-called regressand) that occurs together with x_i, a known and fixed covariate vector of regressors; n is the number of observations.

Motivation

The influences on discrete variables cannot be studied with the methods of classical linear regression analysis, since key application requirements, in particular normally distributed residuals and homoscedasticity, are not met. Furthermore, a linear regression model can lead to inadmissible predictions for such variables: if the two values of the dependent variable are coded 0 and 1, the prediction of a linear regression model can indeed be interpreted as a prediction of the probability that the dependent variable takes the value 1, formally \hat{y}_i = \hat{P}(y_i = 1 \mid x_i), but it can happen that values outside the interval [0, 1] are predicted. Logistic regression solves this problem by an appropriate transformation of the dependent variable.

The relevance of the logit model is also shown by the fact that Daniel McFadden and James Heckman were awarded the 2000 Nobel Memorial Prize in Economic Sciences for their contributions to its development.

Application requirements

In addition to the requirements on the variables described in the introduction, there are further application requirements. In particular, the regressors should not exhibit high multicollinearity.
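One common way to check this in practice is via variance inflation factors; a minimal sketch with statsmodels (the simulated data and the rule-of-thumb threshold of 10 are assumptions of this example):

    import numpy as np
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(0)
    X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])  # design matrix with intercept

    # variance inflation factors of the regressors (excluding the intercept column);
    # values well above 10 are commonly taken to indicate problematic multicollinearity
    vif = [variance_inflation_factor(X, j) for j in range(1, X.shape[1])]
    print(vif)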

Model Specification

The (binomial) logistic regression model is

P(y_i = 1 \mid x_i) = \frac{\exp(\eta_i)}{1 + \exp(\eta_i)} = \frac{1}{1 + \exp(-\eta_i)}

Here \eta_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} denotes the linear predictor.

It is based on the idea of the odds, that is, the ratio of the probability to the counter-probability, or (when the alternative category is coded with 0)

\mathrm{odds} = \frac{P(y = 1)}{1 - P(y = 1)} = \frac{P(y = 1)}{P(y = 0)}

While the odds can take values greater than 1, their range of values is bounded from below (it approaches 0 asymptotically). An unbounded range of values is achieved by transforming the odds into the so-called logits

\mathrm{logit} = \ln\!\left(\frac{P(y = 1)}{1 - P(y = 1)}\right)

which can take any value between minus and plus infinity.
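As a simple numerical illustration (the probabilities are chosen purely for this example): a probability of 0.8 corresponds to odds of 4 and a logit of about 1.39, whereas a probability of 0.5 corresponds to odds of 1 and a logit of 0:

P(y = 1) = 0.8: \quad \mathrm{odds} = \frac{0.8}{0.2} = 4, \quad \mathrm{logit} = \ln 4 \approx 1.386

P(y = 1) = 0.5: \quad \mathrm{odds} = \frac{0.5}{0.5} = 1, \quad \mathrm{logit} = \ln 1 = 0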

In logistic regression, the regression equation

\ln\!\left(\frac{P(y_i = 1)}{1 - P(y_i = 1)}\right) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik}

is then estimated; that is, regression weights are determined with which the estimated logits can be computed for a given matrix of independent variables X. The relationship between the logits and the corresponding probabilities follows the S-shaped logistic function.

The regression coefficients of the logistic regression are not easy to interpret directly. Therefore, the so-called effect coefficients are often formed by exponentiation; the regression equation then refers to the odds:

\frac{P(y_i = 1)}{1 - P(y_i = 1)} = \exp(\beta_0) \cdot \exp(\beta_1 x_{i1}) \cdots \exp(\beta_k x_{ik})

The coefficients \exp(\beta_j) are often referred to as effect coefficients (odds ratios). Effect coefficients smaller than 1 indicate a negative effect on the odds; a positive influence is present when \exp(\beta_j) > 1.
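As an illustrative interpretation (the coefficient value is assumed for this sketch only): a regression coefficient of \beta_j = 0.7 corresponds to an effect coefficient of

\exp(0.7) \approx 2.01

so a one-unit increase in x_{ij} roughly doubles the odds of y_i = 1, holding the other regressors constant.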

A further transformation expresses the influences of the logistic regression as influences on the probabilities themselves:

P(y_i = 1) = \frac{\exp(\beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik})}{1 + \exp(\beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik})}
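A minimal sketch in Python of this chain from linear predictor to odds to probability (the coefficient and covariate values are chosen purely for illustration):

    import numpy as np

    # assumed example values: intercept plus two regressors
    beta = np.array([-1.0, 0.7, 0.3])
    x = np.array([1.0, 2.0, 1.5])   # leading 1 for the intercept

    eta = x @ beta                  # linear predictor (the logit)
    odds = np.exp(eta)              # odds = exp(logit)
    p = odds / (1 + odds)           # probability = odds / (1 + odds)

    print(eta, odds, p)             # about 0.85, 2.34, 0.70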

Estimation method

Unlike in linear regression analysis, the regression coefficients cannot be computed directly in closed form. The maximum likelihood solution is therefore generally estimated with an iterative algorithm.
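A minimal sketch of such an iterative maximum likelihood estimation, here with Newton-Raphson (equivalently, iteratively reweighted least squares) updates in NumPy; the simulated data and the convergence settings are assumptions of this example:

    import numpy as np

    def fit_logit(X, y, n_iter=25, tol=1e-8):
        """Newton-Raphson estimation of a binomial logit model.
        X: (n, k) design matrix including a column of ones for the intercept.
        y: (n,) vector of 0/1 outcomes."""
        beta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            eta = X @ beta
            p = 1.0 / (1.0 + np.exp(-eta))     # predicted probabilities
            W = p * (1.0 - p)                  # IRLS weights
            grad = X.T @ (y - p)               # score vector
            hess = X.T @ (X * W[:, None])      # Fisher information
            step = np.linalg.solve(hess, grad)
            beta = beta + step
            if np.max(np.abs(step)) < tol:
                break
        return beta

    # example with simulated data
    rng = np.random.default_rng(0)
    X = np.column_stack([np.ones(500), rng.normal(size=500)])
    y = rng.binomial(1, 1 / (1 + np.exp(-(X @ np.array([-0.5, 1.0])))))
    print(fit_logit(X, y))                     # estimates close to (-0.5, 1.0)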

Model diagnosis

The regression parameters are estimated with the maximum likelihood method. Inferential methods are available both for the individual regression coefficients and for the model as a whole (see the Wald test and the likelihood-ratio test); in analogy to the linear regression model, methods of regression diagnostics have been developed with which individual cases with an excessive influence on the result of the model estimation can be identified. Finally, there are several proposals for computing a quantity that, in analogy to linear regression, allows an estimate of the "explained variance"; these are known as pseudo coefficients of determination. The AIC and BIC are also occasionally used in this context.
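A sketch of these diagnostics with the statsmodels library (simulated data as above; the attribute names are those provided by statsmodels' Logit results and are used here only as an illustration):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = sm.add_constant(rng.normal(size=(500, 2)))   # intercept plus 2 regressors
    y = rng.binomial(1, 1 / (1 + np.exp(-(X @ np.array([-0.5, 1.0, 0.3])))))

    result = sm.Logit(y, X).fit(disp=0)

    print(result.summary())        # Wald z-tests for the individual coefficients
    print(result.llr_pvalue)       # likelihood-ratio test of the overall model
    print(result.prsquared)        # McFadden's pseudo coefficient of determination
    print(result.aic, result.bic)  # information criteria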

For models used in risk adjustment in particular, the Hosmer-Lemeshow test is often used to assess goodness of fit. This test compares the predicted with the observed rates of events in subgroups of the population ordered by predicted probability of occurrence, often the deciles. The test statistic is calculated as

H = \sum_{g=1}^{G} \frac{(O_g - E_g)^2}{N_g \pi_g (1 - \pi_g)}

where O_g denotes the observed events, E_g the expected events, N_g the number of observations and \pi_g the predicted probability of occurrence in the g-th quantile; G is the number of groups.
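A minimal sketch of this calculation in Python, grouping the observations by deciles of the predicted probabilities (the inputs y and p_hat are assumed to come from a fitted model; comparing the statistic to a chi-squared distribution with G - 2 degrees of freedom is the usual convention):

    import numpy as np
    from scipy import stats

    def hosmer_lemeshow(y, p_hat, groups=10):
        """Hosmer-Lemeshow statistic with probability-ordered groups."""
        order = np.argsort(p_hat)
        y, p_hat = y[order], p_hat[order]
        H = 0.0
        for idx in np.array_split(np.arange(len(y)), groups):
            N_g = len(idx)
            O_g = y[idx].sum()           # observed events in group g
            pi_g = p_hat[idx].mean()     # mean predicted probability in group g
            E_g = N_g * pi_g             # expected events in group g
            H += (O_g - E_g) ** 2 / (N_g * pi_g * (1 - pi_g))
        p_value = stats.chi2.sf(H, groups - 2)
        return H, p_value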

ROC curves are also used to assess the predictive power of logistic regressions, with the area under the ROC curve (AUC) serving as a quality criterion.
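A short sketch with scikit-learn, continuing the statsmodels example above (the predicted probabilities p_hat are an assumption of this illustration):

    from sklearn.metrics import roc_auc_score, roc_curve

    p_hat = result.predict(X)                   # predicted probabilities of the fitted model

    auc = roc_auc_score(y, p_hat)               # area under the ROC curve (AUC)
    fpr, tpr, thresholds = roc_curve(y, p_hat)  # points of the ROC curve
    print(auc)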

Alternatives and Enhancements

As an (essentially equivalent) alternative, the probit model can be used, in which a normal distribution is assumed instead of the logistic distribution.

Logistic regression (and the probit model) can also be extended to dependent variables with more than two discrete categories (see multinomial logistic regression and ordered logistic regression).
