Overfitting

Overfitting denotes an excessive adaptation of a model to a given data set. In statistics, overfitting means specifying a model that contains too many independent variables.

Mathematical definition

Given a hypothesis space H and a hypothesis h from H. Then h is said to overfit the training data if there is an alternative hypothesis h' from H such that h has a smaller error than h' with respect to the training data, but h' has a smaller error than h with respect to the distribution of all instances.
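This definition can be illustrated with a small invented example (the instance space, hypotheses, and training points below are all hypothetical): a memorizing hypothesis h fits the training data, including a noisy label, better than a simple threshold hypothesis h', yet h' has the smaller error over the whole instance space.

```python
# Hypothetical illustration of the definition above.
# Instance space: x = 0..99; true concept: label 1 iff x >= 50.
instances = list(range(100))

def true_label(x):
    return 1 if x >= 50 else 0

# Training set with one noisy label (x = 40 is mislabeled as 1).
train_x = [10, 20, 30, 40, 60, 70, 80]
train_y = [0, 0, 0, 1, 1, 1, 1]

def h_prime(x):
    """Simple threshold hypothesis: ignores the noisy point."""
    return 1 if x >= 50 else 0

def h(x):
    """Memorizing hypothesis: returns the label of the nearest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

def error(hyp, xs, ys):
    """Fraction of instances the hypothesis misclassifies."""
    return sum(hyp(x) != y for x, y in zip(xs, ys)) / len(xs)

true_y = [true_label(x) for x in instances]
train_err_h = error(h, train_x, train_y)           # 0.0: memorizes everything
train_err_hp = error(h_prime, train_x, train_y)    # > 0: misses the noisy point
true_err_h = error(h, instances, true_y)           # > 0: the noise generalizes badly
true_err_hp = error(h_prime, instances, true_y)    # 0.0: matches the true concept

# h is better on the training data but worse on the full distribution,
# so h overfits in the sense of the definition:
print(train_err_h < train_err_hp, true_err_h > true_err_hp)  # prints: True True
```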

Statistics

In multiple regression, a model exhibits overfitting if it contains additional, irrelevant regressors (explanatory variables). Conversely, if relevant variables are disregarded, one speaks of underfitting.

The inclusion of additional regressors can never decrease the coefficient of determination R², which measures the goodness of fit of the model to the sample data. Through chance effects, even irrelevant regressors may contribute to explaining the variance and artificially increase the coefficient of determination.

Overfitting is to be assessed negatively, because the actual (lower) goodness of fit is obscured: the model is indeed fitted more closely to the sample data, but this fit does not transfer to the population. Regression coefficients wrongly appear non-significant, since their effects can no longer be estimated with sufficient precision. The estimators are inefficient, i.e. their variance is no longer minimal. At the same time, the risk grows that irrelevant variables appear statistically significant due to chance effects. Overfitting thus worsens the estimation properties of the model, in particular because each additional regressor reduces the number of degrees of freedom. Large differences between R² and the adjusted coefficient of determination indicate overfitting. Overfitting can be counteracted above all by sound theoretical considerations and by applying factor analysis.
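The relationship between R² and the adjusted coefficient of determination can be sketched as follows (the figures are invented purely for illustration): with n observations and k regressors, the adjusted R² penalizes each additional regressor, so a small, chance-driven increase in R² can go along with a decrease in adjusted R².

```python
def adjusted_r2(r2, n, k):
    """Adjusted coefficient of determination for n observations
    and k regressors (excluding the intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

n = 30
# Hypothetical model A: 2 relevant regressors.
r2_a = 0.80
# Hypothetical model B: 4 additional irrelevant regressors that,
# by chance, raise R^2 slightly.
r2_b = 0.82

print(round(adjusted_r2(r2_a, n, 2), 3))  # 0.785
print(round(adjusted_r2(r2_b, n, 6), 3))  # 0.773
# R^2 rose, but the adjusted R^2 fell: an indication of overfitting.
```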

Data sets and fitted models

First, the selection of the data set, in particular the number of observations, measurement points or samples, is an essential criterion for sound and successful modeling. Otherwise, the assumptions derived from these data permit no conclusions about reality. This applies in particular to statistical statements.

The maximum possible complexity of the model (without overfitting) is proportional to the representativeness of the training set and thus, for a given signal-to-noise ratio, also to its size. This also creates an interdependence with finite-sample bias, so that a training data collection that is as comprehensive and representative as possible is desirable.

In other words: anyone searching existing data for rules or trends must choose appropriate data. Someone who wants to make a statement about the most frequent letters of the German alphabet should not consider just a single sentence, especially if the letter "E" happens to be rare in it.
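The letter example can be made concrete (the sample sentence is the well-known German pangram, chosen here only as an unrepresentative single-sentence "data set"): in German text as a whole, "e" is by far the most frequent letter, but a single sentence can suggest otherwise.

```python
from collections import Counter

def letter_counts(text):
    """Count alphabetic characters, case-insensitively."""
    return Counter(c for c in text.lower() if c.isalpha())

# A single German sentence as an unrepresentative 'data set'.
sentence = "Franz jagt im Taxi quer durch Bayern"
counts = letter_counts(sentence)

print(counts["e"])  # 'e' occurs only twice in this sentence,
# although it is the most frequent letter in German text overall.
print(counts.most_common(1)[0][0])  # some other letter tops this sample
```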

Overfitting due to too much training

In computer-aided modeling, a second effect is added. Here a data model is fitted to existing training data over a number of training steps. For example, with a few dozen writing samples one can train a computer to recognize handwritten digits (0-9) and assign them correctly. The aim is also to be able to recognize the handwriting of persons whose samples were not included in the training set.

The following experience is often made: the recognition performance for digits written by unknown persons initially increases with the number of training steps. After a saturation phase, however, it decreases again, because the computer's data representation adapts too closely to the writing style of the training data and is no longer based on the underlying shapes of the digits to be learned. This process is at the core of the term overfitting, even if, as described above, the state of overfitting can have a number of causes.
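The behavior just described motivates early stopping: monitor the error on held-out data after each training step and stop once it no longer improves. A minimal sketch, using an invented sequence of validation errors in place of a real training run:

```python
def early_stopping_step(val_errors, patience=2):
    """Return the index of the training step at which training would be
    stopped: the step with the best validation error so far, once the
    error has failed to improve `patience` times in a row."""
    best_step, best_err, bad_steps = 0, float("inf"), 0
    for step, err in enumerate(val_errors):
        if err < best_err:
            best_step, best_err, bad_steps = step, err, 0
        else:
            bad_steps += 1
            if bad_steps >= patience:
                break
    return best_step

# Hypothetical validation-error curve: improves, saturates,
# then worsens again as the model starts to overfit.
curve = [0.90, 0.70, 0.55, 0.47, 0.44, 0.46, 0.50, 0.57]
print(early_stopping_step(curve))  # 4 (validation error 0.44)
```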

If the model is not intended to be used beyond the training set, i.e. if a model is sought only for a single, closed problem, overfitting is of course not an issue. An example would be the search for a computer model covering only the closed set of right-of-way traffic situations. Such models are significantly less complex than those above, and the rules are usually already known, so that programs written by humans are usually more efficient here than machine learning.

Cognitive analogy

Although an overfitted model may reproduce the training data correctly, it has, as it were, "learned them by heart". Generalization, which is equivalent to an intelligent classification, is no longer possible: the "memory" of the model is too large, so no rules need to be learned.

Strategies to avoid overfitting

As already mentioned, with parametric models it is advisable to aim for the smallest possible number of parameters. With non-parametric methods, it is advisable to limit the number of degrees of freedom in an analogous way from the start. In a multi-layer perceptron this would mean, for example, restricting the size of the hidden layer(s). In complex cases, the number of required parameters or degrees of freedom can also be reduced by transforming the data before the actual classification/regression step. In particular, methods for dimensionality reduction may be useful (principal component analysis, independent component analysis, etc.).

To at least recognize training-duration-dependent overfitting in machine learning (and thus possibly be able to react to it), data sets are often split not just two ways into a training and a validation set; instead, a three-way split is made, for example. The resulting sets are then used exclusively for training, for "real-time monitoring" of the out-of-sample error (and possibly for aborting the training if it rises), and for the final assessment of model quality, respectively.
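The three-way split can be sketched as follows (the split fractions and the seed are arbitrary choices for illustration):

```python
import random

def three_way_split(data, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle the data reproducibly and split it into
    training, validation and test sets."""
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)
    n_train = int(fractions[0] * len(data))
    n_val = int(fractions[1] * len(data))
    train = [data[i] for i in indices[:n_train]]
    val = [data[i] for i in indices[n_train:n_train + n_val]]
    test = [data[i] for i in indices[n_train + n_val:]]
    return train, val, test

samples = list(range(100))
train, val, test = three_way_split(samples)
print(len(train), len(val), len(test))  # 60 20 20
```

The training set is used to fit the model, the validation set to monitor the out-of-sample error during training (and possibly to abort it), and the test set only once, for the final quality assessment.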

Examples

A military image-recognition program intended to detect camouflaged tanks in photos worked perfectly on the training data, but not on new test photos. The reason: the training photos with tanks had been taken at a different position of the sun than those without tanks; the program had attached excessive importance to the actually irrelevant position of the sun.
