Imputation (statistics)

The term imputation covers the methods of mathematical statistics with which missing data in statistical surveys (so-called non-response) are filled in within the data matrix. This reduces the bias caused by the non-response (non-response bias).

General

Imputation is one of the so-called missing-data techniques, i.e. methods that are applied when evaluating incomplete sample data sets. This problem is relatively common in surveys and other data collections, for example when some respondents deliberately do not answer certain questions, or fail to answer them because of a lack of knowledge or insufficient motivation. Incomplete data can also result from technical breakdowns or data loss.

Besides imputation, the usual missing-data techniques include above all the so-called elimination method (also: complete-case analysis). Here, all records in which one or more survey characteristics have missing values are deleted from the data matrix, so that a complete data matrix remains for evaluation. This method is very simple but has considerable drawbacks: in particular, with a larger number of item non-responses it causes a significant loss of information. Furthermore, this technique can bias the remaining sample if the mechanism of data loss depends on the values of the incompletely collected characteristic. A common example is surveys about income, where people with a relatively high income may be particularly reluctant to state it, so that missing data tend to occur in exactly these cases. To bring this problem under control as far as possible, imputation methods have been developed that do not simply ignore missing data but instead replace them with plausible values, which can be estimated, among other things, from the observed values of the same data set.
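As a minimal sketch of complete-case analysis, the following Python snippet drops every record that contains at least one missing value. The data, the use of `None` as the marker for a missing value, and the function name `complete_cases` are illustrative assumptions, not part of any particular statistics package.

```python
# Hypothetical data matrix: each row is (age, income); None marks an item non-response.
records = [
    (34, 2800),
    (51, None),
    (29, 2100),
    (None, 3900),
    (45, 5200),
]

def complete_cases(rows):
    """Keep only the rows in which no value is missing."""
    return [row for row in rows if all(value is not None for value in row)]

complete = complete_cases(records)
lost = len(records) - len(complete)
print(complete)
print(f"{lost} of {len(records)} records discarded")
```

Even in this small example, two of five records are lost although each of them still carries one observed value, which illustrates the loss of information mentioned above.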

Selected imputation methods

There are a number of methods by which missing values can be completed. A rough distinction is made between single (singular) and multiple imputation. In the former, each missing value is replaced by exactly one estimate, while in multiple imputation several values are estimated for each item non-response, usually by simulation on the basis of one or more distribution models.

Single imputation

Substitution by measures of central tendency

One of the simplest imputation methods is to replace all missing values of a survey characteristic by an empirical measure of central tendency of the observed values: usually the mean or, for non-quantitative characteristics, the median or the mode. However, like the elimination method, this method introduces bias if the data loss depends on the value of the characteristic in question. Furthermore, the resulting sample has a systematically underestimated standard deviation, since the imputed values are constant and therefore show no scatter. These problems can be partly mitigated if the method is not applied uniformly to the entire sample but separately within classes, into which the records are grouped according to the values of a particular, fully observed characteristic. A class mean is then calculated separately for each of these classes, and the missing values within the class are replaced by it.
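The following Python sketch illustrates both variants: replacement by the overall mean and replacement by class means. The data and the function names `impute_mean` and `impute_class_mean` are hypothetical, and the sketch assumes that every class contains at least one observed value.

```python
import statistics

def impute_mean(values):
    """Replace each None by the mean of all observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]

def impute_class_mean(values, classes):
    """Replace each None by the mean of the observed values in the same class."""
    by_class = {}
    for v, c in zip(values, classes):
        if v is not None:
            by_class.setdefault(c, []).append(v)
    class_means = {c: statistics.mean(vs) for c, vs in by_class.items()}
    return [class_means[c] if v is None else v for v, c in zip(values, classes)]

# Illustrative data: income with two gaps, and a fully observed grouping variable.
income = [2000, 2400, None, 5000, 5600, None]
group = ["a", "a", "a", "b", "b", "b"]

observed = [v for v in income if v is not None]
overall = impute_mean(income)
per_class = impute_class_mean(income, group)

print(statistics.stdev(observed))  # spread of the observed values only
print(statistics.stdev(overall))   # smaller, because the imputed values add no scatter
print(per_class)
```

Comparing the two printed standard deviations shows the underestimation described above; the class-wise variant at least keeps the imputed values close to the level of their own class.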

Substitution by a ratio estimator

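Substitution by a ratio estimator is a relatively simple procedure that attempts to exploit a possible functional relationship between two sample characteristics, one of which was completely observed, when estimating the imputation values. Let X and Y be two random variables collected in a sample of size n, where X was collected completely and the Y value is available for only $n_Y$ of the n objects (numbered, without loss of generality, so that these are the first $n_Y$ objects). Each of the $n - n_Y$ missing Y values can then be estimated by a ratio estimator:

$$\hat{y}_i = \frac{\bar{y}}{\bar{x}} \, x_i \quad \text{for all } i = n_Y + 1, \dots, n$$

Here

$$\bar{y} = \frac{1}{n_Y} \sum_{j=1}^{n_Y} y_j \quad \text{and} \quad \bar{x} = \frac{1}{n_Y} \sum_{j=1}^{n_Y} x_j$$

or, equivalently,

$$\hat{y}_i = \frac{\sum_{j=1}^{n_Y} y_j}{\sum_{j=1}^{n_Y} x_j} \, x_i$$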

It should be noted that this estimator is reasonably applicable only in special cases, usually when a strong correlation can be assumed between X and Y.
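A minimal Python sketch of this idea follows. It assumes the common form of the ratio estimator, in which a missing Y value is estimated as the sum of the observed Y values divided by the sum of the corresponding X values, multiplied by the object's own X value. The function name `ratio_impute` and the data are hypothetical.

```python
def ratio_impute(x, y):
    """Impute missing y values as (sum of observed y / sum of matching x) * x."""
    # Use only the objects for which both x and y were observed.
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    ratio = sum(yi for _, yi in pairs) / sum(xi for xi, _ in pairs)
    return [ratio * xi if yi is None else yi for xi, yi in zip(x, y)]

# Illustrative data: y is roughly proportional to x, the last two y values are missing.
x = [10, 20, 30, 40, 50]
y = [21, 39, 60, None, None]
print(ratio_impute(x, y))
```

Because the estimate is a pure multiple of x, the approach only makes sense when Y is roughly proportional to X, which is in line with the caveat above.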

Hot-deck and cold-deck techniques

The methods referred to as hot deck or cold deck all share the feature that missing sample values are replaced by observed values of the same characteristic. They differ only in how the imputation values are determined. While cold-deck techniques use estimates from other surveys (for example from historical, "cold" surveys), the considerably more common hot-deck methods use the current data matrix. Deck techniques are usually applied within imputation classes, i.e. classes into which the records can be grouped according to the values of a fully observed characteristic.

A well-known hot-deck method is the so-called sequential or traditional hot deck. The procedure is as follows: in the incomplete data matrix, a starting imputation value is first set within each imputation class for each incompletely observed variable. Sequential methods differ in how the starting values are determined; conceivable choices are, for example, the mean of the observed values in the class, a randomly drawn value from the class, or a cold-deck estimate. Once the starting values are set, all elements of the data matrix are processed in order. If the value for an object is present, it becomes the new imputation value for that characteristic in the same imputation class; otherwise, the current imputation value for that characteristic is put in place of the missing value. This is repeated for all elements of the data matrix until it has no more gaps.
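The sequential procedure described above can be sketched in Python for a single variable. The sketch uses the class mean of the observed values as the starting value, which is only one of the options mentioned; the function name `sequential_hot_deck` and the data are hypothetical, and every class is assumed to contain at least one observed value.

```python
import statistics

def sequential_hot_deck(values, classes):
    """Fill None entries with the most recently seen observed value of the same class."""
    # Starting values: here the class mean of the observed values (one of several options).
    observed = {}
    for v, c in zip(values, classes):
        if v is not None:
            observed.setdefault(c, []).append(v)
    current = {c: statistics.mean(vs) for c, vs in observed.items()}

    completed = []
    for v, c in zip(values, classes):
        if v is not None:
            current[c] = v                # observed value becomes the new imputation value
            completed.append(v)
        else:
            completed.append(current[c])  # gap is filled with the current imputation value
    return completed

# Illustrative data: one variable with gaps and a fully observed class variable.
values = [None, 5, None, 7, None, 2, None]
classes = ["a", "a", "a", "b", "b", "a", "b"]
print(sequential_hot_deck(values, classes))
```

Note that the result depends on the order of the records: a gap is always filled from the nearest preceding donor in its class, and the starting value is used only for gaps that come before the first observed value.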

Regression method

The imputation procedures based on regression analysis all have in common that they attempt to exploit functional relationships between two or more sample characteristics when estimating missing values. The imputations by the sample mean or by a ratio estimator described above are likewise simplified forms of regression imputation. In general, both the number of characteristics included and the regression method used can vary. For quantitative characteristics, linear regression by the least squares method is often used. Let X and Y be two random variables collected together in a sample of size n, and let Y have been observed only $n_Y$ times (for the first $n_Y$ objects, without loss of generality). If a correlation is assumed between the two variables, a regression equation of Y on X of the following form can be calculated from the observed (x, y) value pairs:

$$\hat{y}_i = \hat{\alpha} + \hat{\beta} \, x_i \quad \text{for all } i = n_Y + 1, \dots, n$$

Here $\hat{\alpha}$ and $\hat{\beta}$ are the regression coefficients, estimated from the $n_Y$ observed (x, y) value pairs by their least squares estimators:

$$\hat{\beta} = \frac{\sum_{j=1}^{n_Y} (x_j - \bar{x})(y_j - \bar{y})}{\sum_{j=1}^{n_Y} (x_j - \bar{x})^2}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta} \, \bar{x}$$

where $\bar{x}$ and $\bar{y}$ denote the means of x and y over the $n_Y$ observed pairs.

Regression estimation with more than one regressor, the so-called multiple (or multivariate) regression, is carried out analogously but is computationally more intensive because of the larger amount of data involved. It is implemented by default in statistical software packages such as SPSS.
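For the simple case with a single regressor, regression imputation can be sketched in plain Python as follows. The sketch fits an ordinary least-squares line to the completely observed (x, y) pairs and evaluates it at the x values whose y value is missing. The function name `regression_impute` and the data are hypothetical.

```python
def regression_impute(x, y):
    """Impute missing y values with a least-squares line fitted to the observed pairs."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n_obs = len(pairs)
    x_bar = sum(xi for xi, _ in pairs) / n_obs
    y_bar = sum(yi for _, yi in pairs) / n_obs
    # Least-squares estimates of the slope (beta) and the intercept (alpha).
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in pairs)
    s_xx = sum((xi - x_bar) ** 2 for xi, _ in pairs)
    beta = s_xy / s_xx
    alpha = y_bar - beta * x_bar
    return [alpha + beta * xi if yi is None else yi for xi, yi in zip(x, y)]

# Illustrative data: the observed pairs lie exactly on y = 1 + 2x.
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, None, None]
print(regression_impute(x, y))
```

As with mean imputation, all imputed values lie exactly on the fitted line, so the scatter of Y around the line is not reproduced in the completed data.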

If an incompletely observed characteristic is not quantitative, it cannot be estimated by linear regression. However, special regression techniques exist for certain categorical variables, of which logistic regression is the best known.

Multiple imputation

Multiple imputation is a comparatively sophisticated missing-data procedure. In essence, "multiple" means that the procedure provides several estimates for each missing value, in several imputation steps. These can then be averaged into a single estimate, or a new completed data matrix can be set up for each imputation step. A common way of determining the estimates is to simulate from a multivariate distribution model that is considered plausible. For example, if the two random variables X and Y are assumed to be jointly normally distributed with fixed parameters, then for value pairs with an observed X value and a missing Y value the conditional distribution of Y given the observed X value can be derived; in this simple case it is a univariate normal distribution. The possible imputation values for each missing Y value can then be generated by repeatedly simulating from the respective distribution.
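The bivariate normal example can be sketched in Python. For jointly normal X and Y with means $\mu_X, \mu_Y$, standard deviations $\sigma_X, \sigma_Y$ and correlation $\rho$, the conditional distribution of Y given X = x is normal with mean $\mu_Y + \rho \frac{\sigma_Y}{\sigma_X}(x - \mu_X)$ and variance $\sigma_Y^2 (1 - \rho^2)$. The parameter values, the function name `draw_imputations` and the number of draws below are purely illustrative assumptions. The parameters are treated as fixed and known, as in the text; in practice they would have to be estimated, and their uncertainty is not reflected here.

```python
import random

def draw_imputations(x, mu_x, mu_y, sigma_x, sigma_y, rho, m, rng):
    """Draw m imputed Y values from the conditional normal distribution of Y given X = x."""
    cond_mean = mu_y + rho * sigma_y / sigma_x * (x - mu_x)
    cond_sd = sigma_y * (1 - rho ** 2) ** 0.5
    return [rng.gauss(cond_mean, cond_sd) for _ in range(m)]

rng = random.Random(0)  # fixed seed so that the sketch is reproducible
# Assumed (fixed) parameters of the joint normal model; purely illustrative.
params = dict(mu_x=50.0, mu_y=100.0, sigma_x=10.0, sigma_y=20.0, rho=0.8)

x_with_missing_y = [40.0, 65.0]  # observed X values whose Y value is missing
m = 5                            # number of imputations per missing value
for x in x_with_missing_y:
    draws = draw_imputations(x, m=m, rng=rng, **params)
    # Print the individual draws and their average as a single pooled estimate.
    print(x, [round(d, 1) for d in draws], round(sum(draws) / m, 1))
```

Each of the m draws could alternatively be written into its own completed copy of the data matrix, giving m completed data sets instead of one averaged value per gap.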
