Data cleansing

Data cleansing (also data cleaning or data editing) comprises various methods for removing and correcting data errors in databases or other information systems. The errors can consist, for example, of incorrect (originally wrong or outdated), redundant, inconsistent, or incorrectly formatted data.

Essential steps in data cleansing are duplicate detection (the detection and merging of identical records) and data fusion (the merging and completion of incomplete data).

Data cleansing contributes to improving information quality. However, information quality also depends on many other characteristics of data sources (credibility, relevance, availability, cost, ...) that cannot be improved by data cleansing.

Process for data cleansing

The data cleansing process is divided into two consecutive steps (Apel, 2009, p. 157):

Standardize data before corrections

For a successful cleansing, the data must first be standardized. To this end, they are first structured and then normalized.

Structuring brings the data into a uniform format: a date, for example, is converted into a uniform date format (01/09/2009), or composite data are broken down into their components, such as a customer's name into the components salutation, title, first name, and last name. Such structuring is usually not trivial and is carried out with the help of complex parsers.
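As an illustration, the following Python sketch structures a date and a composite name; the accepted input formats, the salutation list, and the title list are assumptions made for this example, and real-world structuring usually requires far more elaborate parsers.

```python
# Illustrative sketch only: input formats and keyword lists are assumptions.
from datetime import datetime

def structure_date(raw: str) -> str:
    """Convert a date given in one of several assumed input formats to DD/MM/YYYY."""
    for fmt in ("%d/%m/%Y", "%d.%m.%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%d/%m/%Y")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")

def structure_name(raw: str) -> dict:
    """Split a composite customer name into salutation, title, first and last name."""
    parts = raw.split()
    record = {"salutation": "", "title": "", "first_name": "", "last_name": ""}
    if parts and parts[0] in {"Mr.", "Ms.", "Mrs."}:   # assumed salutations
        record["salutation"] = parts.pop(0)
    if parts and parts[0] in {"Dr.", "Prof."}:         # assumed titles
        record["title"] = parts.pop(0)
    if parts:
        record["last_name"] = parts.pop()
        record["first_name"] = " ".join(parts)
    return record

print(structure_date("1.9.2009"))          # -> 01/09/2009
print(structure_name("Ms. Dr. Jane Doe"))
# -> {'salutation': 'Ms.', 'title': 'Dr.', 'first_name': 'Jane', 'last_name': 'Doe'}
```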

In normalization, the existing values are mapped onto a normalized list of values. This normalization can be performed, for example, for addresses, academic titles, or company legal-form suffixes. For instance, the suffixes "e. Kfr." and "e. Kfm." can be replaced by the normalized value "e. K.", which greatly simplifies the subsequent cleansing.
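A minimal sketch of such a value-list mapping in Python; the lookup table below contains only a few assumed entries:

```python
# Illustrative sketch only: the normalized value list is an assumed example.
NORMALIZED_LEGAL_FORMS = {
    "e. kfr.": "e. K.",
    "e. kfm.": "e. K.",
    "e.k.":    "e. K.",
    "gmbh":    "GmbH",
}

def normalize_legal_form(raw: str) -> str:
    """Map a free-form legal-form suffix onto the normalized value list."""
    return NORMALIZED_LEGAL_FORMS.get(raw.strip().lower(), raw)

print(normalize_legal_form("e. Kfm."))  # -> e. K.
print(normalize_legal_form("e. Kfr."))  # -> e. K.
```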

Clean up data

For cleansing the data, six methods are available, which can be applied individually or in combination (a small sketch of two of them follows the list):

  • Deriving from other data: the correct values are derived from other data (e.g., the salutation from the first name).
  • Replacing with other data: the erroneous data are replaced with data from other sources (e.g., from other systems).
  • Using default values: default values are used in place of the erroneous data.
  • Removing erroneous data: the erroneous data are filtered out and not processed further.
  • Removing duplicates: duplicates are identified by duplicate detection, the non-redundant data from the duplicates are consolidated, and a single record is formed from them.
  • Unraveling summaries: in contrast to the removal of duplicates, incorrectly merged data are separated again here.
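The following Python sketch applies two of these methods, using default values and removing duplicates, to a small list of customer records; the field names, the default value, and the duplicate key are assumptions made for this example.

```python
# Illustrative sketch only: field names, default value, and duplicate key are assumed.
records = [
    {"id": 1, "name": "Jane Doe", "country": None},
    {"id": 2, "name": "Jane Doe", "country": "DE"},   # duplicate of record 1
    {"id": 3, "name": "John Roe", "country": "AT"},
]

# Use default values: replace missing countries with an assumed default.
for record in records:
    if record["country"] is None:
        record["country"] = "DE"

# Remove duplicates: detect duplicates via the (assumed) key "name", then
# consolidate the non-redundant field values into a single record.
consolidated = {}
for record in records:
    key = record["name"]
    if key not in consolidated:
        consolidated[key] = dict(record)
    else:
        for field, value in record.items():
            if not consolidated[key].get(field):
                consolidated[key][field] = value

print(list(consolidated.values()))
# -> [{'id': 1, 'name': 'Jane Doe', 'country': 'DE'},
#     {'id': 3, 'name': 'John Roe', 'country': 'AT'}]
```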

Storage of incorrect data

After the data have been cleansed, the original, erroneous data should by no means simply be deleted. Otherwise, the corrections would not be traceable, and such a process would not be audit-proof.

One alternative is to store the corrected value in an additional column. Since this requires additional storage space, this approach is recommended only when few columns of a record have to be corrected. Another possibility is to store the corrected record in an additional row, which can increase the storage requirements even more; it is therefore only appropriate when a small number of records have to be corrected. The last possibility, suitable when a large number of rows and columns have to be corrected, is to create a separate table.
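A minimal sketch of the first and the last storage variant, using an in-memory SQLite database with assumed table and column names:

```python
# Illustrative sketch only: table and column names are assumed.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, city TEXT)")
conn.execute("INSERT INTO customer VALUES (1, 'Muenchen')")

# Variant 1: keep the corrected value in an additional column of the same table.
conn.execute("ALTER TABLE customer ADD COLUMN city_corrected TEXT")
conn.execute("UPDATE customer SET city_corrected = 'München' WHERE id = 1")

# Variant 2: keep corrections in a separate table, which scales better when
# many rows and columns have to be corrected.
conn.execute("""CREATE TABLE correction (
    record_id INTEGER, column_name TEXT, old_value TEXT, new_value TEXT)""")
conn.execute("INSERT INTO correction VALUES (1, 'city', 'Muenchen', 'München')")

print(conn.execute("SELECT * FROM customer").fetchall())
# -> [(1, 'Muenchen', 'München')]
```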

References

  • Detlef Apel, Wolfgang Behme, Rüdiger Eberlein, Christian Merighi: Datenqualität erfolgreich steuern. Hanser, 2009, ISBN 978-3-446-42056-4.