Data mining

Data mining [deɪtə maɪnɪŋ] (literally "mining data", in the sense of extracting something valuable from a mountain of data) refers to the systematic application of statistical methods to a data set with the aim of identifying new patterns. It is particularly concerned with very large data sets that can no longer be processed manually, which calls for efficient methods whose time complexity makes them suitable for data of that size. The methods are, however, also applied to smaller amounts of data. In practice, especially in German usage, the English term "data mining" has become established for the entire process of "Knowledge Discovery in Databases" (KDD), which also includes steps such as preprocessing, whereas data mining in the strict sense refers only to the analysis step of that process.

The mere collection, storage and processing of large amounts of data is sometimes also labeled with the buzzword data mining, which is misleading. Used correctly, the term refers to the extraction of knowledge that is "valid (in the statistical sense), previously unknown and potentially useful", "in order to determine certain regularities, patterns and hidden relationships". Fayyad defines it as "a step in the KDD process that consists of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns (or models) over the data".


Distinction from other disciplines

Many of the methods used in data mining actually originate in statistics, especially multivariate statistics, and are often only adapted in their complexity for use in data mining, frequently by approximating them at the expense of accuracy. The loss of accuracy is often accompanied by a loss of statistical validity, so that from a purely statistical point of view a method may sometimes even be "wrong". For use in data mining, however, experimentally verified usefulness and an acceptable running time are often more decisive than statistically proven correctness.

Machine learning is also closely related. In data mining, however, the focus is on finding new patterns, whereas machine learning is primarily about having the computer automatically recognize known patterns in new data. A simple separation is not always possible, though: if, for example, association rules are extracted from the data, this is a process that matches typical data mining tasks, but the extracted rules also serve the goals of machine learning. Conversely, the subfield of unsupervised learning within machine learning is very closely related to data mining. Machine learning methods are often applied in data mining, and vice versa.

Research on database systems, in particular on index structures, plays a major role in data mining when it comes to reducing complexity. Typical tasks such as nearest neighbor search can be accelerated considerably with a suitable database index, which improves the running time of a data mining algorithm.

Information retrieval (IR) is another field that benefits from the findings of data mining. Put simply, it deals with the computer-aided search for complex content, but also with the presentation of the results to the user. Data mining methods such as cluster analysis are used here to improve search results and their presentation, for example by grouping similar search results. Text mining and web mining are two specializations of data mining that are closely linked to information retrieval.

Data collection, i.e. the systematic recording of information, is an important prerequisite for obtaining valid results with data mining. If the data were collected in a statistically unsound way, they may contain a systematic error, which is then found in the data mining step. The result may then not reflect the observed objects but be caused by the way in which the data were collected.

German term

An established German translation of the term data mining does not exist.

There have been several attempts to find a German name. The Duden does not use any of them, however, and instead lists the Germanized spelling "Data-Mining" in place of the English "data mining". Proposals for a German equivalent include "Datenmustererkennung" ("data pattern recognition", although the point is not to recognize existing patterns but to find new ones) and "Datenschürfung" (an attempt to translate the word literally that completely ignores the meaning). The Duden dictionary of foreign words gives "Datenförderung" as a literal translation, but marks it as an unsuitable one. A targeted call for proposals by the German journal for artificial intelligence (Zeitschrift für Künstliche Intelligenz) did not produce any convincing suggestions either. None of these terms has achieved significant circulation, often because certain aspects of the topic, such as knowledge discovery, are lost and false associations arise, for instance with pattern recognition in the sense of image recognition.

Anyone who wants to use a German term can fall back on "Wissensentdeckung in Datenbanken" (for the English "Knowledge Discovery in Databases"), which covers the entire process, including the data mining step.

Data mining process

Data mining is the actual analysis step of the Knowledge Discovery in Databases process. The steps of this iterative process are roughly:

  • Focusing: data collection and selection, but also determining what is already known
  • Preprocessing: data cleansing, in which sources are integrated and inconsistencies are eliminated, for example by removing or completing incomplete records
  • Transformation into the format suitable for the analysis step, for example by selecting attributes or discretizing values
  • Data mining, the actual analysis step
  • Evaluation of the patterns found by experts and checking whether the goals have been achieved

In further iterations, knowledge that has already been found can be used ("integrated into the process") to obtain additional or more precise results in a new run.
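The preprocessing, transformation and analysis steps can be sketched as a small pipeline. The following is only an illustration under stated assumptions: the records, the attribute names "age" and "product", the interval bounds and all function names are hypothetical, dropping incomplete records is just one possible cleansing strategy, and the "mining" step is a simple frequency count standing in for a real analysis method.

```python
from collections import Counter

def preprocess(records):
    """Data cleansing: drop records that contain missing (None) values."""
    return [r for r in records if all(v is not None for v in r.values())]

def discretize(value, bounds, labels):
    """Return the label of the first interval whose upper bound is not exceeded."""
    for bound, label in zip(bounds, labels):
        if value <= bound:
            return label
    return labels[-1]  # value lies above all bounds

def transform(records):
    """Select two attributes and discretize the numeric one."""
    return [
        (discretize(r["age"], [29, 59], ["young", "middle", "senior"]), r["product"])
        for r in records
    ]

def mine(rows):
    """Stand-in analysis step: count how often each combination occurs."""
    return Counter(rows)

# Hypothetical raw data; the fourth record is incomplete.
raw = [
    {"age": 23, "product": "music"},
    {"age": 25, "product": "music"},
    {"age": 41, "product": "books"},
    {"age": None, "product": "music"},
    {"age": 67, "product": "books"},
]

cleaned = preprocess(raw)
patterns = mine(transform(cleaned))
print(patterns.most_common(1))
```

The evaluation step is deliberately left out, since judging whether a pattern is new and useful is the task of a domain expert.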

Tasks of data mining

Typical tasks of data mining are:

  • Outlier detection: identification of unusual records: outliers, errors, changes
  • Cluster analysis: grouping objects on the basis of similarities
  • Classification: elements not yet assigned to a class are assigned to the existing classes
  • Association analysis: identification of relationships and dependencies in the data in the form of rules such as "A and B usually imply C"
  • Regression analysis: identification of relationships between (several) dependent and independent variables
  • Summarization: reduction of the data set to a more compact description without significant loss of information

These tasks can be roughly divided into observation problems (outlier detection, cluster analysis) and prediction problems (classification, regression analysis).

Outlier detection

This task searches for data objects that are inconsistent with the rest of the data, for example because they have unusual attribute values or deviate from a general trend. The Local Outlier Factor method, for example, looks for objects whose density differs markedly from that of their neighbors; this is called "density-based outlier detection".

Identified outliers are often verified manually afterwards and removed from the data set, since they can degrade the results of other methods. In some applications, such as fraud detection, however, it is precisely the outliers that are the interesting objects.
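The idea behind a density-based outlier score can be sketched in a few lines. The following is a simplified take on the Local Outlier Factor, not a reference implementation: it uses exactly k neighbors per point (ties are broken by index), assumes there are no duplicate points, and runs in quadratic time without any index structure. The sample points and the choice k = 2 are hypothetical.

```python
import math

def knn(points, i, k):
    """Return the k nearest neighbors of points[i] as (distance, index) pairs."""
    dists = sorted((math.dist(points[i], q), j)
                   for j, q in enumerate(points) if j != i)
    return dists[:k]

def local_outlier_factors(points, k):
    n = len(points)
    neighbors = [knn(points, i, k) for i in range(n)]
    # k-distance: distance from each point to its k-th nearest neighbor
    k_dist = [nb[-1][0] for nb in neighbors]
    # local reachability density: inverse of the mean reachability distance,
    # where reach-dist(p, o) = max(k-distance(o), d(p, o))
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], d) for d, j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))
    # score: mean density of the neighbors relative to the point's own density
    return [sum(lrd[j] for _, j in neighbors[i]) / (k * lrd[i])
            for i in range(n)]

# Five points close together and one far away (hypothetical data).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (8, 8)]
scores = local_outlier_factors(points, k=2)
for p, s in zip(points, scores):
    print(p, round(s, 2))
```

Scores close to 1 indicate a density similar to that of the neighbors; markedly larger scores indicate a point in a much sparser region than its neighbors.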

Cluster analysis

Cluster analysis is about identifying groups of objects that are in some way more similar to each other than to other groups. Often these are accumulations in the data space, which is where the term cluster comes from. In density-connected cluster analysis, as in DBSCAN or OPTICS, however, the clusters can take arbitrary shapes. Other methods such as the EM algorithm or the k-means algorithm prefer spherical clusters.

Objects that were not assigned to any cluster can be interpreted as outliers in the sense of the outlier detection described above.
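As an illustration of a method that favors compact, roughly spherical clusters, here is a minimal sketch of k-means in the form of Lloyd's algorithm. The initial centers are passed in explicitly to keep the example deterministic; real implementations usually choose them randomly or with a seeding heuristic. The data points are hypothetical.

```python
import math

def k_means(points, centers, iterations=100):
    """Lloyd's algorithm: alternate between assignment and mean update."""
    for _ in range(iterations):
        # assignment step: each point goes to its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # update step: move each center to the mean of its cluster;
        # an empty cluster keeps its old center
        new_centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, clusters = k_means(points, centers=[points[0], points[3]])
print(centers)
print(clusters)
```

Because every point is assigned to its nearest center, k-means never leaves a point unassigned; unlike DBSCAN it therefore has no built-in notion of noise.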

Classification

Classification, like cluster analysis, is about assigning objects to groups (here called classes). In contrast to cluster analysis, however, the classes are usually predefined (for example: bicycles, cars), and machine learning methods are used to assign previously unassigned objects to these classes.
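One of the simplest such methods is the k-nearest-neighbor classifier, sketched below for the bicycle/car example. This is only one of many possible classifiers; the two features (number of wheels, weight in kilograms) and all values are hypothetical, and the features are left unscaled for brevity, so the weight dominates the distance.

```python
import math
from collections import Counter

def classify(training, query, k=3):
    """Assign the majority class among the k nearest training objects."""
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: (wheels, weight in kg) with a known class.
training = [
    ((2, 12.0), "bicycle"), ((2, 15.5), "bicycle"), ((2, 9.8), "bicycle"),
    ((4, 1200.0), "car"), ((4, 1650.0), "car"), ((4, 980.0), "car"),
]

print(classify(training, (2, 11.0)))
print(classify(training, (4, 1400.0)))
```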

Association analysis

In association analysis, frequent relationships are sought in the data records and usually formulated as inference rules. A popular (though apparently fictional) example, which was mentioned, among other places, in the television series Numb3rs, is the following: market basket analysis found that the product categories "diapers" and "beer" are bought together more often than average, usually presented as the inference rule "customer buys diapers → customer buys beer". The interpretation of this result was that men who are sent by their wives to buy diapers like to pick up a beer as well. By placing the beer shelf on the way from the diapers to the checkout, beer sales were supposedly increased further.
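How strongly a rule such as "diapers → beer" is backed by the data is commonly expressed with the measures support, confidence and lift. The sketch below computes them on a handful of invented shopping baskets; the data are hypothetical and serve only to show the arithmetic.

```python
def support(transactions, items):
    """Fraction of transactions that contain all the given items."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Confidence relative to the overall frequency of the consequent."""
    return (confidence(transactions, antecedent, consequent)
            / support(transactions, consequent))

# Hypothetical market baskets.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"diapers", "beer", "milk"},
    {"bread"},
    {"milk", "chips"},
]

print(support(baskets, {"diapers", "beer"}))       # 0.375
print(confidence(baskets, {"diapers"}, {"beer"}))  # 0.75
print(lift(baskets, {"diapers"}, {"beer"}))        # 1.5
```

A lift above 1 means the two categories occur together more often than would be expected if they were bought independently, which is what "more often than average" refers to.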

Regression analysis

In regression analysis, the statistical relationship between different attributes is modeled. Among other things, this allows missing attribute values to be predicted, and deviations from the model to be analyzed in a way analogous to outlier detection. If findings from cluster analysis are used and separate models are computed for each cluster, better predictions can typically be made. If a strong relationship is found, this knowledge can also be put to good use for summarization.
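A minimal sketch of both uses, assuming a simple linear relationship between two attributes: an ordinary least squares line is fitted, used to predict a missing value, and the residuals are computed as a basis for spotting deviations. The data values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical records with both attributes known.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(xs, ys)

# Predict the missing y value of a record with x = 6.
predicted = slope * 6 + intercept

# Residuals: large absolute values point to records that deviate from the trend.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

print(round(slope, 3), round(intercept, 3), round(predicted, 3))
```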

Summarization

Since data mining is often applied to large and complex data sets, an important task is also to reduce these data to an amount the user can manage. In particular, outlier detection identifies individual objects that may be important; cluster analysis identifies groups of objects for which it is often sufficient to examine only a sample, which significantly reduces the number of data objects to be examined. Regression analysis makes it possible to remove redundant information and thus reduces the complexity of the data. Classification, association analysis and regression analysis (and to some extent cluster analysis) also provide more abstract models of the data.

With the help of these approaches, both the analysis of the data and, for example, their visualization are simplified (through sampling and lower complexity).

Specializations

While most data mining methods try to handle data that is as general as possible, there are also specializations for particular types of data.

Text mining

Text mining is concerned with the analysis of large text collections. It can be used, for example, to detect plagiarism or to classify texts.

Web mining

Web mining is about the analysis of distributed data as represented by web pages. To detect clusters and outliers, not only the pages themselves but in particular the relationships between the pages (hyperlinks) are considered. The constantly changing content and the lack of guaranteed availability of the data pose additional challenges. This topic is also closely linked to information retrieval.

Time series analysis

In time series analysis, temporal aspects and relationships play a major role. Existing data mining methods can be used here with special distance functions such as the dynamic time warping distance, but specialized methods are also being developed. An important challenge is to identify series with a similar course even when that course is slightly shifted in time but still has similar characteristics.
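The effect of such a distance function can be shown with a basic dynamic time warping sketch. This version uses the absolute difference as local cost and no warping window, which is the textbook form rather than an optimized one; the two series are hypothetical and differ only by a shift of one time step.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance with absolute difference as local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j]: cheapest alignment of the first i values of a
    # with the first j values of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # repeat b[j-1]
                                 cost[i][j - 1],      # repeat a[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]

def lockstep_distance(a, b):
    """Point-by-point comparison without any warping, for contrast."""
    return sum(abs(x - y) for x, y in zip(a, b))

series_a = [0, 0, 1, 2, 1, 0, 0]
series_b = [0, 1, 2, 1, 0, 0, 0]  # same shape, one step earlier

print(lockstep_distance(series_a, series_b))  # 4
print(dtw_distance(series_a, series_b))       # 0.0
```

The point-by-point comparison penalizes the shift, while dynamic time warping aligns the two peaks and reports the series as identical in shape.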

Problems of data mining

Data defects

Many of the problems in data mining stem from insufficient preprocessing of the data or from systematic errors and bias in their collection. These problems are often statistical in nature and must be solved at collection time: representative results cannot be obtained from unrepresentative data. Similar considerations apply here as when drawing a representative sample.

Parameterization

The algorithms used in data mining often have several parameters that need to be chosen appropriately. They deliver valid results for any parameter setting, and choosing the parameters so that the results are also useful is the task of the user. If, for example, the parameters minPts and ε of the cluster analysis algorithm DBSCAN are chosen small, the algorithm finds a finely resolved structure but also tends to split clusters into small pieces. If the parameters are chosen larger, only the main clusters are found, which may already be known and are therefore not helpful either. More advanced methods often have fewer parameters, or their parameters are easier to choose. OPTICS, for example, is a further development of DBSCAN that largely eliminates the parameter ε.
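The sensitivity to parameters can be made concrete with a compact DBSCAN sketch. DBSCAN has two parameters, commonly called ε (the neighborhood radius, `eps` below) and minPts (the minimum neighborhood size, `min_pts` below). The code follows the usual textbook formulation with a linear scan instead of an index; the data set of three small groups on a line is hypothetical.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point, not expanded further
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # core point: keep expanding
                seeds.extend(j_neighbors)
    return labels

def count_clusters(labels):
    return len(set(labels) - {NOISE})

xs = [0.0, 0.5, 1.0, 2.2, 2.7, 3.2, 10.0, 10.5, 11.0]
points = [(x, 0.0) for x in xs]

for eps in (0.3, 0.6, 1.5, 8.0):
    labels = dbscan(points, eps=eps, min_pts=2)
    print(eps, count_clusters(labels), labels)
```

On this data the smallest radius marks every point as noise, the next one finds three groups, a larger one merges the two neighboring groups, and the largest one puts everything into a single cluster. Which of these results is useful depends on the question being asked, not on the algorithm.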

Evaluation

The difficulty for the user in evaluating data mining results is that on the one hand new insights are wanted, while on the other hand methods that produce them are hard to evaluate automatically. For prediction problems such as classification, regression analysis and association analysis, the prediction on new data can be used for evaluation. For descriptive problems such as outlier detection and cluster analysis this is more difficult. Clusters are usually evaluated internally or externally, i.e. on the basis of their mathematical compactness or their agreement with known classes. The results of outlier detection methods are compared with known outliers. In both cases, however, the question arises whether this evaluation really fits the task of finding "new knowledge" or whether it ultimately rates the "reproduction of old knowledge".
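An external evaluation can be sketched with the Rand index, one of several measures that compare a clustering with known classes. It counts the pairs of objects on which both groupings agree (both put the pair together, or both keep it apart). The labels below are hypothetical.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of object pairs that both groupings treat the same way."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
        total += 1
    return agree / total

known_classes = ["a", "a", "a", "b", "b", "b"]
found_clusters = [0, 0, 1, 1, 1, 1]  # third object placed in the "wrong" cluster

print(rand_index(known_classes, found_clusters))
print(rand_index(known_classes, [1, 1, 1, 0, 0, 0]))  # 1.0, label names do not matter
```

A score of 1 means the clustering reproduces the known classes exactly, which illustrates the concern raised above: the measure rewards the reproduction of old knowledge rather than the discovery of new structure.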

Interpretation

As statistical methods, the algorithms analyze the data without background knowledge of their meaning. The methods can therefore usually provide only simple models such as groups or mean values. Often the results as such are no longer comprehensible. These machine-obtained results must then still be interpreted by the user before they can really be called knowledge.

Areas of application

In addition to applications in related areas of computer science, data mining is increasingly being used in industry:

  • Decision support systems
  • In the financial sector: auditing for fraud detection
  • Market segmentation, for example grouping customers by similar buying behavior or interests for targeted advertising
  • Market basket analysis for price optimization and product placement in the supermarket
  • Target group selection for advertising campaigns
  • Customer profiling for managing customer relationships in customer relationship management systems
  • Business intelligence
  • Attack detection
  • Recommendation services for products such as movies and music
  • Network analysis in social networks
  • Web usage mining to analyze user behavior
  • Text mining for analyzing large text collections

Legal, moral and psychological aspects

Data mining as a scientific discipline is initially value-neutral. The methods allow the analysis of data from almost any source, for example measurements of components or the analysis of historical bone finds. If the analyzed data relate to persons, however, significant legal and moral problems arise; typically, though, these arise already with the collection and storage of the data, not only with the analysis, and regardless of the specific analysis method used (statistics, database queries, data mining, ...).

Legal aspects

Data that have been insufficiently anonymized can possibly be re-assigned to specific persons (de-anonymized) through data analysis. Typically, however, one would not use data mining for this but simpler, specialized analysis methods for de-anonymization. Such an application, and above all the insufficient anonymization that preceded it, may then be illegal (under data protection law). Researchers succeeded, for example, in uniquely identifying user profiles in a social network on the basis of just a few questions. If, for example, movement data are only pseudonymized, a simple database query (technically not data mining!) is often enough to identify the user as soon as their place of residence and workplace are known: most people can be uniquely identified by the 2-3 places where they spend most of their time.

Data protection law generally speaks of the "collection, processing or use" of personal data, since this problem arises not only with data mining but also with other analysis methods (e.g. statistics). Reliable protection against abusive analysis is only possible if the data in question are not collected and stored in the first place.

Moral aspects

Applying data mining methods to personal data also raises moral questions, for example whether a computer program should divide people into "classes". In addition, many of the methods are suitable for surveillance and for advanced dragnet investigations. The SCHUFA score, for example, is a division of people into the classes "creditworthy" and "not creditworthy" obtained through statistics, perhaps also data mining, and is criticized accordingly.

Psychological aspects

Data mining methods themselves work in a value-neutral way and only compute probabilities, without knowing what these probabilities mean. If people are confronted with the results of these calculations, however, this can provoke surprised, offended or alienated reactions. It is therefore important to consider whether and how to confront someone with such results.

Google gives its users insight into the target groups it has determined for them, unless they have opted out, and is often wrong. An American department store chain, however, can tell from purchasing behavior whether a customer is pregnant. With this information, shopping vouchers can be sent in a targeted way. Even a prediction of the due date is possible.

Software packages for data mining

  • Clustan, a statistical package focused on cluster analysis methods
  • Environment for Developing KDD-Applications Supported by Index-Structures (ELKI), with a focus on cluster analysis and outlier detection
  • GNU R project, with a focus on statistics, oriented toward scripting and programming languages
  • Konstanz Information Miner (KNIME)
  • RapidMiner (formerly YALE, "Yet Another Learning Environment"), with a focus on machine learning, covering all phases of the data mining process from data integration and transformation (ETL) through modeling, automatic optimization and evaluation to operational deployment and reporting
  • Waikato Environment for Knowledge Analysis (WEKA), with a focus on machine learning