Text Mining

Text mining, also called text data mining or textual data mining, is a collection of algorithm-based analysis methods for deriving high-quality information from text. Using statistical and linguistic means, text mining software extracts structures from texts that enable the user to identify the core information of the processed texts quickly. Ideally, text mining systems provide information of which the user does not know beforehand whether, and that, it is contained in the processed texts. Applied in a targeted way, text mining tools can also be used to generate hypotheses, test them, and refine them step by step.


Concept

Introduced into the research terminology in 1995 by Ronen Feldman and Ido Dagan as "knowledge discovery from text (KDT)", text mining is not a clearly defined term. In analogy to data mining within knowledge discovery in databases (KDD), text mining is a largely automated process of knowledge discovery in textual data, intended to make effective and efficient use of available text archives. Comprehensively, text mining can be seen as the process of compiling and organizing, formally structuring, and algorithmically analyzing large document collections for on-demand information extraction and for the detection of concealed content and of relations between texts and text fragments.

Typologies

The different views of text mining can be sorted into several types. Information retrieval (IR), document clustering, text data mining, and KDD are repeatedly referred to as subtypes of text mining.

In IR, it is known that the text data contain certain facts, which are to be found by means of suitable search queries. From the data mining perspective, text mining is understood as "data mining on textual data", the exploration of (interpretation-requiring) data from texts. The most far-reaching type of text mining is actual KDT, in which new, previously unknown information is to be extracted from the texts.

Related Procedures

Text mining is related to a number of other methods, from which it can be delineated as follows.

Text mining most closely resembles data mining. It shares many methods with data mining, but not the subject matter: while data mining is usually applied to highly structured data, text mining deals with much more weakly structured text data. In text mining, the primary data are therefore first given more structure so that they can be processed with methods of data mining. Unlike in most data mining tasks, multiple classifications are usually expressly desired in text mining.

Furthermore, text mining draws on methods of information retrieval, which are designed to discover those text documents that are relevant for answering a query. In contrast to text mining, no potentially unknown meaning structures are extracted from the full text material; instead, a set of individual documents hoped to be relevant is identified using known keywords.

Methods of information extraction aim to extract individual facts from texts. Information extraction often uses the same or similar process steps as text mining; it is therefore sometimes regarded as a branch of text mining. In contrast to (many other types of) text mining, however, at least the categories of the sought information are known here: the user knows what he does not know.

Methods of automatic text summarization (text extraction) create a condensate of a text or a text collection; unlike text mining, however, they do not go beyond the information explicitly available in the texts.

Areas of application

Web mining, and web content mining in particular, is an important field of application for text mining. Still relatively new are attempts to establish text mining as a method of social-science content analysis, for example sentiment detection for the automatic extraction of attitudes towards a topic.

Methodology

Text mining proceeds in several standard steps: first, suitable data material is selected. In a second step, the data are prepared so that they can subsequently be analyzed by means of various methods. Finally, the presentation of results makes up an unusually important part of the process. All process steps are supported by software.

Data material

Text mining is applied to a (usually very large) collection of text documents that show some similarity in terms of size, language, and topic. In practice, these data mostly come from large text databases such as PubMed or LexisNexis. The analyzed documents are unstructured in the sense that they have no uniform data structure; they are therefore also called "free format". Nevertheless, they have semantic, syntactic, often typographical, and occasionally also markup-specific structural features on which text mining techniques rely; one therefore also speaks of weakly structured or semi-structured text data. Usually the documents to be analyzed are taken from some universe of discourse (domain), which may be delineated more strongly (e.g. genomics) or less strongly (e.g. sociology).

Data preparation

The actual text mining requires computational-linguistic processing of the documents. This typically proceeds in the following, only partially automatable steps.

First, the documents are transferred into a common format, nowadays mostly XML.

For text representation, the documents are then tokenized, mostly into characters, words (terms), and/or so-called concepts. Across these units, the strength of the semantic meaning increases, but so does the complexity of their operationalization; hybrid methods are therefore often employed for tokenization.

In most languages, the words must then be lemmatized, i.e. reduced to their morphological base form (for verbs, for example, the infinitive). This is done by stemming.
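As a sketch, these tokenization and stemming steps might look as follows in Python. The suffix list and the minimum stem length are illustrative assumptions, not a full stemming algorithm such as Porter's:

```python
import re

def tokenize(text):
    """Split raw text into lowercase word tokens (a deliberately simple scheme)."""
    return re.findall(r"[a-z]+", text.lower())

# Toy suffix-stripping stemmer; real systems use a full stemmer
# or dictionary-based lemmatization instead.
SUFFIXES = ("ing", "ed", "es", "s")

def stem(token):
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

tokens = tokenize("The miners mined texts; mining reveals hidden relations.")
print(tokens)
print([stem(t) for t in tokens])
```

Note that such crude suffix stripping conflates "mined" and "mining" into the same stem, which is exactly the reduction to a common base form that the text representation requires.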

Dictionaries

To address some problems, digital dictionaries are required. A stop-word dictionary removes from the data those words for which little or no predictive power is expected, as is often the case with articles such as "a" or "an". To identify stop words, lists of the words occurring most frequently in the corpus are often created; in addition to stop words, these lists usually also contain most domain-specific expressions, for which dictionaries are normally created as well. The major problems of polysemy (the ambiguity of words) and synonymy (the equal meaning of different words) are likewise addressed by means of dictionaries. (Often domain-specific) thesauri, which attenuate the synonymy problem, are increasingly generated automatically from large corpora.
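A stop-word dictionary of this kind can be sketched in a few lines; the stop list and the miniature corpus below are hypothetical examples, while real stop dictionaries are built from corpus frequency lists plus domain-specific additions:

```python
from collections import Counter

# Hypothetical miniature stop list; in practice this is derived from
# the most frequent words of the corpus plus domain-specific entries.
STOPWORDS = {"a", "an", "the", "of", "and", "is"}

def remove_stopwords(tokens):
    """Drop tokens with little or no predictive power."""
    return [t for t in tokens if t not in STOPWORDS]

corpus = ["the", "analysis", "of", "a", "text", "is",
          "the", "core", "of", "text", "mining"]
filtered = remove_stopwords(corpus)
print(Counter(filtered).most_common(2))  # most frequent content words
```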

Depending on the type of analysis, phrases and words may be linguistically classified by part-of-speech tagging; often, however, this is not necessary for text mining. Further preparation steps include:

  • Pronouns (he, she) must be assigned to the preceding or following noun phrases (Goethe, the police) to which they refer (anaphora resolution).
  • Proper names of people, places, companies, governments, etc. must be detected, because they play a different role in constituting the meaning of the text than generic nouns.
  • Ambiguity of words and phrases is resolved by attributing exactly one meaning to each word and phrase (word-sense disambiguation).
  • Some words and phrases (or parts of them) can be assigned to a subject area (term extraction).

To better determine the semantics of the analyzed text data, topic-specific knowledge is usually drawn upon.

Analysis method

The actual text mining methods build on these partially structured data; they are mainly based on the discovery of co-occurrences, ideally between concepts. These methods:

  • make information that is implicitly present in texts explicit, and
  • make visible relationships between pieces of information that are represented in different texts.

Core operations of most methods are the identification of (conditional) distributions, frequent item sets, and dependencies. Machine learning, in both its supervised and its unsupervised variants, plays a large role in the development of such methods.
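The central co-occurrence operation can be sketched as simple pair counting over documents; the tiny genomics-flavoured corpus below is a made-up illustration:

```python
from collections import Counter
from itertools import combinations

def cooccurrences(documents):
    """Count how often each unordered term pair appears in the same document."""
    counts = Counter()
    for doc in documents:
        for pair in combinations(sorted(set(doc)), 2):
            counts[pair] += 1
    return counts

docs = [
    ["gene", "protein", "expression"],
    ["gene", "expression", "pathway"],
    ["protein", "pathway"],
]
print(cooccurrences(docs).most_common(3))
```

Pairs with unusually high counts relative to the individual term frequencies are the candidates for the "hidden relations" that text mining seeks.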

Cluster methods

In addition to the traditionally most widely used cluster analysis methods, k-means and hierarchical clustering, self-organizing maps are also used as clustering methods. Moreover, more and more methods rely on fuzzy logic.

k-means clustering

Very often, clusters in text mining are formed with k-means. The algorithm aims to minimize the sum of the (squared) Euclidean distances between each document and the center of the cluster it belongs to, across all clusters. The main problem is determining the number of clusters to be found, a parameter that the analyst must set using prior knowledge. Such algorithms are very efficient, but it may happen that only local optima are found.
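A minimal k-means sketch on toy two-dimensional document vectors (real text mining would use high-dimensional term vectors) illustrates the two alternating steps:

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ran empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters

points = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
centroids, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # two clusters of two points each
```

The dependence on the random initial centroids is exactly the source of the local-optima problem mentioned above.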

Hierarchical clustering

In hierarchical cluster analysis, which is also popular, documents are grouped by similarity into a hierarchical cluster tree (see figure). This process is considerably more computationally demanding than k-means clustering. In theory, one can proceed either by splitting the set of documents in successive steps, or by first treating each document as a separate cluster and then gradually merging the most similar clusters. In practice, however, usually only the latter approach leads to meaningful results. Besides runtime problems, a further weakness is that good results already require background knowledge about the expected cluster structure. As with all other clustering methods, a human analyst must ultimately decide whether the clusters found reflect meaningful structures.
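The latter, bottom-up (agglomerative) variant can be sketched as follows; single linkage is used here as one possible merge criterion, and the points are an illustrative stand-in for document vectors:

```python
def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def agglomerate(points, target):
    """Start with singleton clusters and repeatedly merge the two closest
    ones (single linkage) until only `target` clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > target:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist2(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

pts = [(0.0, 0.0), (0.1, 0.0), (4.0, 4.0), (4.1, 4.1)]
print([len(c) for c in agglomerate(pts, target=2)])
```

The pairwise distance scan in every merge step is what makes the approach so much more expensive than k-means.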

Self-organizing maps

The approach of self-organizing maps, first developed in 1982 by Teuvo Kohonen, is another common approach to clustering in text mining. Here, (usually two-dimensional) artificial neural networks are created. These have an input layer, in which each document to be classified is represented as a multidimensional vector and assigned to a neuron as its center, and an output layer, in which the neurons are activated according to the chosen distance measure.
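A one-dimensional toy version of a self-organizing map conveys the idea; the learning-rate and radius schedules are simplified assumptions, and real text mining maps are usually two-dimensional grids over high-dimensional term vectors:

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_som(vectors, grid_size=4, epochs=50, seed=1):
    """Each input vector pulls its best-matching neuron (and that
    neuron's grid neighbours) a little towards itself."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(grid_size)]
    for epoch in range(epochs):
        rate = 0.5 * (1 - epoch / epochs)  # decaying learning rate
        radius = max(1, grid_size // 2 - epoch * grid_size // (2 * epochs))
        for v in vectors:
            bmu = min(range(grid_size), key=lambda i: dist2(weights[i], v))
            for i in range(grid_size):
                if abs(i - bmu) <= radius:  # neighbourhood on the grid
                    weights[i] = [w + rate * (x - w) for w, x in zip(weights[i], v)]
    return weights

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
weights = train_som(docs)
print(weights)
```

Because neighbouring neurons are updated together, similar documents end up activating nearby positions on the map, which is what makes the result usable as a two-dimensional visualization.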

Fuzzy Clustering

Clusters based on fuzzy logic are also increasingly used, because many language entities, in particular deictic ones, can be adequately decoded only by a human reader, so that an inherent uncertainty arises in their computer-algorithmic processing. Because they account for this fact, fuzzy clusters usually deliver above-average results. Typically, fuzzy c-means is used. Other applications of this type rely on coreference cluster graphs.
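A compact fuzzy c-means sketch shows how the graded memberships capture that uncertainty; the five toy points (two clear groups plus an ambiguous midpoint) are an illustrative assumption:

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fuzzy_c_means(points, c=2, m=2.0, iterations=40, seed=0):
    """Every point gets a graded membership in every cluster instead of
    a hard assignment; the fuzzifier m controls the softness."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, c)]
    memberships = [[0.0] * c for _ in points]
    for _ in range(iterations):
        # update memberships from the distances to the current centroids
        for k, p in enumerate(points):
            d = [max(dist2(p, centroids[i]), 1e-12) ** 0.5 for i in range(c)]
            for i in range(c):
                memberships[k][i] = 1.0 / sum(
                    (d[i] / d[j]) ** (2 / (m - 1)) for j in range(c))
        # move each centroid to the membership-weighted mean of all points
        for i in range(c):
            w = [memberships[k][i] ** m for k in range(len(points))]
            total = sum(w)
            centroids[i] = [
                sum(wk * p[dim] for wk, p in zip(w, points)) / total
                for dim in range(len(points[0]))]
    return centroids, memberships

pts = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (2.5, 2.5)]
centroids, u = fuzzy_c_means(pts, c=2)
print([round(x, 2) for x in u[4]])  # the midpoint's memberships stay balanced
```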

Vector methods

A large number of text mining methods are vector-based. Typically, the terms appearing in the analyzed documents are represented in a two-dimensional T × D matrix, where T is the number of terms and D the number of documents. The value of each element is determined by the frequency of the term in the document; often this frequency is transformed, mostly by normalizing the column vectors of the matrix, i.e. dividing them by their sum. The resulting high-dimensional vector space is subsequently mapped into a significantly lower-dimensional one. Since 1990, latent semantic analysis (LSA), which traditionally relies on singular value decomposition, has played an increasingly important role. Probabilistic latent semantic analysis (PLSA) is a statistically more formalized approach that is based on latent class analysis and uses an EM algorithm to estimate the latent class probabilities.
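Building such a column-normalized term-document matrix is straightforward; the two miniature documents below are illustrative, and the subsequent dimensionality reduction (e.g. by singular value decomposition for LSA) is omitted from this sketch:

```python
def term_document_matrix(documents):
    """Build a T x D matrix of term frequencies and normalise every
    document column so that its entries sum to one."""
    terms = sorted({t for doc in documents for t in doc})
    matrix = [[doc.count(term) for doc in documents] for term in terms]
    # column-normalise: divide each document's counts by their sum
    for d in range(len(documents)):
        total = sum(row[d] for row in matrix)
        for row in matrix:
            row[d] = row[d] / total if total else 0.0
    return terms, matrix

docs = [["text", "mining", "text"], ["data", "mining"]]
terms, M = term_document_matrix(docs)
print(terms)
for row in M:
    print(row)
```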

Algorithms based on LSA, however, are computationally intensive: a typical desktop computer of 2004 vintage can hardly analyze more than a few hundred thousand documents. Covariance-based vector space methods achieve slightly worse, but less computationally expensive, results than LSA.

Analyzing relationships between documents via such dimensionality-reduced matrices makes it possible to identify documents that relate to the same facts even though their wording differs. Evaluating relationships between the terms of such a matrix makes it possible to derive associative relationships between terms, which often correspond to semantic relations and can be represented in an ontology.

Presentation of results

The presentation of results takes up an unusually important and complex part of text mining. It includes tools both for browsing and for visualizing the results. Often the results are presented on two-dimensional maps.

Software

A number of application programs for text mining exist; often these are specialized for specific knowledge domains. Technically, one can distinguish pure text miners, extensions of existing software (for example, for data mining or content analysis), and programs that cover only partial steps or areas of text mining.

Pure Text Miner

Generic applications

  • Megaputer TextAnalyst

TextAnalyst, developed by Megaputer, is one of the most frequently used text miners. It was one of the first text mining programs to be used in social science research.

  • Megaputer PolyAnalyst
  • Leximancer

Leximancer, a vector-based text miner developed at the University of Queensland from 2000 onward, draws on Bayesian co-occurrence statistics for its modeling of concept maps. The program's algorithm is based on a spring-force model for the many-body problem. Its output map visualizes word frequency by size, the closeness of the concepts to the overall context by the hierarchy of their appearance, and the relative closeness of the concepts to each other by rays of varying strength. The program thus lies at the upper end of the automation scale.

  • ClearForest Text Analytics Suite

A text miner that is no longer being developed further is IBM's WebFountain.

Domain-specific applications

  • GeneWays

GeneWays, developed at Columbia University, also covers all steps of text mining, but unlike the programs distributed by ClearForest it draws much more on domain-specific knowledge. The program is thematically limited to genetics research and devotes the majority of its tools to data preparation rather than to the actual text mining and the presentation of results.

  • Patent Researcher

Extensions of existing software suites

  • tm, text mining module for R
  • Text processing module for KNIME
  • RapidMiner
  • NClassifier
  • WordStat

WordStat, a software module offered by Provalis Research, is the only text mining program that is connected both with a statistical application (SimStat) and with software for computer-assisted qualitative data analysis (QDA Miner). The program is therefore particularly suitable for triangulating qualitative social science methods with quantitatively oriented text mining. It offers a number of clustering algorithms, including hierarchical clustering, and visualizations of the cluster results, such as multidimensional scaling.

  • Clementine

SPSS provides the Clementine module, which offers computational-linguistic methods for information extraction, is suitable for dictionary creation, and performs lemmatization for different languages.

  • SAS Text Miner

The SAS Institute offers, as an additional program to SAS Enterprise Miner, the SAS Text Miner, which provides a series of text clustering algorithms.

Providers of partial steps

  • LingPipe
  • HotMiner

Link Analysis

  • Pajek
  • UCINET
  • NetMiner