Information extraction

Information extraction (IE) is the engineering application of techniques from applied computer science, artificial intelligence, and computational linguistics to the problem of automatically processing unstructured information, with the aim of acquiring knowledge about a domain defined in advance. A typical example is the extraction of information about corporate mergers (merger events), in which instances of the relation merge(Company1, Company2, date) are extracted from online news. Information extraction is of great importance because much information exists in unstructured (not relationally modeled) form, for example on the Internet, and better information extraction makes this knowledge accessible.
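The merger example can be illustrated with a minimal sketch. It assumes a single hypothetical sentence pattern ("X merged with Y on DATE") and approximates company names as runs of capitalized words; the names `Merge`, `NAME`, `PATTERN`, and `extract_mergers` are invented for illustration and do not come from any particular IE system.

```python
import re
from typing import List, NamedTuple


class Merge(NamedTuple):
    """One instance of the relation merge(Company1, Company2, date)."""
    company1: str
    company2: str
    date: str


# Assumption: a company name is a run of capitalized words.
NAME = r"[A-Z][\w&.-]*(?: [A-Z][\w&.-]*)*"

# Assumption: mergers are reported as "<C1> merged with <C2> on <day Month year>".
PATTERN = re.compile(
    rf"(?P<c1>{NAME}) (?:has )?merged with (?P<c2>{NAME}) "
    rf"on (?P<date>\d{{1,2}} [A-Z][a-z]+ \d{{4}})"
)


def extract_mergers(text: str) -> List[Merge]:
    """Return every merge relation instance matched by the toy pattern."""
    return [Merge(m["c1"], m["c2"], m["date"]) for m in PATTERN.finditer(text)]


news = "It was announced that Acme Corp merged with Globex Inc on 3 May 2001."
for relation in extract_mergers(news):
    print(relation)
```

A single surface pattern like this misses most real phrasings and would absorb a capitalized sentence-initial word into the first company name. It is meant only to show what "extracting instances of a relation" means.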

Information Extraction

Information extraction can be viewed from two perspectives. On the one hand, it can be seen as the recognition of particular information; Grishman, for instance, describes IE as "the automatic identification of selected types of entities, relations, or events in free text" (Grishman 2003). On the other hand, it can be seen as the removal of information that is not wanted. The latter view is expressed in a definition by Cardie: "An IE system takes as input a text and 'summarizes' the text with respect to a prespecified topic or domain of interest" (Cardie 1997). In this sense, information extraction could also be described as targeted text extraction (see Euler 2001a, 2001b). Information extraction systems are therefore always tailored to at least one specific subject area, and usually to specific areas of interest (scenarios) within a broader subject area (domain). In the domain "business news", for example, a possible scenario would be "personnel changes in a management position". Neumann makes a further restriction when he writes that the goal of IE is "the construction of systems that can specifically track down and structure domain-specific information from free text [...]" (Neumann 2001, emphasis added). It should be noted that such a restriction has consequences for the technical implementation of an information extraction system.

Distinction from neighboring fields

Information extraction is an independent field of research and must be distinguished from related fields. Text extraction aims at a comprehensive summary of the content of a text. (Comprehensive automatic summarization is problematic in that human readers asked to pick out the most important parts of a text never fully agree unless it is specified in what respect the information should be important.) Text clustering is the autonomous grouping of texts, whereas text classification is the assignment of texts to predefined groups. Information retrieval can mean searching for documents within a document collection (full-text search) or, in the literal sense, the general task of retrieving information (cf. Strube et al. 2001). Data mining refers in general to the "process of identifying patterns in data" (Witten 2000:3).

Applications

In general, two kinds of application of information extraction are distinguished. First, the extracted data can be intended directly for a human reader. Examples of this kind include the system developed by Euler (2001a) for testing purposes, which forwards information extracted from e-mails as SMS messages, and a system that displays, alongside search engine results, information about the advertised positions extracted from job advertisements.

Second, the data can be intended for machine processing, whether for storage in databases, for text categorization or classification, or as the starting point for a comprehensive text extraction. Where the information sought is composed of several individual pieces of information, the application imposes certain requirements on the information extraction system: machine processing requires structured information, whereas an unstructured result may suffice when the output is processed directly by people.

If the information sought is not composed of other pieces of information, as in the recognition of proper names, this distinction is unnecessary.

Evaluation criteria

Information extraction systems are evaluated using the criteria customary in information retrieval: completeness and accuracy (recall and precision) and the F-measure computed from these two values. A further criterion for the quality of an extraction is the proportion of unwanted information (fall-out).
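The four criteria can be computed as in the following sketch. It assumes the standard information retrieval definitions, with fall-out taken as the share of non-relevant items that were nevertheless extracted. The function name `evaluate` and the set-based representation of items are illustrative choices, not part of any evaluation toolkit.

```python
def evaluate(extracted: set, relevant: set, universe: set, beta: float = 1.0) -> dict:
    """Compute precision, recall, F-measure, and fall-out for one extraction run.

    extracted: items the system returned
    relevant:  items that should have been returned (gold standard)
    universe:  all candidate items, relevant or not
    beta:      weight of recall relative to precision (1.0 gives the F1 measure)
    """
    true_pos = len(extracted & relevant)
    false_pos = len(extracted - relevant)
    non_relevant = universe - relevant

    precision = true_pos / len(extracted) if extracted else 0.0
    recall = true_pos / len(relevant) if relevant else 0.0

    denom = beta ** 2 * precision + recall
    f_measure = (1 + beta ** 2) * precision * recall / denom if denom else 0.0

    fall_out = false_pos / len(non_relevant) if non_relevant else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "fall_out": fall_out,
    }


scores = evaluate(
    extracted={1, 2, 3, 5, 6},
    relevant={1, 2, 3, 4},
    universe=set(range(1, 11)),
)
print(scores)
```

In this example three of the five extracted items are relevant (precision 0.6), three of the four relevant items were found (recall 0.75), and two of the six non-relevant items were extracted (fall-out of about 0.33).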

Message Understanding Conferences

Development in the relatively young research field of information extraction has been driven primarily by the Message Understanding Conferences (MUC). The seven MUCs were held from 1987 to 1997 by the Defense Advanced Research Projects Agency (DARPA), the central research and development agency of the United States Department of Defense. The prescribed scenarios were messages about naval operations (MUC-1, 1987, and MUC-2, 1989), terrorist activities (MUC-3, 1991, and MUC-4, 1992), joint ventures and microelectronics (MUC-5, 1993), personnel changes in business (MUC-6, 1995), and spacecraft and rocket launches (MUC-7, 1997) (Appelt and Israel 1999). Because a standardized output format was needed for joint evaluation, a common output template was used from the second MUC onward, which is why almost all information extraction systems deliver the extracted information as structured output; an exception is Euler (2001a, 2001b, 2002).

Summary

Information extraction systems can be used for a range of tasks, from the automatic analysis of job advertisements to preparation for a generic text extraction. Depending on these requirements, the systems can deliver structured or unstructured results. The systems can also differ greatly in linguistic depth, from extraction by selective summarization (Euler 2001a, 2001b, 2002) using pure sentence filtering, in which the only semantic guidance is a word list, to systems with analysis modules for all levels of language (phonology, morphology, syntax, semantics, and possibly pragmatics). In some areas our limited understanding of how natural language works has led development to stagnate. However, because information extraction is a more limited task than full text understanding, solutions appropriate to the requirements are often possible in the spirit of "appropriate language engineering" (Grishman 2003), perhaps particularly in combination with the neighboring fields. One example may be the method designed by Euler (2001a, 2001b, 2002), which, in contrast to the dominant IE systems, delivers only unstructured results. It achieves high F-measure values and requires little or even minimal annotation effort for the training corpus, which could mean a high degree of portability to new domains and scenarios, for instance by creating word lists en passant during text classification.
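The shallow end of this spectrum, sentence filtering guided only by a word list, can be sketched as follows. This is an illustrative assumption about what such filtering might look like, not a reconstruction of Euler's method: the function `filter_sentences`, the naive sentence splitter, and the example word list are all hypothetical.

```python
import re
from typing import Iterable, List


def filter_sentences(text: str, word_list: Iterable[str]) -> List[str]:
    """Keep only the sentences that contain at least one word from word_list."""
    keywords = {word.lower() for word in word_list}
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = []
    for sentence in sentences:
        tokens = set(re.findall(r"\w+", sentence.lower()))
        if tokens & keywords:  # at least one keyword occurs in the sentence
            kept.append(sentence)
    return kept


text = (
    "The board met on Monday. "
    "Anna Schmidt was appointed chief executive. "
    "The weather was fine."
)
print(filter_sentences(text, ["appointed", "resigned"]))
```

The result is unstructured: the output is a subset of the original sentences rather than a filled template, and moving to a new scenario only requires a new word list.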
