UIMA

UIMA (Unstructured Information Management Architecture) is a framework for programming data mining applications, specifically for knowledge extraction.

The UIMA project was started by IBM in 2005 and has been run by Apache since October 2006. The aim of the project is to provide a standardized framework for building applications that process unstructured information, in particular natural language (natural language processing, NLP). Unstructured information may be in any format, such as image or audio data, but text is the most common form.

The concept of UIMA is to implement a pipeline in which data is first read in, then passes through various analysis and processing steps, and is finally handed to one or more so-called consumers, which process the results further, for example by storing them in a database. In each analysis step the data is given specific annotations: a defined region of the data, for example a part of the text, receives a note. Because the pipeline stages are strongly modular, individual stages can easily be reused.

An example of a pipeline is a simple application that calculates the average number of words per sentence in a text. For this purpose, a first pipeline stage is required that reads the text, for example from a file. The second stage runs through the text and marks all words by identifying the positions of the spaces in the text. The third stage performs sentence recognition in an analogous way, by placing annotations from punctuation mark to punctuation mark. These two steps are independent of each other and can therefore be interchanged. The last pipeline stage then only has to divide the number of marked words by the number of marked sentences and output the result.
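The words-per-sentence pipeline described above can be sketched in a few lines. This is a plain Python illustration of the annotation idea, not the UIMA API: the names `Annotation`, `Document`, `annotate_words`, `annotate_sentences` and `run_pipeline` are hypothetical, and the reader stage simply wraps a string instead of opening a file.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    kind: str   # "word" or "sentence" (illustrative labels, not UIMA types)
    begin: int  # start offset in the text
    end: int    # end offset in the text (exclusive)


@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)

    def select(self, kind):
        return [a for a in self.annotations if a.kind == kind]


def read_text(source):
    # Stage 1 (reader): here the "file" is just a string passed in.
    return Document(text=source)


def annotate_words(doc):
    # Stage 2: mark words by locating the whitespace between them.
    start = None
    for i, ch in enumerate(doc.text + " "):
        if ch.isspace():
            if start is not None:
                doc.annotations.append(Annotation("word", start, i))
                start = None
        elif start is None:
            start = i
    return doc


def annotate_sentences(doc):
    # Stage 3: mark sentences from punctuation mark to punctuation mark.
    # Trailing text without closing punctuation is not counted.
    start = 0
    for i, ch in enumerate(doc.text):
        if ch in ".!?":
            if doc.text[start:i].strip():
                doc.annotations.append(Annotation("sentence", start, i + 1))
            start = i + 1
    return doc


def average_words_per_sentence(doc):
    # Consumer: divide the number of word annotations by the sentence count.
    words = doc.select("word")
    sentences = doc.select("sentence")
    return len(words) / len(sentences) if sentences else 0.0


def run_pipeline(source, stages):
    doc = read_text(source)
    for stage in stages:
        doc = stage(doc)
    return average_words_per_sentence(doc)


sample = "UIMA builds pipelines. Each stage adds annotations to the text."
result = run_pipeline(sample, [annotate_words, annotate_sentences])
swapped = run_pipeline(sample, [annotate_sentences, annotate_words])
```

Because both annotators only add annotations and never read each other's output, `result` and `swapped` are the same value, which mirrors the statement that the two stages can be interchanged.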

A possible extension is to count the number of verbs per sentence. To do this, a part-of-speech recognition stage would be inserted after the third stage, providing each word with an annotation such as "verb" or "noun", and the consumer would count only those words whose part-of-speech annotation is "verb"; all other parts of the pipeline can be reused. In such an application, UIMA takes over the management of the pipeline and the internal representation of the data to be processed together with its annotations. It also provides the developer with all the necessary interfaces for reading and writing the information.
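The verb-counting extension can be sketched in the same spirit. Again this is plain Python rather than the UIMA API: the tuple layout of an annotation, the function names, and in particular the tiny lookup table `TOY_LEXICON` are assumptions made for illustration; a real part-of-speech tagger would replace the lookup.

```python
import re

# Hypothetical toy lexicon standing in for a real part-of-speech tagger.
TOY_LEXICON = {
    "reader": "noun", "file": "noun", "tagger": "noun",
    "words": "noun", "verbs": "noun",
    "reads": "verb", "marks": "verb", "counts": "verb",
}


def annotate_words(text, annotations):
    # Word stage: every run of non-whitespace characters is a word.
    for m in re.finditer(r"\S+", text):
        annotations.append(("word", m.start(), m.end(), ""))


def annotate_sentences(text, annotations):
    # Sentence stage: spans running up to and including ".", "!" or "?".
    for m in re.finditer(r"[^.!?]+[.!?]", text):
        annotations.append(("sentence", m.start(), m.end(), ""))


def annotate_pos(text, annotations):
    # New stage: must run after the word stage, since it reads word annotations.
    words = [a for a in annotations if a[0] == "word"]
    for _, begin, end, _ in words:
        token = text[begin:end].strip(".,!?").lower()
        annotations.append(("pos", begin, end, TOY_LEXICON.get(token, "other")))


def verbs_per_sentence(annotations):
    # Consumer: count only part-of-speech annotations labelled "verb".
    verbs = sum(1 for a in annotations if a[0] == "pos" and a[3] == "verb")
    sentences = sum(1 for a in annotations if a[0] == "sentence")
    return verbs / sentences if sentences else 0.0


text = "The reader reads a file. The tagger marks words and counts verbs."
annotations = []
for stage in (annotate_words, annotate_sentences, annotate_pos):
    stage(text, annotations)
ratio = verbs_per_sentence(annotations)
```

Only `annotate_pos` and the consumer are new here; the word and sentence stages are unchanged from what a words-per-sentence pipeline would use, which is the reuse the text describes.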

UIMA is used mainly in research, but is also increasingly becoming an industry standard. One of the best-known applications of UIMA is its use in IBM Watson.
