Unstructured data

In economics, computer science and computational linguistics, unstructured data are digitized information that is not present in a formalized structure and therefore cannot be accessed in aggregated form by computer programs through a single interface. Examples are digital texts in natural language and digital sound recordings of human speech.

Classification

Unstructured data are distinguished from structured and semi-structured data. Consider an e-mail: it has a certain structure, since it contains a receiver, a sender, and possibly a title. This makes it an example of semi-structured data. The content of the e-mail itself, however, is unstructured.
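The e-mail example can be illustrated with a minimal sketch that uses the `email` module of the Python standard library. The sample message, the addresses and the helper name `split_email` are invented for illustration. The header fields can be read by name, while the body comes back as one opaque string.

```python
from email import message_from_string

# Hypothetical sample message, invented for illustration.
RAW_MESSAGE = """\
From: alice@example.com
To: bob@example.com
Subject: Quarterly report

Hi Bob, the numbers look good, but we should talk about the delay in Hamburg.
"""


def split_email(raw):
    """Return (structured header fields, unstructured body text)."""
    msg = message_from_string(raw)
    # The headers follow a fixed scheme and can be addressed by name.
    headers = {
        "sender": msg["From"],
        "receiver": msg["To"],
        "title": msg["Subject"],
    }
    # The body is free text: the parser returns it as a plain string.
    body = msg.get_payload()
    return headers, body


headers, body = split_email(RAW_MESSAGE)
print(headers["sender"])
print(body)
```

A program can reliably ask for the sender of any message, but it has no comparable way to ask what the body is about without further processing.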

The automated usability of unstructured data is limited by the fact that there is no data model for them and usually no metadata. Moreover, in text documents metadata and data are mixed. Modeling is required to derive structure from such data. The term unstructured data is also used in connection with documents that are filed without any data warehousing in place. Such documents are not indexed and therefore cannot be searched collectively.
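What indexing adds can be shown with a small inverted index, sketched here in Python. This is a simplified illustration, not a description of any particular product. The file names, the document texts and the function names `build_index` and `search` are hypothetical.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical filed documents, invented for illustration.
DOCUMENTS = {
    "memo.txt": "The contract renewal is due in March.",
    "notes.txt": "Call the supplier about the contract terms.",
    "minutes.txt": "Budget meeting moved to March.",
}

index = build_index(DOCUMENTS)
print(sorted(search(index, "contract")))
print(sorted(search(index, "contract march")))
```

Once the index exists, all documents can be queried together through one function instead of being opened and read one by one.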

Importance

Many data are unstructured at their origin. They gain structure only when human intervention brings them into a schema. This structuring process can be disadvantageous, since it is often associated with a loss of information. In the business environment, unstructured data often contain important information, and failing to find it may also cause legal problems. The fields of knowledge management and data management therefore deal with the integration and management of such data.

To provide unstructured data with structure, there is the open-source framework UIMA (Unstructured Information Management Architecture). It is a framework for creating applications that process unstructured information.

Treatment of unstructured data

For structuring the data in particular, the following methods may be considered:

Database index
Naive Bayes classifier
Latent semantic analysis
Data mining
Data modeling
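As an illustration of one of these methods, the following Python sketch implements a multinomial naive Bayes classifier with add-one (Laplace) smoothing, which assigns a category label to a free text. The training texts, the labels "finance" and "planning", and the function names are invented for illustration. A real application would need far more training data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split a text into lowercase words."""
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """Count documents per label and words per label from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in label_counts.items():
        score = math.log(doc_count / total_docs)  # log prior of the label
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # words never seen in training are ignored
            count = word_counts[label][word] + 1  # add-one smoothing
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data, invented for illustration.
TRAINING = [
    ("invoice payment due amount", "finance"),
    ("payment received for invoice", "finance"),
    ("meeting agenda and schedule", "planning"),
    ("schedule the project meeting", "planning"),
]

model = train(TRAINING)
print(predict(model, "please send the invoice payment"))
```

The predicted label is a piece of metadata that the original text did not carry, so classified documents can afterwards be filtered or grouped by category.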
