Document classification

Classifying texts by how they are structured is an important step in information extraction.

Different methods are applied to differently structured texts. These methods differ in characteristics such as complexity, constraints, and the extraction process itself. Examples include language-based methods (e.g. Perl) and methods based on wrapper induction. It is therefore necessary to classify the texts to be parsed.

Texts are divided according to their degree of structure:

  • Natural and unstructured plain text,
  • Structured information,
  • Semi-structured texts.

Natural and unstructured plain text

Natural, unstructured plain text is processed by systems that perform morphological and syntactic analysis. This procedure is very costly and sometimes unnecessary, because the information sought can often be found using simple patterns.
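
As a minimal sketch of the simple-pattern approach, the following Python example pulls dates and e-mail addresses out of plain text with regular expressions, without any linguistic analysis. The sample text, the function name `extract_simple`, and both patterns are assumptions made for illustration; the e-mail pattern in particular is deliberately simplified.

```python
import re

# Hypothetical sample text, used only for illustration.
TEXT = "The meeting is on 2021-03-15. Contact: alice@example.org or bob@example.com."

# ISO-style dates (YYYY-MM-DD) and a deliberately simplified e-mail pattern.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def extract_simple(text):
    """Return dates and e-mail addresses found by plain pattern matching."""
    return {
        "dates": DATE_RE.findall(text),
        "emails": EMAIL_RE.findall(text),
    }


result = extract_simple(TEXT)
print(result)
```

Patterns like these are cheap to run but only work when the information sought has a predictable surface form.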

Structured information

Structured information comprises tables and relational databases. No linguistic analysis is required here: to find the information you are looking for, it suffices to recognize the structure.
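
The idea that recognizing the structure is enough can be sketched as follows. This example assumes the table is available as comma-separated text with a header row; the data, the column names, and the helper `extract_column` are hypothetical.

```python
import csv
import io

# Hypothetical table data; the column names are assumptions for illustration.
TABLE = """name,city,year
Ada,London,1815
Konrad,Berlin,1910
"""


def extract_column(table_text, column):
    """Read a delimited table and return the values of one named column."""
    reader = csv.DictReader(io.StringIO(table_text))
    return [row[column] for row in reader]


cities = extract_column(TABLE, "city")
print(cities)
```

Once the header row identifies the columns, the extraction reduces to a lookup by column name.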

Semi-structured texts

HTML documents are referred to as semi-structured texts and represent a major challenge for information extraction systems. Their structure is not uniform: some parts are marked up with HTML tags, while others are natural text. To extract the information, the system must recognize both the HTML structure and the text. The HTML tags are an important clue to the structure.
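
A minimal sketch of how HTML tags can serve as a clue to the structure, using the `html.parser` module from the Python standard library. The class name `CellAndTextExtractor` and the sample document are assumptions for illustration; the sketch only separates text inside table cells from the remaining natural text and does not handle nested tables or malformed markup.

```python
from html.parser import HTMLParser


class CellAndTextExtractor(HTMLParser):
    """Collect text inside <td> cells separately from all other text."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []      # text found inside table cells
        self.free_text = []  # text found anywhere else

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # skip whitespace between tags
        if self.in_cell:
            self.cells.append(text)
        else:
            self.free_text.append(text)


# Hypothetical sample document mixing natural text and a table.
HTML = """<html><body>
<p>Prices were updated last week.</p>
<table><tr><td>Widget</td><td>4.50</td></tr></table>
</body></html>"""

parser = CellAndTextExtractor()
parser.feed(HTML)
parser.close()
print(parser.cells)
print(parser.free_text)
```

The tagged cells can then be handled like structured information, while the remaining text may still need pattern matching or linguistic analysis.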
