Document Retrieval

Document retrieval refers to the computer-assisted process of retrieving documents that may be relevant to a user's information need. The user expresses this information need in the form of a search query. Document retrieval is often also referred to as information retrieval; in most cases the terms are used interchangeably.

Documents hold a company's memory. Poor access to the contents of these documents means poor access to the knowledge that an organization has produced over time or possesses. Document retrieval is therefore of enormous importance, because information that is no longer accessible has to be worked out again.

History

Even before the Middle Ages, people organized information so that it could be retrieved and used at a later time. The simplest example is the table of contents of a book: it consists of sets of words or terms that are linked to the pages where information on these terms can be found. Such an index is part of every information system.

In 1945, Vannevar Bush described in his article As We May Think the vision of a system he called Memex, a kind of extension of the brain. In it, an individual stores all of his information and records and can retrieve them quickly and flexibly.

Since the 1940s, increasing attention has been paid to the problem of storing information and locating it again efficiently. The reason was a rapid growth in the amount of information, together with a demand for faster access to it. The space needed to keep this information on paper, in folders and offices, soon was no longer sufficient. The digitization of data began, and with it the problems of efficient storage and retrieval took center stage. The invention of the CD opened a new way to store data compactly and also to distribute it easily. Retrieval methods were researched, but only a few tests took place at a commercially applicable scale. With the opening of the Internet, every user gained the possibility of publishing information on the net. Modern search engines try to master this flood of information. Ever since the first generation of document retrieval systems, research has faced the central question of which information is relevant. However, an understanding of this question and the tools needed to design and operate document retrieval systems for such amounts of information were still not fully available even at the beginning of the 21st century. Repeated incidents in which companies lost large sums of money because of inadequate document control confirm this.

The first commercial document retrieval systems were:

  • DIALOG was designed by Lockheed and gave access to published research articles.
  • LexisNexis provided specialized databases.
  • STAIRS was developed by IBM and was intended for free-text search.
  • FAIRS was developed by Fujitsu (Japan) and is similar to STAIRS.
  • GOLEM is an interactive database system from Siemens.
  • GRIPS was developed by the German Institute of Medical Documentation and Information (DIMDI).

Definition

A Document Retrieval System (DRS) is understood to be the totality of the methodological foundations, technical procedures, and devices that enable the largely computerized provision of information. This information may consist of sound, images, video, and text. What is essential is the interaction of the two components: making information accessible (indexing) and recovering it (retrieval).

The representation of the content-related characteristics of a document in a form usable for document retrieval is referred to as document content description. Producing such content characteristics is called indexing. According to DIN 31623, indexing covers all methods and their applications that lead to the assignment of descriptors and subject terms to documents for the purpose of content description and targeted retrieval. The retrieval process is commonly referred to as a search. The result of the search, i.e. the set of documents returned by the document retrieval system, is called the system proposal.

Recall and precision are commonly used as measures of the quality of document retrieval. Recall (completeness of the search) is the ratio of the number of relevant documents in the system proposal to the number of all relevant documents with respect to the query. Precision (accuracy of the search) is the proportion of relevant documents among all documents in the system proposal. Since these values say little on their own, they are often combined in a so-called recall-precision graph.
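The two measures can be computed directly from their definitions. The following is a minimal sketch; the function name `precision_recall` and the document identifiers are illustrative assumptions, not part of any particular DRS.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: documents in the system proposal
    relevant:  all documents that are relevant to the query
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents in the system proposal
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example data: four documents returned, three relevant overall.
system_proposal = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d2", "d4", "d7"]
p, r = precision_recall(system_proposal, relevant_docs)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four returned documents are relevant (precision 0.5), and two of the three relevant documents were found (recall about 0.67).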

Relevance is considered a key concept in the theory of IR systems. According to Saracevic, relevance is a measure of the correspondence between a document and a query from the perspective of a neutral arbiter. The user's notion of relevance (also referred to as pertinence) and that of the system rarely coincide. This reveals a central problem of document retrieval: before a search query is issued (and especially at indexing time), it is not possible to determine which information will be relevant to future users.

Further definitions

  • A DRS does not inform the user about the subject of his query. It only provides information about the existence or non-existence and the location of documents that may be relevant to his search query.
  • A DRS comprises the hardware and software that help to provide users with the requested information, if it is available. The main goal of a DRS is to minimize the effort the user needs to find the required information.
  • Document retrieval is the computer-assisted process of retrieving documents. A user submits a request in the form of a query and receives a list of documents sorted by relevance. These documents may or may not contain the information he is looking for. The ordering of the system proposal does not necessarily match the user's own notion of relevance.

Distinction from Data Retrieval

The following table compares some differences between document retrieval and traditional data retrieval. For a detailed discussion of the differences and similarities, the interested reader is referred to the literature.

In data retrieval, the user usually searches for an exactly specified object, for example "Bob's address". The search either returns the requested object (Bob's address) or establishes that it is not present in the database. A corresponding query in SQL might look like this: SELECT address FROM employees WHERE name = 'Bob'. This query is fully specified in an artificial language. It is answered either with Bob's address or with a message that Bob's address does not exist in the dataset. The result of the search is correct only if Bob's correct address is returned. The outcome of the search is deterministic: either the correct data is present or it is not.
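The deterministic character of such a lookup can be illustrated with a small sketch using Python's built-in sqlite3 module. The table layout, the sample address, and the helper name `lookup_address` are assumptions made for the example.

```python
import sqlite3

# In-memory database with a hypothetical employees table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT PRIMARY KEY, address TEXT)")
conn.execute("INSERT INTO employees VALUES (?, ?)", ("Bob", "12 Example Street"))


def lookup_address(connection, name):
    """Return the stored address for name, or None if no such row exists."""
    row = connection.execute(
        "SELECT address FROM employees WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None


print(lookup_address(conn, "Bob"))    # the stored address
print(lookup_address(conn, "Alice"))  # None: the object is not in the database
```

There is no ranking and no notion of partial relevance: the query either matches a record exactly or it does not.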

In document retrieval, the user is not looking for Bob's address but, for example, for information about the neighborhood in which Bob lives. First, it is not clear what a query that yields this information should look like. For a possible query such as "Bob address neighborhood", the DRS returns suggestions that the user can then browse for useful information. The user's information need is expressed in natural language here, but it is not completely specified. For a complete specification, the user would have to know exactly what he is looking for. It is also not clear which suggestions the DRS will make and whether it can and will provide the requested information. The underlying model is therefore probabilistic. Because of these uncertainties, a search result cannot be labeled correct or incorrect. The documents presented to the user can be useful or useless to him. Accordingly, the criterion for the success of a search is the benefit to the user.

Structure of a Document Retrieval System

Indexing

The task of indexing is to assign a set of index terms or keywords to documents. The index terms should:

  • Reflect the content of the document as completely as possible.
  • Describe the document so that it can be distinguished as clearly as possible from documents with similar content.

These keywords can be generated either manually by an indexer or automatically. They provide a logical view of a document. The best representation of a document is its full content. However, this requires a great deal of storage space for the index, which would then be as large as the documents it indexes. A document representation must therefore be found that satisfies the two requirements listed above as fully as possible. This process usually consists of the following steps.

First, special characters are removed according to prescribed rules, and common words such as articles and conjunctions are removed using a stop list. A stop list contains all words that are irrelevant for describing the content of a document; these words are removed from the text. They are then also excluded from search queries, which simplifies the search process. In addition, this step reduces the size of the original document by 30-50%.
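This first step can be sketched in a few lines. The stop list below is a tiny hypothetical sample, and the rule for special characters (keep only letters and digits) is an assumption; real systems use much longer lists and more careful rules.

```python
import re

# Hypothetical, deliberately short stop list.
STOP_LIST = {"the", "a", "an", "and", "or", "of", "in", "is", "to"}


def remove_stop_words(text, stop_list=STOP_LIST):
    """Lowercase the text, drop special characters, and filter stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in stop_list]


print(remove_stop_words("The index of a book is a list of terms."))
# ['index', 'book', 'list', 'terms']
```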

Then all words are reduced to their stem by removing their suffixes (so-called stemming). All words that are semantically equivalent are thus mapped to the same stem, for example the terms drive, driver, and driving. The assumption behind stemming is that words with the same stem belong to the same word family and can therefore be treated as equivalent. However, this simplification can also lead to errors, because there are words with the same stem but different meanings, such as neutron and neutralize. Moreover, identical words can have different meanings in different contexts. The result of this processing step is a class for each stem. If a word belonging to a class occurs in a document, that class is assigned to the document as a keyword.
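The idea of mapping words to stem classes can be illustrated with a deliberately naive suffix stripper. This is only a toy sketch: the suffix list and the names `naive_stem` and `stem_classes` are assumptions for the example, and it is not an established stemming algorithm.

```python
from collections import defaultdict

# Hypothetical suffix list, checked in order (longer suffixes first).
SUFFIXES = ("ing", "ers", "er", "es", "s", "e")


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def stem_classes(words):
    """Group words into classes keyed by their stem."""
    classes = defaultdict(set)
    for word in words:
        classes[naive_stem(word)].add(word.lower())
    return dict(classes)


print(stem_classes(["drive", "driver", "drivers", "driving"]))
```

All four words fall into the single class "driv", so a document containing any of them would receive the same keyword.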

Finally, all index terms are weighted according to the model implemented in the DRS. An index is then created that allows a fast search over the set of index terms by linking them to the documents in which they occur. If necessary, further important information, such as the position of the term in the document or the author, can be stored as well. A frequently encountered index structure is the inverted file. Other data structures, such as sequential files, index-sequential files, and multi-lists, are described in Chapter 4 of the referenced literature.
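An inverted file links each index term to the documents that contain it. The sketch below also stores term positions, as mentioned above; the dictionary-based layout and the name `build_inverted_index` are assumptions for illustration, and term weighting is left out.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]}.

    docs: dict mapping a document id to its list of index terms.
    """
    index = defaultdict(dict)
    for doc_id, terms in docs.items():
        for position, term in enumerate(terms):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


# Hypothetical documents, already reduced to index terms.
docs = {
    "d1": ["document", "retrieval", "system"],
    "d2": ["data", "retrieval"],
    "d3": ["document", "index", "document"],
}
index = build_inverted_index(docs)
print(index["document"])   # {'d1': [0], 'd3': [0, 2]}
print(index["retrieval"])  # {'d1': [1], 'd2': [1]}
```

A query term can then be resolved with a single dictionary lookup instead of a scan over all documents.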

Clustering can also be used, in which similar documents are assigned to a cluster. Searching such a pre-classified collection is called cluster search and proceeds in two steps. First, only the clusters with high relevance are sought. Then the documents in these clusters are inspected and the most relevant ones retrieved. Clustering is intended to increase the efficiency of document retrieval systems by reducing the number of document comparisons required. Clearly, this can reduce their effectiveness.
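The two-step procedure might look roughly as follows. This sketch assumes that documents have already been assigned to clusters, represents each cluster by the summed term counts of its documents, and uses cosine similarity; all of these choices, and the name `cluster_search`, are assumptions for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_search(query, clusters, top_clusters=1):
    """Step 1: pick the most similar clusters. Step 2: rank only their documents."""
    q = Counter(query)
    centroids = {}
    for name, docs in clusters.items():
        centroid = Counter()
        for terms in docs.values():
            centroid.update(terms)
        centroids[name] = centroid
    best = sorted(centroids, key=lambda n: cosine(q, centroids[n]), reverse=True)
    scored = [
        (cosine(q, Counter(terms)), doc_id)
        for name in best[:top_clusters]
        for doc_id, terms in clusters[name].items()
    ]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]


# Hypothetical pre-clustered collection.
clusters = {
    "retrieval": {"d1": ["document", "retrieval", "index"],
                  "d2": ["query", "retrieval", "ranking"]},
    "cooking": {"d3": ["recipe", "pasta", "sauce"],
                "d4": ["index", "recipe", "bread"]},
}
print(cluster_search(["index", "retrieval"], clusters))  # ['d1', 'd2']
```

Only two documents are compared with the query instead of four, but document d4, which also contains the term "index", is never inspected. This is the loss of effectiveness mentioned above.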

Retrieval

The process of locating the information a user would like to receive consists of several steps. First, the information need must be transformed into a form the search engine understands, called a query. This query is then converted into a query representation. A query passes through most of the processes that documents go through during indexing. All operations described below run while the user waits for a response to the query. First, phrases that are irrelevant to the search, such as "I am looking for information about:", and special characters are removed. Then irrelevant terms are removed using the stop list, and stemming is applied as well. Finally, the query representation is generated; where necessary, logical operators for the search algorithm can also be inserted. It is also possible to expand the query terms and to include related terms that are associated with the searched term. These related terms may be synonyms found in electronic thesauri, or terms that stand in a special relationship to the query term because of semantic properties (e.g. a certain word order). This processing step frees the user from having to try all variants of a query in order to obtain as many relevant documents as possible in the search results. Recall may thereby be increased, but precision decreases when the expanded terms lead to the retrieval of irrelevant documents.
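The query-side steps can be sketched as a small pipeline. The introductory phrase list, the stop list, and the thesaurus below are tiny hypothetical samples, stemming is omitted for brevity, and the name `build_query_representation` is an assumption for the example.

```python
import re

# Hypothetical resources; a real DRS would use far larger ones.
INTRO_PHRASES = ("i am looking for information about",)
STOP_LIST = {"the", "a", "an", "and", "or", "of", "in", "for", "about"}
THESAURUS = {"car": ["automobile"], "repair": ["fix"]}


def build_query_representation(raw_query, expand=True):
    """Turn a natural-language request into a list of query terms."""
    text = raw_query.lower()
    for phrase in INTRO_PHRASES:  # drop phrases irrelevant to the search
        text = text.replace(phrase, " ")
    # Drop special characters, then stop words.
    terms = [t for t in re.findall(r"[a-z0-9]+", text) if t not in STOP_LIST]
    if not expand:
        return terms
    expanded = []
    for term in terms:  # query expansion with related terms
        expanded.append(term)
        expanded.extend(THESAURUS.get(term, []))
    return expanded


print(build_query_representation("I am looking for information about: car repair"))
# ['car', 'automobile', 'repair', 'fix']
```

With expansion enabled, documents that mention only "automobile" can now be found (higher recall), at the risk of also matching documents that use "fix" in an unrelated sense (lower precision).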

Finally, the actual search is performed. The search algorithms used are determined by the model implemented in the DRS. The index is searched for documents that contain terms of the query. For each document, a so-called similarity score with respect to the query is calculated. The calculation is performed with an algorithm that is likewise determined by the implemented model. The documents are then sorted, or ranked, according to their similarity scores. The sorted list (possibly with a brief description of each document) is made available to the user, who can look at the list or at the contents of a document in more detail. Some systems also offer user-based relevance feedback, which lets the user mark the documents that are relevant to him. The system then starts a new search based on these ratings and returns a revised list that (hopefully) contains more documents relevant to the user. The relevance feedback process can be repeated as often as desired.
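The scoring and ranking step can be sketched as follows. Since the actual scoring algorithm depends on the implemented model, this example simply assumes cosine similarity over raw term counts; the function names and sample documents are hypothetical, and relevance feedback is not shown.

```python
import math
from collections import Counter


def similarity(query_terms, doc_terms):
    """Cosine similarity between query and document term counts."""
    q, d = Counter(query_terms), Counter(doc_terms)
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in d.values())
    )
    return dot / norm if norm else 0.0


def rank(query_terms, docs):
    """Return ids of matching documents, best similarity score first."""
    scored = [(similarity(query_terms, terms), doc_id) for doc_id, terms in docs.items()]
    matches = [(score, doc_id) for score, doc_id in scored if score > 0]
    return [doc_id for score, doc_id in sorted(matches, key=lambda x: (-x[0], x[1]))]


# Hypothetical documents, already reduced to index terms.
docs = {
    "d1": ["document", "retrieval", "system"],
    "d2": ["data", "retrieval"],
    "d3": ["cooking", "recipe"],
}
print(rank(["document", "retrieval"], docs))  # ['d1', 'd2']
```

Document d1 shares both query terms and ranks first, d2 shares one, and d3 shares none and is left out of the system proposal.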

Theoretical document retrieval models

The following theoretical models are implemented in document retrieval systems. The choice of model affects the search algorithms and the calculation of rankings and scores. These models are described in detail in Chapter 2.

Classical models:

  • Boolean model
  • Vector space model
  • Probabilistic model

Modern probabilistic models:

  • Bayesian Networks

Alternative paradigms:

  • Extended Boolean model
  • Generalized vector space model
  • Semantic indexing
  • Neural Networks
  • Fuzzy retrieval