XML retrieval

XML retrieval, or XML information retrieval, is the content-based retrieval of documents that are structured with the Extensible Markup Language (XML).

Queries

Most approaches to XML retrieval are based on techniques from the field of information retrieval (IR) and compute, for example, the similarity between a keyword query and a document. In XML retrieval, a query may additionally contain structural hints. So-called content-and-structure (CAS) queries let the user specify the XML structure that must or may contain the desired search term.
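
As an illustration, the difference between a content-only query and a CAS query can be sketched with Python's standard xml.etree.ElementTree module. The sample document, the helper name cas_match, and the use of a simple path expression as the structural hint are assumptions made for this sketch; real CAS query languages are richer and rank results instead of filtering them.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
DOC = """
<article>
  <title>XML retrieval basics</title>
  <section>
    <title>Ranking</title>
    <p>Elements are ranked by relevance.</p>
  </section>
  <footnote>See also ranking in databases.</footnote>
</article>
"""

def cas_match(root, path, keyword):
    """Return the elements selected by `path` whose text contains `keyword`."""
    keyword = keyword.lower()
    return [
        el for el in root.findall(path)
        if keyword in " ".join(el.itertext()).lower()
    ]

root = ET.fromstring(DOC)

# Content-only query: the keyword may occur in any element below the root.
content_only = cas_match(root, ".//*", "ranking")

# CAS query: the keyword must occur in the title of a section.
cas = cas_match(root, "./section/title", "ranking")
```

The content-only query also returns the footnote and the enclosing section, while the structural hint narrows the result to the section title.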

Use of XML structure

The self-describing structure of XML documents can, in some cases, considerably improve search over XML documents. This includes exploiting CAS queries, assigning different weights to different XML elements (so that, for example, a title element is weighted higher than a footnote), and the focused retrieval of document parts.
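
Element weighting can be sketched as follows. The weight values, the TAG_WEIGHTS table, and the function weighted_term_score are hypothetical and chosen only to show the idea that an occurrence in a title counts more than one in a footnote. The sketch assumes documents without mixed content, so only the text directly inside each element is counted.

```python
import xml.etree.ElementTree as ET

# Hypothetical weights: a hit in a title counts more than one in a footnote.
TAG_WEIGHTS = {"title": 3.0, "p": 1.0, "footnote": 0.5}

def weighted_term_score(root, term):
    """Sum the term occurrences of all elements, scaled by each element's weight."""
    term = term.lower()
    score = 0.0
    for el in root.iter():
        # Use only the element's own leading text so nested text is not counted twice.
        own_tokens = (el.text or "").lower().split()
        score += TAG_WEIGHTS.get(el.tag, 1.0) * own_tokens.count(term)
    return score

doc = ET.fromstring(
    "<article>"
    "<title>XML retrieval</title>"
    "<p>Retrieval of XML elements</p>"
    "<footnote>XML is a W3C standard</footnote>"
    "</article>"
)

xml_score = weighted_term_score(doc, "xml")              # title + p + footnote
retrieval_score = weighted_term_score(doc, "retrieval")  # title + p
```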

Ranking

In XML retrieval, the ranking, i.e. the relevance score of a document, can take both content similarity and structural similarity into account. Structural similarity is the similarity between the structure specified in the CAS query and the structure of the document being assessed. Moreover, the result of a structured query can be either a whole document or an arbitrarily deeply nested XML element of a document. The goal is to find the smallest result with the highest relevance. Here relevance is to be understood as specificity, i.e. the extent to which the result is focused on the desired information.
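
One possible way to combine the two kinds of similarity is sketched below. It is an assumption-laden toy model, not a method taken from the literature: content similarity is approximated by specificity (the share of an element's tokens that are query terms), structural similarity by the share of the query path matched at the end of the element's path, and the two are mixed with a hypothetical parameter alpha. Every element is a candidate result, so small, focused elements can outrank their enclosing document.

```python
import xml.etree.ElementTree as ET

def content_score(el, query_terms):
    """Specificity: share of the element's tokens that are query terms."""
    toks = " ".join(el.itertext()).lower().split()
    if not toks:
        return 0.0
    return sum(1 for t in toks if t in query_terms) / len(toks)

def structure_score(el_path, query_path):
    """Share of the query path that matches the end of the element's path."""
    matched = 0
    for a, b in zip(reversed(el_path), reversed(query_path)):
        if a != b:
            break
        matched += 1
    return matched / len(query_path)

def rank(root, query_terms, query_path, alpha=0.5):
    """Score every element containing a query term; best result first."""
    results = []

    def walk(el, path):
        path = path + [el.tag]
        c = content_score(el, query_terms)
        if c > 0:
            s = structure_score(path, query_path)
            results.append((alpha * c + (1 - alpha) * s, "/".join(path)))
        for child in el:
            walk(child, path)

    walk(root, [])
    return sorted(results, key=lambda r: -r[0])

doc = ET.fromstring(
    "<article><title>Databases</title>"
    "<section><title>XML ranking</title>"
    "<p>Ranking of XML elements uses relevance scores</p></section>"
    "</article>"
)
ranked = rank(doc, {"xml", "ranking"}, ["section", "title"])
best_score, best_path = ranked[0]
```

In this example the section title is both the most specific element and the one that matches the requested structure, so it is ranked ahead of the section and of the whole article.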

XML search engine

The Initiative for the Evaluation of XML Retrieval (INEX) was founded in 2002 and provides a platform for evaluating such algorithms. Three areas influence XML retrieval:

  • XML query languages: Query languages such as the W3C standard XQuery support complex queries but only exact matching, so there is no relevance computation and no ranking of results. They must therefore be extended with relevance computation to make vague search possible. Most XML-based approaches require precise knowledge of the documents' underlying schema (XML Schema or DTD).
  • Databases: Classical database systems now also offer the option of storing semi-structured data, which has led to the development of XML databases. Such approaches are often very formal, focus more on the search itself than on ranking, and are intended for experienced users who can formulate complex queries.
  • Information retrieval: Traditional information retrieval models such as the vector space model are based on relevance computation but make no use of document structure and allow only simple queries. They also assume a static notion of a document, so results usually consist of complete documents. However, they can be extended to take structural information into account and to retrieve dynamically determined document parts. Such approaches use document subtrees (index terms plus structure) as the dimensions of the vector space.
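
The last point can be sketched in a simplified form. Instead of full subtrees, this sketch uses (element path, term) pairs as vector-space dimensions, which is an assumption made to keep the example short. The function names structural_terms and cosine and the sample documents are likewise hypothetical. A term then only contributes to similarity when it occurs in the same structural context in both the query and the document.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter

def structural_terms(root):
    """Count (path, term) pairs; the path is that of the element holding the term."""
    vec = Counter()

    def walk(el, path):
        path = f"{path}/{el.tag}"
        for term in (el.text or "").lower().split():
            vec[(path, term)] += 1
        for child in el:
            walk(child, path)

    walk(root, "")
    return vec

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

query = structural_terms(ET.fromstring("<book><title>xml</title></book>"))
doc_a = structural_terms(ET.fromstring("<book><title>xml retrieval</title></book>"))
doc_b = structural_terms(
    ET.fromstring(
        "<book><title>databases</title><abstract>xml retrieval</abstract></book>"
    )
)

sim_a = cosine(query, doc_a)  # "xml" occurs in the title, as requested
sim_b = cosine(query, doc_b)  # "xml" occurs only in the abstract
```

With plain term dimensions both documents would match the keyword. With structured dimensions only the first one does, because the second document contains the term under a different path.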