Vector space model

Vector space retrieval (also known as the vector space model, VSM) is a method of information retrieval in which documents and queries are represented as points in a high-dimensional metric vector space. Retrieval is performed by evaluating the mathematical distance between the query vector and the document (or information) vector. The vector space model was first implemented in the SMART system, which was developed under the direction of Gerard Salton at Cornell University.

Simplified description

In simplified terms, the model underlying this form of information retrieval can be pictured as follows: each word is assigned to its own dimension. To determine the point of a document (or a query) in this vector space, a very simple variant of the vector space model counts, for example, how many times each word occurs in the document. The point of the document in the vector space (the document vector) then corresponds to the frequencies of these words. For example, a document consisting of the sentence "The explosion destroyed the vegetation" could be described by the vector (0, ..., 2, ..., 1, ..., 1, ..., 1, ...): the word "the" occurs twice, while "explosion", "destroyed" and "vegetation" occur once each; all other words do not occur (0 times).
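The counting variant described above can be sketched in a few lines of Python. This is only an illustration: the helper names tokenize and count_vector and the five-word vocabulary are assumptions made for the example, and a real collection would have one dimension for every distinct word.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def count_vector(text, vocabulary):
    """Return the term-count vector of `text` over a fixed, ordered vocabulary."""
    counts = Counter(tokenize(text))
    # A Counter returns 0 for words that do not occur in the text.
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary: each entry corresponds to one dimension.
vocabulary = ["acid", "destroyed", "explosion", "the", "vegetation"]

doc_vector = count_vector("The explosion destroyed the vegetation", vocabulary)
print(doc_vector)  # [0, 1, 1, 2, 1]
```

The word "the" contributes 2 to its dimension, the three other words of the sentence contribute 1 each, and the unused dimension "acid" stays at 0.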

Queries can be encoded in the same way; a fictitious query "Destroyed the explosion the vegetation?" has the same word distribution and therefore corresponds to exactly the same (query) vector (0, ..., 2, ..., 1, ..., 1, ..., 1, ...). The problem of finding the documents that match a query as closely as possible can therefore be solved in the vector space model by looking for those documents whose vectors are as similar as possible to the query vector. A simple approach is, for example, to find document vectors that lie parallel to the query vector or differ from it only by a small angle.
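One common way to measure the angle between two vectors is the cosine of that angle, which is 1 for parallel vectors and 0 for vectors that share no words. The following is a minimal sketch; the function name cosine_similarity and the example vectors over a hypothetical five-word vocabulary are assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector has no direction; treat it as dissimilar
    return dot / (norm_a * norm_b)

# Hypothetical dimensions: acid, destroyed, explosion, the, vegetation
query = [0, 1, 1, 2, 1]
same_words = [0, 1, 1, 2, 1]   # identical word distribution
doubled = [0, 2, 2, 4, 2]      # same text repeated twice: parallel vector
unrelated = [3, 0, 0, 0, 0]    # shares no words with the query

print(round(cosine_similarity(query, same_words), 3))  # 1.0
print(round(cosine_similarity(query, doubled), 3))     # 1.0
print(round(cosine_similarity(query, unrelated), 3))   # 0.0
```

Because only the direction of the vectors matters, a document that repeats the same sentence twice is judged just as similar to the query as the original.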

In practice, vector space models are considerably more complex and take into account, for example, how frequently individual words occur. Words like "the" or "is" occur in almost every English document and are therefore not very meaningful, whereas words such as "deoxyribonucleic acid" are rarer and thus potentially better suited to distinguishing the content of one document from that of others.
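One widely used weighting scheme that captures this idea is tf-idf, which multiplies the count of a word in a document by the logarithm of the inverse fraction of documents containing it. The text above does not prescribe a particular scheme, so the following is a sketch under that assumption; the function name tf_idf_vectors and the three toy documents are invented for the example.

```python
import math
from collections import Counter

def tf_idf_vectors(tokenized_docs):
    """Return (vocabulary, vectors) with weight = count * log(N / document frequency)."""
    n_docs = len(tokenized_docs)
    vocabulary = sorted({term for doc in tokenized_docs for term in doc})
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {term: math.log(n_docs / doc_freq[term]) for term in vocabulary}
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        vectors.append([counts[term] * idf[term] for term in vocabulary])
    return vocabulary, vectors

# Hypothetical, already tokenized toy collection.
docs = [
    ["the", "explosion", "destroyed", "the", "vegetation"],
    ["the", "acid", "is", "rare"],
    ["the", "vegetation", "is", "green"],
]
vocabulary, vectors = tf_idf_vectors(docs)
weights = dict(zip(vocabulary, vectors[0]))  # weights of the first document

for term in ("the", "vegetation", "explosion"):
    print(term, round(weights[term], 3))
```

In this toy collection "the" occurs in every document and receives weight 0, while "explosion", which occurs in only one document, receives the largest weight of the three printed terms.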

Method

To enable vector space retrieval, some preparatory work is needed. The first step consists of constructing a document vector space and indexing the documents, so that each document in the document set is mapped to exactly one point (its document vector) in the document vector space. For this purpose there is a variety of feature weighting models, all of which are based on the frequency of features such as terms, lemmas or n-grams in the individual documents as well as in the entire document set.

Retrieval in the vector space model begins with query indexing, in which the query is mapped to a vector in the vector space. The subsequent retrieval function determines the subset of document vectors that have a certain similarity to the query vector, and the ranking function maps this subset to an ordered list of document vectors. The user who issued the query is then presented with a list of documents that corresponds to this list of document vectors.
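The steps of document indexing, query indexing, retrieval function and ranking function can be sketched end to end as follows. This is a simplified illustration that uses raw term counts and cosine similarity; the function names, the similarity threshold of 0.1 and the three sample documents are assumptions, not part of any particular system.

```python
import math
import re
from collections import Counter

def index(text):
    """Map a text to a sparse term-count vector (term -> count)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity of two sparse vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(documents):
    """Preparatory step: map every document to exactly one document vector."""
    return {doc_id: index(text) for doc_id, text in documents.items()}

def search(query, doc_vectors, threshold=0.1):
    query_vector = index(query)                      # query indexing
    scored = [(doc_id, cosine(query_vector, vec)) for doc_id, vec in doc_vectors.items()]
    hits = [hit for hit in scored if hit[1] >= threshold]            # retrieval function
    return sorted(hits, key=lambda hit: hit[1], reverse=True)        # ranking function

# Hypothetical document set.
documents = {
    "d1": "The explosion destroyed the vegetation",
    "d2": "The vegetation recovered after the rain",
    "d3": "Deoxyribonucleic acid stores genetic information",
}
doc_vectors = build_index(documents)
results = search("explosion vegetation", doc_vectors)
print([doc_id for doc_id, score in results])  # ['d1', 'd2']
```

Document d3 shares no words with the query and is filtered out by the retrieval function, while d1 matches both query words and is therefore ranked ahead of d2.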
