Latent Semantic Analysis

Latent Semantic Indexing (LSI for short) is a (patented) method of information retrieval, first described by Deerwester et al. in 1990. A method such as LSI is of particular interest for searching large volumes of data such as the Internet. The goal of LSI is to find the main components of documents. These main components (concepts) can be thought of as general terms. Horse, for example, is a concept that includes terms such as nag, jade, or steed. This method is therefore capable of finding, among a large number of documents (such as those on the Internet), the ones that are actually about cars, even if the word car does not appear in them explicitly. In addition, LSI can help distinguish articles that are genuinely about cars from those in which the word car is merely mentioned (such as sites where a car is touted as a prize).

Mathematical Background

The name LSI refers to the fact that the term frequency matrix (hereinafter: TD matrix) is approximated by means of the singular value decomposition. In the process, a dimension reduction to the semantic units (concepts) of a document is performed, which simplifies the subsequent computations.

LSI is an additional method built on top of vector space retrieval. The TD matrix known from that model is also used by LSI, in order to reduce it. This is useful in particular for larger document collections, since TD matrices are generally very large. The TD matrix is decomposed by means of the singular value decomposition. Then "unimportant" parts of the TD matrix are cut off. This reduction saves complexity and computation time during the retrieval step (comparison of documents or queries).

At the end of the algorithm stands a new, smaller TD matrix in which the terms of the original TD matrix have been generalized to concepts.
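
A minimal sketch of this reduction, assuming numpy is available (the matrix contents and the choice k = 2 are purely illustrative):

```python
import numpy as np

# Toy term-document (TD) matrix: rows = terms, columns = documents.
A = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 3],
    [0, 0, 3, 1],
], dtype=float)

# Singular value decomposition: A = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values: A_k is the rank-k
# approximation of A; the k-dimensional document representations
# (the "generalized" TD matrix) are the columns of diag(s[:k]) @ Vt[:k, :].
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(A_k, 2))
```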

The Semantic Space

The TD matrix is split by the singular value decomposition into matrices of its singular vectors and singular values. The idea is that the TD matrix (which represents the documents) consists of important dimensions (words important for the meaning of a document) and less important dimensions (words relatively unimportant for the meaning of a document). The former are retained, while the latter can be neglected. Here, too, concepts are formed, namely by grouping words that are similar in meaning (in the ideal case, synonyms). LSI thus generalizes the meaning of words. The usual number of dimensions can thereby be reduced significantly, since only the concepts are compared. If the examined documents (texts) consisted of the four words horse, rider, door, and gate, then horse and rider would be merged into one concept, and door and gate into another. The number of dimensions is thereby reduced from 4 (in the original TD matrix) to 2 (in the generalized TD matrix). One can easily imagine that with large TD matrices the savings are enormous in favorable cases. This dimensionally reduced, approximating TD matrix is referred to as the semantic space.
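
A tiny sketch can make this merging visible, again assuming numpy (the co-occurrence counts are invented so that horse/rider and door/gate each appear together):

```python
import numpy as np

# Rows: horse, rider, door, gate; columns: four short documents.
terms = ["horse", "rider", "door", "gate"]
A = np.array([
    [3, 2, 0, 0],  # horse
    [2, 3, 0, 0],  # rider
    [0, 0, 2, 1],  # door
    [0, 0, 1, 2],  # gate
], dtype=float)

U, s, Vt = np.linalg.svd(A)

# The two strongest left singular vectors span the two concepts:
# horse and rider load on one dimension, door and gate on the other.
for term, loading in zip(terms, np.round(U[:, :2], 2)):
    print(term, loading)
```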

Algorithm

  • The term-document matrix is computed and, if appropriate, weighted, for example by tf-idf.
  • The term-document matrix A is then decomposed into three components (singular value decomposition): A = U Σ V^T, where U contains the left singular vectors, Σ the singular values on its diagonal, and V^T the right singular vectors.
  • The dimension reduction can now be controlled via the singular values in the generated matrix Σ. This is done by successively eliminating the smallest singular value, up to an unspecified threshold, so that only the k largest remain.
  • To process a query, it is mapped into the semantic space; a query is regarded as a special case of a document. The (possibly weighted) query vector q is mapped using the formula q_k = Σ_k^{-1} U_k^T q.
  • Each document is represented in the semantic space in the same way. The query can then be compared with a document, for example via the cosine similarity or the dot product (a runnable sketch of all five steps follows this list).
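
The five steps can be sketched end to end as follows. This is a minimal illustration, not a reference implementation: numpy is assumed, and the tiny vocabulary, the counts in A, and the query vector q are all invented for demonstration.

```python
import numpy as np

# Step 1: term-document matrix (rows = terms, columns = documents),
# here with raw counts; tf-idf weighting could be used instead.
A = np.array([
    [3, 0, 2, 0],
    [2, 0, 3, 0],
    [0, 2, 0, 1],
    [0, 1, 0, 2],
], dtype=float)

# Step 2: singular value decomposition A = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Step 3: keep only the k largest singular values.
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Each document's coordinates in the semantic space are the
# corresponding column of Vt_k.
docs_k = Vt_k

# Step 4: map the query into the semantic space: q_k = Σ_k^{-1} U_k^T q.
q = np.array([1, 1, 0, 0], dtype=float)  # query uses the first two terms
q_k = (U_k.T @ q) / s_k

# Step 5: rank documents by cosine similarity to the query.
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

scores = [cosine(q_k, docs_k[:, j]) for j in range(docs_k.shape[1])]
print(np.round(scores, 3))  # documents 0 and 2 share the query's concept
```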

Advantages and disadvantages of the method

The semantic space (i.e., the TD matrix reduced to meanings) reflects the structure underlying the documents, namely their semantics. The approximate position that a document holds in the vector space of vector space retrieval is preserved. The projection via the singular values then performs the assignment to the concepts (steps 4 and 5 of the algorithm). Latent Semantic Indexing elegantly solves the synonym problem, but only partially that of polysemy, i.e., that the same word can have different meanings.

The algorithm is computationally very intensive: the complexity of the singular value decomposition grows with the number of documents, the number of terms, and the number of dimensions. This problem can be circumvented by using the Lanczos method to economically compute a TD matrix that is reduced from the outset. The singular value decomposition must also be repeated whenever new terms or documents are added. Another problem is the dimension problem: to how many dimensions should the term-document matrix be reduced, i.e., how large should k be?
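
Regarding the Lanczos remark above: iterative solvers can compute only the k largest singular triplets of a sparse TD matrix without ever forming the full decomposition. A minimal sketch, assuming SciPy (whose svds routine uses a Lanczos-type ARPACK solver by default; the matrix contents are again invented):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Sparse TD matrix: real collections are large and mostly zeros.
A = csr_matrix(np.array([
    [2, 0, 1, 0, 0],
    [1, 0, 2, 0, 0],
    [0, 3, 0, 1, 0],
    [0, 1, 0, 2, 1],
], dtype=float))

# Compute only the k largest singular values and vectors; the full
# decomposition of A is never formed.
k = 2
U_k, s_k, Vt_k = svds(A, k=k)

print(np.round(s_k, 3))  # the k largest singular values (ascending order)
```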
