Information Retrieval

Information Retrieval [ˌɪnfɚˈmeɪʃən ɹɪˈtɹiːvəl] (IR), occasionally and imprecisely called information acquisition, is a field that deals with computer-assisted searching for complex content (i.e., not single words) and falls within information science, computer science, and computational linguistics. As the meaning of the word retrieval (recovery, getting back) suggests, complex text or image data stored in large databases are initially not accessible or available to outsiders. Information retrieval is about finding existing information, not about discovering new structures (as in knowledge discovery in databases, to which data mining and text mining belong).


Scope

Information retrieval methods are used in Internet search engines (e.g., Google), in digital libraries (e.g., for literature searches), and in image search engines. Question-answering systems and spam filters also use IR techniques.

The difficulty of accessing complex stored information lies in two phenomena: the vagueness of queries (users can express what they are looking for only approximately) and the uncertainty of the system's knowledge (document representations capture content only incompletely).

In general, two (possibly overlapping) groups of people are involved in IR.

The first group consists of the authors of the information stored in an IR system. They may store the information in the system themselves, or it may be read in from other information systems (as Internet search engines do, for example). The documents placed in the system are converted by the IR system, according to its internal model for representing documents, into a form suitable for processing (the document representation).

The second group, the users, have specific goals or tasks, acute at the time they work with the IR system, for whose solution they lack information. Users want to satisfy this information need with the help of the system. To do so, they must formulate their information needs as queries in an adequate form.

The form in which the information need is formulated depends on the document representation model used. How the process of expressing the information need unfolds as interaction with the system (e.g., as the simple entry of search terms) is determined by the interaction model.

Once the queries are formulated, it is the task of the IR system to compare them, via the document representations, with the documents placed in the system and to return a list of documents matching the query. The user then faces the task of judging the retrieved documents for relevance to the task at hand. The result is a set of judgments on the documents.

The users then have three options:

  • They can (usually only within a narrow scope) modify the document representations (for example, by assigning new keywords to index a document).
  • They can refine their formulated queries (mostly to narrow the search result further).
  • They can change their information need, because after performing the search they find that they need additional information, not previously classed as relevant, to solve their task.

The exact interplay of these three forms of modification is determined by the interaction model. For example, there are systems that support the user in reformulating the query by rewriting it automatically on the basis of explicit document judgments (that is, assessments communicated to the system by the user in some way).
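For illustration, the following is a minimal sketch of one classic technique for such automatic query reformulation, Rocchio relevance feedback; the term space and weights are invented for the example and are not taken from any particular system.

```python
import numpy as np

def rocchio(query_vec, relevant_docs, nonrelevant_docs,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward judged-relevant documents and
    away from judged-non-relevant ones (classic Rocchio feedback)."""
    q = alpha * query_vec
    if relevant_docs:
        q += beta * np.mean(relevant_docs, axis=0)
    if nonrelevant_docs:
        q -= gamma * np.mean(nonrelevant_docs, axis=0)
    return np.clip(q, 0.0, None)  # negative term weights are usually dropped

# Toy term space: ["chocolat", "depp", "java", "island"]
query = np.array([1.0, 1.0, 0.0, 0.0])
relevant = [np.array([0.9, 0.8, 0.0, 0.0])]
nonrelevant = [np.array([0.0, 0.0, 0.7, 0.6])]
print(rocchio(query, relevant, nonrelevant))
```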

History

The term " information retrieval " was first used in 1950 by Calvin N. Mooers. Vannevar Bush described in an article in 1945 how they could revolutionize the use of existing knowledge through the application of knowledge storing. His vision was called Memex. This system should store all kinds of knowledge of transport and enable via links targeted search, browse for documents. Bush was already thinking about the use of search and retrieval tools. A decisive boost received the information science through the Sputnik shock. The Russian satellite held the Americans on the one hand their own backwardness in space research in mind, which was successfully removed by the Apollo program. On the other hand - and that was the key point for Information Science - it took half a year, the signal code of the Sputnik to crack. And, although the decryption code was read in a Russian magazine a long time ago, which was already in American libraries. More information does not lead to more informed. On the contrary. The so-called Weinberg report is a given by the President report commissioned for this problem. The vineyard -Report tells of an "information explosion" and declares that experts are needed that can solve such information explosion. So information scientists. Hans Peter Luhn worked in the 1950s to text statistical methods, which constitute a basis for automatically summarizing and indexing. His goal was to create individual profiles and information highlight search terms. The idea of ​​the push service was born. Eugene Garfield worked in the 1950s to Zitierindices in order to reflect the various routes of transmission of information in magazines. To this end he copied contents. In 1960 he founded the Institute for Scientific Information (ISI ), one of the first commercial retrieval systems.

Germany

In Germany, Siemens developed the two systems GOLEM (a large-memory-oriented, list-organized discovery method) and PASSAT (a program for automatic selection of keywords from texts). PASSAT excludes stop words, reduces word forms to stems using a dictionary, and weights the search terms.

Information science has been considered an established field since the 1960s.

Early commercial information services

DIALOG is an interactive human-machine system developed by Roger K. Summit. It is business-oriented and went online in 1972 with the government databases ERIC and NTIS. The ORBIT project (now Questel-Orbit) was driven by research and development under the direction of Carlos A. Cuadra. In 1962 the retrieval system CIRC went online, and various test runs took place under the code name COLEX. COLEX is the direct precursor of ORBIT, which went online in 1967 with a focus on research for the U.S. Air Force. Subsequently the focus shifted to medical information: in 1974 the search system MEDLINE went online for the bibliographic medical database MEDLARS. OBAR is a project launched in 1965 by the Ohio Bar Association. It culminated in the LexisNexis system and focuses on legal information. The system is based on full-text search, which works well for the Ohio court decisions.

Search Tools on the World Wide Web

With the Internet, information retrieval became a mass phenomenon. A precursor was the WAIS system, widespread from 1991, which enabled distributed retrieval on the Internet. The early web browsers NCSA Mosaic and Netscape Navigator supported the WAIS protocol before Internet search engines emerged and, later, went on to index non-HTML documents as well. Among the best-known and most popular search engines today are Google and Bing. Common search engines for intranets are Autonomy, Convera, FAST, and Verity, as well as the open-source software Apache Lucene.

Basic concepts

Information needs

The need for information is the need for action-relevant knowledge; it can be concrete or problem-oriented. A concrete information need calls for factual information, for example: "What is the capital of France?" The answer "Paris" satisfies the information need completely. A problem-oriented information need is different: several documents are required to satisfy it, and it is never completely satisfied. The information received may even give rise to a new need or to a modification of the original one. The information need is considered in abstraction from the individual user; that is, only the objective facts of the matter are considered.

Need for information

The need for information reflects the concrete requirements of the requesting user; it is the user's subjective need.

Information Indexing and Information Retrieval

To formulate a query as precisely as possible, one would actually have to know what one does not know. Adequate prior knowledge must therefore be present in order to submit a search query at all. In addition, the natural-language query must be converted into a form that the retrieval system can read. Here are some examples of query formulations in various databases; we are looking for information about the actor "Johnny Depp" in the film "Chocolat".

LexisNexis: HEADLINE("Johnny Depp" w/5 "Chocolat")

DIALOG: (Johnny ADJ Depp AND Chocolat)/ti

Google: " Chocolat " " Johnny Depp "

The user is assumed to understand how the retrieval process works in the system at hand, since the formulation of the query depends on it. A distinction is made between word-oriented and concept-oriented systems. Concept-oriented systems can recognize the ambiguities of words (e.g., Java the island, Java the coffee, Java the programming language). The query addresses the documentation unit (DE). The DE represents the informational added value of the documents; that is, the DE condenses information such as the author, the year of publication, and so on. Depending on the database, either the entire document or only parts of it are covered, for example a book as a whole or only the individual chapters of that book.

Documentary reference unit and documentation unit

Neither the documentary reference unit (DBE) nor the documentation unit (DE) is the original document; both merely represent it in the database. First, a document's worthiness of documentation is checked against catalogs of formal and content criteria. If an object is found worthy of documentation, a DBE is created. This determines the form in which the document is stored: are individual chapters or pages taken as the DBE, or the document as a whole? The practical information process follows: the DBE is formally described and its content condensed. This informational added value is then found in the DE, which serves as a surrogate for the DBE. The DE represents the DBE and thus stands at the end of the documentation process. The DE helps the user decide whether or not to request and use the DBE. Information indexing and information retrieval are thus coordinated with each other.

Cognitive models

These are part of empirical information science, since they concern the knowledge, socio-economic background, language skills, etc. of the users, and they comprise analyses of information needs, usage, and users.

Pull and push services

Marcia J. Bates describes searching for information as berry picking: it is not enough to look at just one bush (one database, that is) to fill the basket with berries (information). Several databases must be queried, and the query must be modified continually on the basis of new information. Pull services are provided wherever the user can actively search for information. Push services supply the user with information on the basis of a stored profile. These profile services, so-called alerts, save successfully formulated queries and notify the user when new relevant documents arrive.

Information barriers

Various factors hinder the flow of information, such as time, place, language, laws, and funding.

Recall and Precision

Recall refers to the completeness of the hits displayed; precision measures the accuracy of the hits returned for a query. Precision is the proportion of relevant documents among all documents retrieved by a query, and is thus the measure of how many documents in the hit list are meaningful for the task. Recall is the proportion of relevant documents retrieved out of all relevant documents in the document collection; it is the measure of the completeness of a hit list. Both measures are crucial metrics for an information retrieval system. An ideal system would retrieve all relevant documents in a document collection while excluding all non-relevant documents.

Recall: $r = \frac{A}{A + C}$

Precision: $p = \frac{A}{A + B}$

A = retrieved, relevant DEs (hits)

B = retrieved, non-relevant DEs (ballast)

C = relevant DEs that were not retrieved (loss)

"C" is not directly measurable, since one can not know how many GB can not be found, if one does not know the contents of the database or the DE, which actually should have been displayed due to the search query. The recall can be increased at the expense of precision and vice versa. This is not true for a fact question. Here are recall and precision equal to one.

Relevance and pertinence

Knowledge can be relevant without being pertinent. Relevance means that a document was returned in response to the query that was formulated. But if the user already knows the text, or does not want to read it because they dislike the author or have no desire to read an article in another language, the document is not pertinent. Pertinence refers to the subjective view of the user.

Prerequisites for successful information retrieval are the right knowledge at the right time, in the right place, in the right amount, in the right form, and of the right quality, where "right" means that this knowledge is either pertinent or relevant.

Usefulness

Knowledge is useful when the user generates new action-relevant knowledge from it and puts that knowledge into practice.

Aspects of relevance

Relevance is the relation between the query and the topic, together with system-side aspects.

Binary approach

The binary approach holds that a document is either relevant or non-relevant. In reality this is not necessarily true; it is more accurate to speak of regions of relevance.

Relevance distributions

Chains can be formed for topics, for example, and a topic can occur in several chains. The more frequently a topic occurs, the greater its weighting value: if the topic occurs in all chains, its value is 100; if it occurs in no chain, 0. Investigations have revealed three different distributions. Note that these distributions emerge only with larger document sets; for smaller document sets there are no such regularities.

Binary distribution

With a binary distribution, no relevance ranking is possible.

Inverse logistic distribution

$y(x) = \frac{1}{1 + e^{x - a}}$

  • $x$: rank position
  • $e$: Euler's number
  • $a$: constant

Informetric distribution

$y(x) = \frac{C}{x^a}$

  • $x$: rank position
  • $C$: constant
  • $a$: a concrete value between 1 and 2

The informetric distribution thus says: if the top-ranked document has a relevance of 1 (in the case $C = 1$), then the second-ranked document has a relevance of 0.5 (for $a = 1$) or 0.25 (for $a = 2$).
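A short sketch of how relevance falls off by rank under the informetric distribution, using the constants described above:

```python
def informetric_relevance(rank, c=1.0, a=1.0):
    """Relevance of the document at a given rank: y = C / x**a."""
    return c / rank ** a

# For a = 1 the second-ranked document scores 0.5; for a = 2 it scores 0.25.
print([round(informetric_relevance(r, a=1), 3) for r in range(1, 6)])
# [1.0, 0.5, 0.333, 0.25, 0.2]
print([round(informetric_relevance(r, a=2), 3) for r in range(1, 6)])
# [1.0, 0.25, 0.111, 0.062, 0.04]
```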

Documents

It should be noted again that information science distinguishes between the source document, the DBE, and the DE. But when is "something" actually a document? Four criteria decide this: materiality (including digital presence), intentionality (the document carries a certain sense, a meaning), processing, and perception.

Textual and non-textual objects

Objects can occur in text form, but they do not have to. Images and films are examples of non-textual documents. Textual and non-textual objects can occur in digital and non-digital form. If they are digital and more than two media forms come together (a document consisting, for example, of a video sequence, an audio sequence, and images), they are called multimedia. Non-digital objects held in the database need a digital surrogate, such as a photo.

Formally published text documents

Formally published text documents are all documents that have gone through a formal publication process, meaning they were checked before release (e.g., by an editor). A problem is posed by the so-called "gray literature," which is checked but not published.

There are several levels of formally published documents. It begins with the work, the author's creation, followed by the expression of this work, its concrete realization (e.g., different translations). This realization is manifested (for example, in a book). At the lowest point of this chain is the item, the single copy. In general the DBE is oriented toward the manifestation; exceptions, however, are possible.

Informally published texts

Informally published texts mainly include documents that were published on the Internet. These documents are published, but they are not checked.

Unpublished texts

These include letters, invoices, internal reports, and documents on intranets or extranets; in short, all documents that were never made public.

Non -textual documents

Non-textual documents fall into two groups: on the one hand, documents that exist digitally or can be digitized, such as films, photos, and music; on the other, non-digital and non-digitizable documents. The latter include facts, such as chemical substances with their properties and reactions, patients and their symptoms, and museum objects. Most non-digitizable documents come from the disciplines of chemistry, medicine, and economics. They are represented in the database by the DE and often additionally illustrated by pictures, videos, and audio files.

Typology of retrieval systems

Structure of texts

A distinction is made between structured, weakly structured, and unstructured texts. Weakly structured texts include all types of text documents that show a certain structure, such as chapter numbers, titles, subheadings, illustrations, and page numbers. Structured data is added to the texts through informational added values. Unstructured texts hardly ever occur in reality. Information science is mainly concerned with weakly structured texts. Note that only formal structures are meant here, not syntactic ones; this creates a problem with the semantic context of the content.

"The man saw the pyramid on the hill with the telescope. " This sentence can be interpreted in quadruplicate. Therefore, some providers prefer human indexers, since they can recognize the meaning of the context and process it correctly.

Information retrieval systems can operate either with or without terminological control. With terminological control, indexing can be done either intellectually or automatically. Retrieval systems that work without terminological control either process the plain text, or the process relies on automatic processing.

Retrieval systems and terminological control

Terminological control means nothing more than the use of a controlled vocabulary. This is achieved through documentation languages (classifications, keyword methods, thesauri, ontologies). The advantage is that the searcher and the indexer have the same terms and formulation options available, so no problems with synonyms and homonyms arise. Disadvantages of controlled vocabularies include the failure to keep pace with language development, as well as the problem that these artificial languages are not applied correctly by every user. A further factor is, of course, the price: intellectual indexing is much more expensive than automatic indexing.

In total, four cases can be distinguished: with or without terminological control, each combined with intellectual or automatic indexing.

In the variant without terminological control, it is best to work with the full texts. This only works for very small databases, and the users must know the terminology of the documents precisely. Terminological control presupposes information-linguistic processing (NLP, natural language processing) of the documents.
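As a minimal sketch of terminological control, the following hypothetical thesaurus maps synonyms to a single preferred descriptor, so that indexer and searcher meet in the same vocabulary:

```python
# Hypothetical controlled vocabulary: every entry term points
# to exactly one preferred term (descriptor).
THESAURUS = {
    "auto": "automobile",
    "car": "automobile",
    "motorcar": "automobile",
    "automobile": "automobile",
}

def control(term: str) -> str:
    """Map a free-text term to its descriptor; unknown terms pass through."""
    return THESAURUS.get(term.lower(), term.lower())

# Indexing and searching now agree: both "car" and "auto"
# address the same descriptor, so no synonym problem arises.
assert control("Car") == control("auto") == "automobile"
```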

Information Linguistic Text Processing

Information-linguistic text processing proceeds as follows. First, the writing system is detected: is it, for example, a Latin or an Arabic writing system? Language recognition follows. Next, text, layout, and navigation are separated. At this point there are two options: the decomposition of words into n-grams, or word recognition. Whichever method is chosen, stop-word marking, input-error detection and correction, named-entity recognition, and the formation of base or stem forms follow. Compound words are decomposed, homonyms and synonyms are detected and compared, and the semantic environment is examined for similarity. The last two steps are the translation of the document and anaphora resolution. The system may need to interact with the user during this process.
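A compressed sketch of a few of these steps (tokenization, stop-word marking, stemming, n-gram decomposition); the tiny stop-word list and the crude suffix stemmer are stand-ins for real linguistic resources:

```python
import re

STOP_WORDS = {"the", "a", "of", "and", "in"}  # stand-in for a real stop list

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-zäöüß]+", text.lower())

def stem(word: str) -> str:
    """Very crude suffix stripping, standing in for dictionary-based stemming."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def ngrams(word: str, n: int = 3) -> list[str]:
    """Alternative to word recognition: decompose a word into character n-grams."""
    padded = f"_{word}_"
    return [padded[i : i + n] for i in range(len(padded) - n + 1)]

tokens = [stem(t) for t in tokenize("The crawling of indexed pages")
          if t not in STOP_WORDS]
print(tokens)          # ['crawl', 'index', 'page']
print(ngrams("page"))  # ['_pa', 'pag', 'age', 'ge_']
```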

Retrieval models

There are several competing retrieval models, but they by no means have to be mutually exclusive. These models include the Boolean and the extended Boolean model. The vector space model and the probabilistic model are based on text statistics. The link-topological models include the Kleinberg algorithm and PageRank. Finally, there are the network model and the user/usage models, which take into account how the text is used and the user's specific location.

Boolean model

George Boole published his Boolean logic, with its binary view of things, in 1854. His system has three functions or operators: AND, OR, and NOT. In this system, no sorting by relevance is possible. To enable relevance ranking, the Boolean model was extended with weighting values, and the operators had to be reinterpreted.
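A minimal sketch of the Boolean model over a hypothetical inverted index; note that the result is an unordered set, which is why no relevance ranking is possible:

```python
# Hypothetical inverted index: term -> set of document IDs.
INDEX = {
    "java":   {1, 3, 4},
    "island": {3, 5},
    "coffee": {2, 4},
}
ALL_DOCS = {1, 2, 3, 4, 5}

def docs(term):
    return INDEX.get(term, set())

# (java AND island) OR (java AND NOT coffee)
hits = (docs("java") & docs("island")) | (docs("java") & (ALL_DOCS - docs("coffee")))
print(sorted(hits))  # [1, 3] -- an unordered set of hits, no ranking
```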

Text statistics

In text statistics, the terms occurring in the documents are analyzed. The weighting factors here are called WDF and IDF.

Within-document frequency (WDF): $\mathrm{WDF} = \frac{\text{number of occurrences of the term in the document}}{\text{total number of words in the document}}$

The WDF describes the frequency of a word within a document: the more frequently a word occurs in a document, the greater its WDF.

Inverse document frequency (IDF): $\mathrm{IDF} = \frac{\text{total number of documents in the database}}{\text{number of documents containing the term}}$

The IDF describes how frequently documents containing a specific term occur in the database: the more documents in the database contain the term, the smaller the IDF.

The two classical models of text statistics are the vector space model and the probabilistic model. In the vector space model, n terms span an n-dimensional space; the similarity of documents to one another is calculated from the angle between their vectors. In the probabilistic model, the probability that a document is relevant to a query is calculated. Without additional information, the probabilistic model behaves similarly to the IDF.
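A minimal vector space sketch with hypothetical term weights: documents and query are term vectors, and similarity is the cosine of the angle between them:

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Term space: ["retrieval", "crawler", "ranking"]; the weights are
# hypothetical WDF*IDF values, not taken from a real collection.
doc1 = [0.8, 0.0, 0.3]
doc2 = [0.1, 0.9, 0.0]
query = [1.0, 0.0, 0.5]
print(cosine(query, doc1))  # high: small angle, similar content
print(cosine(query, doc2))  # low: nearly orthogonal vectors
```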

Link-topological models

Documents on the WWW link to one another and thus form a space of links. The Kleinberg algorithm calls pages with outbound links "hubs" and pages with inbound links "authorities." The weighting values arise from how well hubs point to "good" authorities and how well authorities are linked from "good" hubs. Another link-topological model is PageRank by Sergey Brin and Lawrence Page. It describes the probability that a user surfing at random arrives at a given page.
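A small power-iteration sketch of PageRank over an invented four-page link graph; the damping factor of 0.85 is the value commonly cited from Brin and Page:

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: page -> list of pages it links to. Returns the probability
    that a randomly surfing user is on each page."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            share = rank[p] / len(outs) if outs else rank[p] / n
            targets = outs if outs else pages  # dangling page: spread evenly
            for t in targets:
                new[t] += damping * share
        rank = new
    return rank

# Hypothetical link graph.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
for page, score in sorted(pagerank(graph).items(), key=lambda x: -x[1]):
    print(page, round(score, 3))
```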

Cluster model

Clustering methods attempt to classify documents so that similar or related documents are gathered into a common document pool. This accelerates the search process, since in the best case all relevant documents can be selected with a single access. Besides document similarity, semantically similar words such as synonyms also play a significant role: a search for the term "word," for instance, should also produce hits for comment, remark, statement, or term.

Problems arise from the way the documents are grouped (a small clustering sketch follows this list):

  • The clusters must be stable and complete.
  • With homogeneous documents, the number of documents in a cluster, and thus the resulting hit list, can become very large for specific documentation needs. In the opposite case the number of clusters grows, in the extreme until there are clusters of only one document.
  • The overlap rate of documents located in more than one cluster is hardly controllable.
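The sketch announced above, grouping hypothetical two-term document vectors with k-means; the source does not prescribe a specific clustering algorithm, so this is just one common choice:

```python
import random

def kmeans(vectors, k, iterations=20, seed=0):
    """Very small k-means: group document vectors into k clusters
    by nearest centroid (squared Euclidean distance)."""
    random.seed(seed)
    centroids = random.sample(vectors, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            i = min(range(k), key=lambda c: sum((a - b) ** 2
                    for a, b in zip(v, centroids[c])))
            clusters[i].append(v)
        centroids = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return clusters

# Hypothetical 2-term document vectors: two obvious topical groups.
docs = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15], [0.1, 0.9], [0.2, 0.8]]
for i, cluster in enumerate(kmeans(docs, k=2)):
    print("cluster", i, cluster)
```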

User/usage model

In the user/usage model, the frequency of use of a website is a ranking criterion. In addition, background information, such as the user's location in geographic queries, flows into the ranking.

Systematic searching gives rise to feedback loops. These run either automatically, or the user is repeatedly prompted to mark results as relevant or non-relevant before the search is modified and repeated.

Surface web and deep web

The surface web is the part of the web that is freely accessible to all users. The deep web contains, for instance, databases whose search interfaces can be reached via the surface web; their information, however, generally costs money. Three types of search engines can be distinguished: search engines such as Google work algorithmically; the Open Directory Project is an intellectually created web catalog; and metasearch engines draw their content from several other search engines, which they query. In general, intellectually created web directories use only the start page of a website as the reference source for the DBE, whereas algorithmically working search engines use every page of a website.

Architecture of a Retrieval System

There are digital and non-digital storage media, such as vertical card files, library catalogs, and peek-a-boo cards. Digital storage media are developed by computer science and are a field of application for information science. A distinction is made between the file structure and its function. In addition, the retrieval system has interfaces to the documents and to its users. The interface between system and documents comprises three areas: searching for documents, called crawling; checking the found documents for updates; and classification in a filing scheme. The documents are captured and processed either intellectually or automatically. The DEs are stored twice: once in a document file and additionally in an inverted file, which, like a register or index, is designed to ease access to the document file. Users and system come into contact in the following way: the user (1) writes a query formulation, (2) receives a hit list, (3) can display the documentation units, and (4) processes them further locally.

Character sets

In 1963 the ASCII code (American Standard Code for Information Interchange) appeared. Its 7-bit code could capture and represent 128 characters; it was later extended to 8 bits (256 characters). The largest Unicode character set comprises 4 bytes, i.e., 32 bits, and is designed to represent all characters in use anywhere in the world. ISO 8859 (from the International Organization for Standardization) also governs language-specific variants, such as the "ß" in German.
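A quick illustration of the encodings just mentioned, using the German "ß":

```python
text = "ß"
print(text.encode("latin-1"))  # ISO 8859-1: a single byte, 0xDF
print(text.encode("utf-8"))    # Unicode/UTF-8: two bytes, 0xC3 0x9F
print(ord(text))               # Unicode code point 223
```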

New documents in the database

New documents can be added to the database both intellectually and automatically. In the intellectual process, an indexer is responsible and decides which documents are included and how. The automatic process is carried out by a "robot" or "crawler." It starts from a known set of web documents, called a "seed list," and the crawler then follows the links of all web pages on this list. The URL of each page is checked to see whether it already exists in the database. In addition, mirrors and duplicates can be detected and deleted.

Crawler

Best-first crawler

One best-first crawler is the PageRank crawler. It sorts links by the number and popularity of the pages linking to them. Two others are the fish-search and shark-search crawlers. The former confines its work to regions of the web in which relevant pages are concentrated. The shark-search crawler refines this method, for example by drawing additional information from anchor texts to form a relevance judgment. Every site operator has the option of closing their site to crawlers.

Crawling the Deep Web

For a crawler to operate successfully in the deep web, it must meet several requirements. For one, it must "understand" the database being searched in order to formulate an appropriate query. Moreover, it must understand hit lists and be able to display documents. This works only in free databases. It is important for deep-web crawlers to be able to formulate search arguments such that all documents in the database are returned. If the search form offers a field for the year, the crawler would query all years one after another to gain access to all documents. For a keyword field, an adaptive strategy makes the most sense. Once the data is collected, the crawler only needs to capture updates to the pages already found. To keep the DEs as current as possible, there are several options: either the pages are revisited regularly at equal intervals, which would far exceed the available resources and is therefore impossible, or they are visited at random, which works rather suboptimally. A third possibility is to prioritize the visits, for example according to the rate at which pages change (page-centered) or the frequency of their views or downloads (user-centered). Further tasks of crawlers are to recognize spam, duplicates, and mirrors. Duplicate detection is generally done by comparing paths. Avoiding spam is somewhat more difficult, since spam is often hidden.

FIFO (first in, first out) crawler

The FIFO crawlers include the breadth-first crawler, which follows all links on a page, processes them, and then follows the links of the pages found there, and the depth-first crawler, which works like the breadth-first crawler in its first step but in the second step selects which links it follows further and which it does not.
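A structural sketch of a breadth-first (FIFO) crawler; the seed list and the link graph standing in for fetched pages are invented:

```python
from collections import deque

# Hypothetical link graph standing in for fetched pages and their links.
FAKE_WEB = {
    "seed.example/a": ["seed.example/b", "seed.example/c"],
    "seed.example/b": ["seed.example/c", "seed.example/d"],
    "seed.example/c": [],
    "seed.example/d": ["seed.example/a"],
}

def fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return FAKE_WEB.get(url, [])

def breadth_first_crawl(seed_list, max_pages=100):
    """FIFO crawling: follow all links of a page before descending further.
    Already-seen URLs are skipped, which also catches simple duplicates."""
    queue = deque(seed_list)
    seen = set(seed_list)
    collected = []
    while queue and len(collected) < max_pages:
        url = queue.popleft()        # first in, first out
        collected.append(url)
        for link in fetch_links(url):
            if link not in seen:     # URL already known to the database?
                seen.add(link)
                queue.append(link)
    return collected

print(breadth_first_crawl(["seed.example/a"]))
```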

Thematic Crawler

Thematic crawlers specialize in one subject area and are therefore suitable for experts. Thematically irrelevant sites are identified and "tunneled"; nevertheless, the links of these tunneled pages are followed in order to find further relevant pages. Distillers, meanwhile, find good starting points for crawling by making use of taxonomies and sample documents; classifiers then assess these pages for relevance. The whole process is semi-automatic, since the taxonomies and sample documents need to be updated regularly. In addition, a concept system is needed.

Storing and indexing

The documents found are copied into the database. For this, two files are created: on the one hand the document file, on the other an inverted file. In the inverted file, all words or phrases are listed in alphabetical order or by another sorting criterion. Whether a word index or a phrase index is used depends on the field: for an author field, for example, the phrase index is much better suited than the word index. Besides information about the position of the words or phrases in the document, the inverted file also holds structural information. Structural information can be useful for relevance ranking: if it indicates, say, that a word was written in a larger font, that word can be weighted more highly. The words and phrases are stored both in the correct order and reversed, which makes left-open truncation possible. The inverted file is stored in a database index.
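A minimal sketch of building an inverted file as a word index over a hypothetical document file, storing positions as the structural information mentioned above:

```python
from collections import defaultdict

# Hypothetical document file: document ID -> text.
DOCUMENT_FILE = {
    1: "information retrieval finds existing information",
    2: "the crawler feeds the retrieval system",
}

def build_inverted_file(document_file):
    """Word index: term -> {doc_id: [positions]}, with terms kept sorted."""
    inverted = defaultdict(lambda: defaultdict(list))
    for doc_id, text in document_file.items():
        for position, word in enumerate(text.split()):
            inverted[word][doc_id].append(position)
    return {term: dict(postings) for term, postings in sorted(inverted.items())}

index = build_inverted_file(DOCUMENT_FILE)
print(index["retrieval"])    # {1: [1], 2: [4]}
print(index["information"])  # {1: [0, 4]}
```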

Classification of retrieval models

IR models can be classified in a two-dimensional matrix. Depending on their position in this matrix, the models exhibit the following properties:

  • Dimension: mathematical foundation. Algebraic models represent documents and queries as vectors, matrices, or tuples, which are transformed into a one-dimensional similarity value by a finite number of algebraic operations that compute pairwise similarities.
  • Set-theoretic models are characterized by mapping natural-language documents onto sets and deriving the similarity of documents (primarily) from the application of set operations.
  • Probabilistic models treat the process of document search and the determination of document similarity as a multi-stage random experiment. To map document similarities, they therefore rely on probabilities and probabilistic theorems (in particular Bayes' theorem).
  • Dimension: properties of the model. Models with immanent term interdependencies take into account interdependencies that exist between terms; in contrast to models without term interdependencies, they do not make the implicit assumption that terms are orthogonal or independent. Models with immanent term interdependencies differ from models with transcendent term interdependencies in that the degree of interdependence between two terms is derived from the document collection in a way designated by the model, that is, inherently (intrinsically). In this class of models, the interdependence between two terms is derived directly or indirectly from the co-occurrence of the two terms, where co-occurrence means the joint occurrence of the two terms in one document. This class of models is therefore based on the assumption that two terms are interdependent if they frequently occur together in documents.
  • Models without term interdependencies are characterized by treating two distinct terms as completely separate and in no way connected. This situation is often referred to in the literature as the orthogonality or the independence of terms.
  • Like the models with intrinsic term interdependencies, the models with transcendent term interdependencies make no assumption of the orthogonality or independence of terms. In contrast to the intrinsic case, however, the interdependencies between terms cannot be derived solely from the document collection and the model; the logic underlying the term interdependencies is modeled as lying beyond (transcending) the model. That is, models with transcendent term interdependencies explicitly model the presence of term interdependencies, but the concrete form of an interdependence between two terms must be supplied from outside (e.g., by a human), directly or indirectly.

Information retrieval has cross-connections to various other fields, such as probability theory and computational linguistics.
