Search engine (computing)

A search engine is a program for searching documents that are stored on a computer or in a computer network such as the World Wide Web. Internet search engines have their origin in information retrieval systems. They create a keyword index for the document base in order to answer keyword queries with a relevance-ordered hit list. After a search term is entered, a search engine returns a list of references to potentially relevant documents, most often shown with the title and a short summary of each document. Various search methods can be applied.

The essential components or tasks of a search engine are:

  • Creation and maintenance of an index (a data structure with information about documents),
  • Processing of queries (finding and ordering of results), and
  • Preparation of the results in a form that is as useful as possible.

In general, data acquisition is done automatically: on the WWW by web crawlers, and on a single computer by regularly reading all files in user-specified directories of the local file system.
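
As an illustration of the crawling step, here is a minimal breadth-first crawler sketch in Python; the start URL, page limit and error handling are assumptions made for the example, and a real crawler would additionally respect robots.txt, politeness delays and much more.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect the href targets of all <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=10):
    """Breadth-first crawl starting at start_url; returns {url: html}."""
    seen, queue, pages = {start_url}, deque([start_url]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue  # skip unreachable pages
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```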

Features of search engines

Search engines can be categorized according to a number of characteristics. The following features are largely independent of one another: when designing a search engine, one can choose an option from each feature group without affecting the choice of the other features.

Type of data

Different search engines can search many types of data. First, these can be roughly divided into "document types" such as text, image, sound, video and others, and the result pages are designed accordingly. When searching for text documents, a piece of text containing the keywords is usually displayed (commonly called a snippet). Image search engines display thumbnails of matching images. A large proportion of all searches on the Internet currently relates to people and their activities. A people search engine finds publicly available information about names and persons and displays it as a list of links. Other specialized types of search engines are, for example, job search engines, industry search engines or product search engines. The latter are mainly used for online price comparisons, but there are also local-search offerings that present products and offers of brick-and-mortar retailers online.

A further level of detail concerns data-specific properties that are not shared by all documents within a type. Staying with the text example, Usenet postings can be searched for contributions by specific authors, and web pages in HTML format can be searched by document title.

Depending on the data type, a further option is to restrict the search to a subset of all data of one kind. This is generally achieved via additional search parameters that exclude a portion of the collected data. Alternatively, a search engine can be limited from the outset to index only the appropriate documents. Examples include a search engine for weblogs (instead of the entire Web) or search engines that process only documents from universities, only documents from a specific country, or only documents in a specific language or file format.

Data Source

Another feature for categorization is the source from which the data collected by the search engine is obtained. Usually the name of the type of search engine already indicates the source.

If the data is collected manually, by submission or by editors, one speaks of a catalog or directory. In such directories, such as the Open Directory Project, the documents are hierarchically organized by topic in a table of contents.

Realization

This section describes differences in how search engines are implemented and operated.

  • Today's most important group are index-based search engines. These read in suitable documents and create an index, a data structure that is used for subsequent queries. The disadvantage is the costly maintenance and storage of the index; the advantage is the acceleration of the search process. The most common form of this structure is an inverted index (see the sketch after this list).
  • Metasearch engines send the search query in parallel to several index-based search engines and combine the individual results. Advantages are the larger amount of data covered and the simpler implementation, since no index needs to be maintained. Disadvantages are the relatively long duration of query processing and the fact that ranking by pure majority vote is of questionable value. In some circumstances the quality of the results is reduced to that of the worst search engine queried. Metasearch engines are especially useful for rarely occurring search terms.
  • Furthermore, there are hybrid forms. These maintain their own, often relatively small index, but also query other search engines and combine the individual results. So-called real-time search engines start the indexing process only after a request has been made. The pages found are therefore up to date, but the quality of the results is poor, especially for less common keywords, because of the lack of a broad database.
  • A relatively new approach are distributed search engines (federated search engines). Here, a search request is transmitted to a number of individual computers, each running its own search engine, and the results are combined. Advantages are the high fault tolerance due to decentralization and, depending on the perspective, the impossibility of central censorship. Difficult to solve, however, is the ranking, i.e. the sorting of the basically matching documents according to their relevance to the query.
  • A special kind of distributed search engine is based on the peer-to-peer principle and builds a distributed index. On each of these peers, independent crawlers detect, in a censorship-resistant manner, the parts of the Web that the respective peer operator defines through simple local configuration. The best-known system, besides some predominantly academic projects (e.g. Minerva), is YaCy, free software under the GNU GPL.
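
The inverted index mentioned above maps each term to the documents that contain it. A minimal sketch in Python, assuming a toy document collection and a naive whitespace tokenizer (both invented for illustration):

```python
from collections import defaultdict

def tokenize(text):
    # Very naive tokenizer: lowercase and split on whitespace.
    return text.lower().split()

def build_inverted_index(documents):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, term):
    """Return the IDs of all documents containing the term."""
    return index.get(term.lower(), set())

# Example usage with a tiny document collection.
docs = {
    1: "search engines build an index",
    2: "a web crawler reads documents",
    3: "the index speeds up the search",
}
index = build_inverted_index(docs)
print(search(index, "index"))    # {1, 3}
print(search(index, "crawler"))  # {2}
```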

Interpretation of the input

A user's query is interpreted before the actual search and converted into a form that the internally used search algorithm can understand. This serves to keep the query syntax as simple as possible while still allowing complex queries. Many search engines support the logical combination of different keywords with Boolean operators, so that websites can be found that contain certain terms but not others.
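
As a rough sketch of how Boolean operators can be evaluated against an inverted index, assuming the index stores one set of document IDs per term (the posting lists below are invented toy data):

```python
# Toy posting lists: term -> set of document IDs (assumed layout).
index = {
    "search": {1, 3},
    "index": {1, 3},
    "crawler": {2},
}

def boolean_and(index, a, b):
    """Documents containing both terms: intersection of posting sets."""
    return index.get(a, set()) & index.get(b, set())

def boolean_or(index, a, b):
    """Documents containing at least one term: union of posting sets."""
    return index.get(a, set()) | index.get(b, set())

def boolean_and_not(index, a, b):
    """Documents containing a but not b: set difference."""
    return index.get(a, set()) - index.get(b, set())

print(boolean_and(index, "search", "index"))        # {1, 3}
print(boolean_and_not(index, "search", "crawler"))  # {1, 3}
```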

A more recent development is the ability of a number of search engines to tap information that is implicitly present in the context of the query itself and to evaluate it in addition. The ambiguities typical of incomplete queries can thus be reduced, and the relevance of the search results (i.e. how well they meet the conscious or unconscious expectations of the searcher) can be increased. From the semantic similarities of the search terms (see also: semantics), one or more underlying meanings of the query are inferred. The result set is then expanded with results for semantically related keywords that were not explicitly entered in the query. This usually leads not only to a quantitative gain but, especially for incomplete queries and poorly chosen search terms, also to a gain in quality (relevance) of the results, because in these cases the rather blurred search intentions are in practice reproduced surprisingly well by the statistical methods used by the search engines (see also: semantic search engine and latent semantic indexing).
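
One greatly simplified way to picture such an expansion is to represent terms as vectors and add semantically close vocabulary terms to the query; the toy vectors and similarity threshold below are pure illustration and do not describe any particular engine's method.

```python
import numpy as np

# Invented toy word vectors; real systems use learned embeddings or
# statistical co-occurrence models (e.g. latent semantic indexing).
vectors = {
    "car":        np.array([0.90, 0.10, 0.00]),
    "automobile": np.array([0.85, 0.15, 0.05]),
    "vehicle":    np.array([0.80, 0.20, 0.10]),
    "banana":     np.array([0.00, 0.10, 0.95]),
}

def cosine(u, v):
    """Cosine similarity between two term vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def expand_query(term, threshold=0.95):
    """Add vocabulary terms whose vectors are close to the query term."""
    q = vectors[term]
    related = [w for w, v in vectors.items()
               if w != term and cosine(q, v) >= threshold]
    return [term] + related

print(expand_query("car"))  # ['car', 'automobile', 'vehicle']
```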

Invisibly supplied information (for example location data and other information in the case of requests from the mobile network) or 'meaning preferences' derived from the user's stored search history are further examples of information that is not explicitly entered in the search terms but is used by several search engines to modify and improve the results.

In addition, there are search engines that can only be queried with strictly formalized query languages; in general, however, these can answer very complex queries very precisely.

A capability of search engines that can so far be realized only rudimentarily or on a limited information base is the processing of natural language and fuzzy queries (see also: semantic web).

Presentation of results

The page on which the search results are returned to the user (sometimes referred to as the search engine results page, SERP for short) is divided by many search engines, often also visually, into the natural listings and the sponsored links. While the latter are included in the search index only against payment, the former list all websites matching the search term.

To make the results easier for the user to use, search engines order them by relevance (main article: search engine ranking), with each search engine applying its own, mostly undisclosed criteria. These include:

  • The fundamental importance of a document, as measured by the link structure (at Google, the PageRank value; see the sketch after this list).
  • Frequency and position of keywords in each document found.
  • Classification and number of cited documents.
  • Frequency of references from other documents to the document contained in the search result, as well as the references contained in the document itself.
  • Classification of the quality of the referencing documents (a link from a "good" document is worth more than a reference from a mediocre document).
  • Citation of the document in other collections of links that are considered to be trustworthy, such as Dmoz.
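
The link-based importance measure in the first point can be illustrated with the classic power-iteration formulation of PageRank. This is a textbook sketch, not Google's production algorithm; the damping factor, iteration count and toy link graph are assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a directed link graph given as
    {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:
                # Dangling page: distribute its rank evenly to all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Toy link graph: A and C both link to B, B links back to A.
print(pagerank({"A": ["B"], "B": ["A"], "C": ["B"]}))
```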

Search behavior of users

Search engines provide access to a large amount of different information. In this respect, searches can be divided into three types: navigational, informational and transactional queries.

Challenges for search engines and their operators

[Figure: Distribution of the use of search engines in Germany]
