Deep Web

The Deep Web (also called the Hidden Web or Invisible Web) refers to the part of the World Wide Web that cannot be found through a search with normal search engines. In contrast to the Deep Web, the websites that are accessible through search engines are called the Visible Web or Surface Web. The Deep Web consists largely of topic-specific databases (specialist databases) and of web pages that are generated dynamically from databases only in response to specific queries (see for example Gallica). In summary, it is content that is not freely accessible and/or content that is not indexed by search engines.

Properties

According to a study (Bergman 2001) by the company BrightPlanet, the Deep Web has the following properties:

  • The volume of data in the Deep Web is about 400 to 550 times greater than that of the Surface Web. The 60 largest Deep Web sites alone contain about 750 terabytes of information, exceeding the volume of the Surface Web by a factor of 40.
  • There are reportedly more than 200,000 Deep Web sites.
  • According to the study, Deep Web pages receive on average 50 % more hits per month and are linked more often than pages from the Surface Web.
  • The Deep Web is the fastest growing category of new information on the Web.
  • Nevertheless, the Deep Web is hardly known to the searching internet public.
  • More than half of the Deep Web resides in topic-specific databases.

Since BrightPlanet offers a commercial search tool with DQM2, its size estimate (which may be greatly overestimated) should be treated with great caution. BrightPlanet's estimate of the Deep Web's data volume needs to be adjusted for several data sets:

  • Duplicates from overlapping library catalogs
  • Data collection of the National Climatic Data Center (361 terabytes)
  • NASA data (215 terabytes)
  • Other data collections (National Oceanographic Data Center & National Geophysical Data Center, Right-to-Know Network, Alexa, ...)

Based on the number of records, it appears that the study overestimates the size of the Deep Web tenfold. However, the information provider LexisNexis alone, with 4.6 billion records, accounts for more than half the number of records held by the search engine leader Google. The Deep Web is therefore certainly much larger than the Surface Web.

In a 2003 study by the University of California, Berkeley, the following values were calculated for the size of the Internet: Surface Web, 167 terabytes; Deep Web, 91,850 terabytes. For comparison, the printed holdings of the Library of Congress in Washington, one of the largest libraries in the world, amount to 10 terabytes.

Types of Deep Web

Following Sherman & Price (2001), five types of the Invisible Web are distinguished: "Opaque Web", "Private Web", "Proprietary Web", "Invisible Web" and "Truly Invisible Web".

Opaque Web

The Opaque Web (opaque: not transparent) consists of web pages that could be indexed but are currently not indexed for reasons of technical capacity or cost-benefit considerations (search depth, frequency of visits).

Search engines do not consider all directory levels and sub-pages of a website. When capturing a website, web crawlers follow links to further pages. Crawlers cannot navigate on their own, can get lost in deep directory structures, may fail to capture pages, and may not find their way back to the home page. For this reason, search engines rarely consider more than five or six directory levels. Extensive, and thus potentially relevant, documents may reside in deeper levels of the hierarchy and cannot be found because of the limited crawling depth of search engines.
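The effect of such a depth limit can be illustrated with a minimal sketch of a breadth-first crawler in Python. The link graph, the start page and the depth limit are purely illustrative assumptions, not the implementation of any real search engine.

```python
from collections import deque

def crawl(start_url, link_graph, max_depth=5):
    """Breadth-first traversal that stops after max_depth link levels.

    link_graph is a stand-in for fetching a page and extracting its
    hyperlinks; a real crawler would download and parse HTML instead.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    indexed = []
    while queue:
        url, depth = queue.popleft()
        indexed.append(url)
        if depth == max_depth:
            continue  # pages below this level stay invisible to the index
        for linked in link_graph.get(url, []):
            if linked not in seen:
                seen.add(linked)
                queue.append((linked, depth + 1))
    return indexed

# Illustrative site: a document buried seven directory levels deep.
site = {
    "/": ["/a/"],
    "/a/": ["/a/b/"],
    "/a/b/": ["/a/b/c/"],
    "/a/b/c/": ["/a/b/c/d/"],
    "/a/b/c/d/": ["/a/b/c/d/e/"],
    "/a/b/c/d/e/": ["/a/b/c/d/e/f/"],
    "/a/b/c/d/e/f/": ["/a/b/c/d/e/f/report.html"],
}
print(crawl("/", site, max_depth=5))  # the deeply buried report.html is never reached
```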

This category also includes file formats that can only be partially captured (for example PDF files: Google indexes only part of a PDF file and presents the content as HTML).

The frequency with which a website is indexed (daily, monthly) also plays a role. In addition, constantly updated data sets, such as online data, are affected. Web pages without hyperlinks or a navigation system, unlinked pages, hermit URLs, and orphan pages belong to this category as well.

Private Web

The Private Web describes websites that could be indexed but are not indexed because of access restrictions imposed by the webmaster.

These can be websites on an intranet (internal pages), but also password-protected data (requiring registration and possibly a password and login), access limited to specific IP addresses, protection from indexing through the Robots Exclusion Standard, or protection from indexing through the meta tag values noindex, nofollow and noimageindex in the source code of the page.
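As a minimal sketch, the following Python snippet shows how a well-behaved crawler honours both kinds of exclusion using only the standard library; the robots.txt rules, the user agent name and the sample HTML are hypothetical.

```python
from urllib import robotparser
from html.parser import HTMLParser

# Robots Exclusion Standard: a compliant crawler consults robots.txt first.
rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /private/"])  # hypothetical robots.txt content
if not rp.can_fetch("ExampleCrawler", "https://example.org/private/report.html"):
    print("robots.txt forbids fetching this page")

# Meta robots tags: a fetched page can still ask to be left out of the index.
class RobotsMetaParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            self.directives |= {v.strip() for v in attrs.get("content", "").lower().split(",")}

sample_html = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
parser = RobotsMetaParser()
parser.feed(sample_html)
if "noindex" in parser.directives:
    print("page requests exclusion from the index")
```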

Proprietary Web

The Proprietary Web refers to sites that could be indexed but are accessible only after acceptance of a usage condition or after entering a password (free of charge or for a fee).

Such websites are usually available only after identification (for example, web-based specialist databases).

Invisible Web

The Invisible Web covers sites that could be indexed from a purely technical point of view but are not indexed for commercial or strategic reasons, for example databases accessible only through a web form.

Truly Invisible Web

The Truly Invisible Web refers to sites that cannot (yet) be indexed for technical reasons. These include database formats that predate the WWW (some hosts), documents that cannot be displayed directly in a browser, non-standard formats (e.g. Flash), file formats that cannot be captured because of their complexity (graphic formats), compressed data, and websites that can only be operated through user navigation based on graphics (image maps) or scripts (frames).

Databases

Dynamically created websites

Web crawlers work almost exclusively with static websites and cannot reach many dynamic web pages, since they can reach other pages only through hyperlinks, whereas dynamic pages often arise only after an HTML form has been filled in, something a crawler currently cannot accomplish.

Cooperative database providers allow search engines to access the contents of their databases through mechanisms such as JDBC, in contrast to (normal) non-cooperative databases, which offer database access only through a search form.
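As a rough sketch of this difference, the following Python example uses sqlite3 as a stand-in for a cooperative provider's programmatic query interface (the text above names JDBC as one such mechanism); the table, the records and the form markup are invented for illustration.

```python
import sqlite3

# Cooperative case: the provider exposes a programmatic query interface,
# so an indexer can enumerate records directly instead of guessing form inputs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO articles (title) VALUES (?)",
    [("Deep Web survey",), ("Crawler depth limits",)],
)
for row in conn.execute("SELECT id, title FROM articles"):
    print("indexable record:", row)

# Non-cooperative case: the same data sits behind a search form such as
#   <form action="/search"><input name="q"></form>
# A crawler only sees the empty form; without meaningful values for "q",
# the result pages are never generated, so the records stay invisible.
```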

Hosts and specialist databases

Hosts are commercial information providers that bundle the specialist databases of different information producers under one interface. Some database providers (hosts) or database producers themselves operate relational databases whose data cannot be retrieved without a special access option (retrieval language, retrieval tool). Crawlers understand neither the structure nor the language required to read information from these databases. Many hosts have operated as online services since the 1970s and in some cases run database systems in their databases that were created long before the WWW.

Examples of such databases: library catalogs (OPACs), stock quotes, timetables, legal texts, job boards, news, patents, telephone directories, web shops, dictionaries.
