Robots Exclusion Standard

Under the convention of the Robots Exclusion Standard protocol, a web crawler (robot), on visiting a website, first reads the file robots.txt (lowercase) in the root directory ("root") of the domain. This file specifies whether and how the website may be visited by a web crawler. Website owners thus have the ability to block selected areas of their web presence from (certain) search engines. The protocol is purely advisory and depends on the cooperation of the web crawlers; one speaks here of "friendly" web crawlers. Excluding certain parts of a website via the protocol does not guarantee confidentiality; for that, pages or subdirectories of a server must be protected by HTTP authentication, an access control list (ACL), or a similar mechanism. Some search engines still display the URLs that were found by web crawlers and marked as blocked on their search result pages, but omit any description of these pages.
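As a minimal sketch of the lookup step described above (in Python, using only the standard library; the page URL and the helper name robots_txt_url are made up for illustration), a crawler can derive the robots.txt address from any page URL like this:

    from urllib.parse import urlsplit, urlunsplit

    def robots_txt_url(page_url):
        # robots.txt always sits in the root directory of the domain,
        # so only the scheme and host of the page URL are kept.
        parts = urlsplit(page_url)
        return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

    print(robots_txt_url("https://example.com/private/family/Geburtstage.html"))
    # https://example.com/robots.txt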

The protocol was developed in 1994 by an independent group, but is now generally accepted and can be considered a quasi-standard. In early June 2008, Google, Microsoft, and Yahoo agreed on some commonalities in how they handle it.

Structure

The robots.txt file is a plain text file in an easily readable format. Each line consists of two fields separated by a colon.

    User-agent: Sidewinder
    Disallow: /

The first line names the web crawler (here: User-agent) that the subsequent rules apply to. There can be any number of such blocks. Web crawlers read the file from top to bottom and stop at the first block that refers to them. After a block whose first line begins with User-agent: *, a web crawler also stops and reads no further, so the file should list the blocks for specific web crawlers first and the block for all crawlers last. For each URL that is excluded, there is a separate line with the Disallow directive. Blank lines are allowed only above User-agent lines; they separate the blocks from one another. Single-line comments beginning with a hash mark (#) are possible anywhere. They serve readability only and are ignored by web crawlers.
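Python's standard library includes a parser for this file format, urllib.robotparser. A minimal sketch of how a friendly crawler would apply the block above (the crawler name OtherBot is made up for contrast):

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    # parse() takes the file content as a list of lines,
    # so no network access is needed for this sketch.
    rp.parse([
        "User-agent: Sidewinder",
        "Disallow: /",
    ])

    print(rp.can_fetch("Sidewinder", "https://example.com/"))  # False: the block applies
    print(rp.can_fetch("OtherBot", "https://example.com/"))    # True: no rule matches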

Examples

    # robots.txt for example.com
    # This web crawler is excluded
    User-agent: Sidewinder
    Disallow: /

    User-agent: Microsoft.URL.Control
    Disallow: /

    # These directories/files should not be searched
    User-agent: *
    Disallow: /default.html
    Disallow: /temp/ # this content will disappear soon
    Disallow: /private/family/Geburtstage.html # not a secret, but it should not be listed in search engines

The following commands prohibit all web crawlers from indexing the entire web presence:

    User-agent: *
    Disallow: /
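Continuing the sketch from the previous section, the per-path rules of the first example file can be checked with the same urllib.robotparser module (the crawler name SomeBot is hypothetical):

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.parse([
        "User-agent: Sidewinder",
        "Disallow: /",
        "",
        "User-agent: *",
        "Disallow: /default.html",
        "Disallow: /temp/",
        "Disallow: /private/family/Geburtstage.html",
    ])

    # Sidewinder stops at its own block and is barred from everything.
    print(rp.can_fetch("Sidewinder", "https://example.com/index.html"))  # False
    # Other crawlers fall through to the User-agent: * block.
    print(rp.can_fetch("SomeBot", "https://example.com/index.html"))     # True
    print(rp.can_fetch("SomeBot", "https://example.com/temp/x.html"))    # False
    print(rp.can_fetch("SomeBot", "https://example.com/private/family/Geburtstage.html"))  # False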

Alternatives

Meta-information

Web crawlers can also be prevented from indexing a page by meta elements in the HTML source code of a website. Meta elements, too, are purely advisory, require the cooperation of "friendly" web crawlers, and do not guarantee confidentiality. If the search robot should neither include the web page in the search engine's index (noindex) nor follow the hyperlinks on the page (nofollow), the meta element is written as follows:

    <meta name="robots" content="noindex,nofollow" />

In HTML documents where both should be allowed, the indication can either be omitted or noted explicitly:

    <meta name="robots" content="index,follow" />
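As a sketch of how a friendly crawler could evaluate these meta elements, the following uses Python's built-in html.parser module (the one-line HTML document is a made-up minimal example):

    from html.parser import HTMLParser

    class RobotsMetaParser(HTMLParser):
        """Collects the directives of <meta name="robots" ...> tags."""
        def __init__(self):
            super().__init__()
            self.directives = set()

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
                content = attrs.get("content") or ""
                self.directives.update(
                    d.strip().lower() for d in content.split(","))

    html = '<html><head><meta name="robots" content="noindex,nofollow" /></head></html>'
    parser = RobotsMetaParser()
    parser.feed(html)
    print("noindex" in parser.directives)   # True: do not add the page to the index
    print("nofollow" in parser.directives)  # True: do not follow the page's links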

ACAP

ACAP (Automated Content Access Protocol) 1.0, published on 30 November 2007, was created as an alternative to the Robots Exclusion Standard. However, search engine operators and other service providers do not use this information; Google has ruled out using ACAP in its current form.
