Web crawler

A web crawler (also called a spider or searchbot) is a computer program that automatically searches the World Wide Web and analyzes web pages. Web crawlers are mainly used by search engines. Other applications include collecting RSS news feeds, e-mail addresses, or other information.

Web crawlers are a special kind of bot, i.e. computer programs that carry out repetitive tasks largely autonomously.

History

The first web crawler, the World Wide Web Wanderer, appeared in 1993 and was intended to measure the growth of the Internet. In 1994 WebCrawler followed, the first publicly available WWW search engine with a full-text index; the name "web crawler" for such programs derives from it. Since then the number of search engines has grown rapidly, and today a wide variety of web crawlers exist. They generate up to 40% of total Internet traffic.

Technology

Much like a person surfing the internet, a web crawler follows hyperlinks from one page to other URLs. All retrieved addresses are stored and visited sequentially, and any hyperlinks found are added to the list of URLs. In theory, every page of the WWW that is linked and not blocked for web crawlers can be found this way. In practice, however, a selection is often made, and the process is eventually terminated and restarted. Depending on the crawler's task, the content of the web pages found is, for example, evaluated and stored through indexing, to facilitate later searches within the collected data.
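
The crawl loop just described can be sketched in a few lines of Python. This is a minimal illustration only, assuming a breadth-first frontier and using only the standard library; a real crawler would add politeness delays, a robots.txt check, and robust error handling.

```python
# Minimal sketch of the crawl loop: visit URLs from a frontier,
# collect the hyperlinks on each page, and append them to the frontier.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attributes of <a> tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def crawl(seed_url, max_pages=100):
    frontier = deque([seed_url])   # URLs still to visit
    visited = set()                # URLs already retrieved
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue               # skip unreachable pages
        parser = LinkExtractor(url)
        parser.feed(html)
        frontier.extend(link for link in parser.links if link not in visited)
        # here the page content would be evaluated and indexed
    return visited
```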

Using the Robots Exclusion Standard, a website operator can use the file robots.txt and certain meta tags in the HTML header to tell a crawler which pages to index and which not, as long as the crawler adheres to the protocol. To combat unwanted web crawlers there are also special websites, so-called tar pits, which feed crawlers false information and additionally slow them down considerably.
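
For illustration, a robots.txt that blocks all crawlers from a /private/ directory consists of rules such as "User-agent: *" and "Disallow: /private/". A cooperative crawler can check such rules with Python's standard library; a minimal sketch, assuming the placeholder domain example.com and a made-up crawler name MyCrawler:

```python
# Checking the Robots Exclusion Standard before fetching a page.
# example.com and MyCrawler are placeholders, not real endpoints.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # download and parse the robots.txt rules

url = "https://example.com/private/page.html"
if rp.can_fetch("MyCrawler", url):
    print("robots.txt allows fetching", url)
else:
    print("robots.txt disallows fetching", url)
```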

Problems

A large part of the entire Internet is not captured by web crawlers, and thus by public search engines, since much content is reachable not via simple links but, for example, only through search masks and access-restricted portals. These areas are referred to as the "deep web". In addition, the constant change of the Web as well as the manipulation of content (cloaking) pose problems.

Types

Thematically focused web crawlers are called focused crawlers or focused web crawlers. The focusing of the web search is realized on the one hand by classifying web pages themselves and on the other by classifying individual hyperlinks. In this way the focused crawler finds the best path through the Web and indexes only those areas of the Web that are relevant for a topic or a domain. The main obstacles in the practical implementation of such web crawlers are unlinked subareas and the training of the classifiers.
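
The prioritization can be made concrete with a small sketch: a best-first frontier ordered by a relevance score. Here a plain keyword count stands in for the trained classifier the text refers to, and the fetch and link-extraction helpers are assumed to be supplied (for instance, the ones from the sketch in the Technology section).

```python
# Sketch of a focused crawl: a priority frontier ordered by relevance.
# A simple keyword count replaces a trained classifier for illustration.
import heapq

TOPIC_KEYWORDS = {"crawler", "search", "index"}   # assumed topic

def relevance(text):
    words = text.lower().split()
    return sum(words.count(k) for k in TOPIC_KEYWORDS)

def focused_crawl(seed_url, fetch, extract_links, max_pages=50):
    # heapq is a min-heap, so negated scores yield highest relevance first
    frontier = [(0, seed_url)]
    visited = set()
    relevant = []
    while frontier and len(visited) < max_pages:
        neg_score, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)
        if html is None:
            continue
        score = relevance(html)
        if score > 0:
            relevant.append((url, score))  # index only relevant pages
        for link in extract_links(url, html):
            if link not in visited:
                heapq.heappush(frontier, (-score, link))
    return relevant

# Usage with stub helpers (a real crawler would fetch over HTTP):
pages = {"seed": ("crawler search engine index", ["p1"]),
         "p1": ("unrelated text", [])}
fetch = lambda u: pages.get(u, (None, []))[0]
extract = lambda u, h: pages.get(u, (None, []))[1]
print(focused_crawl("seed", fetch, extract))  # [('seed', 3)]
```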

Web crawlers are also used for data mining and for the investigation of the Internet (webometrics), and need not necessarily be limited to the WWW.

A special form of web crawler are harvesters (from "to harvest"). The term is used for software that scans the Internet (WWW, Usenet, etc.) for e-mail addresses and "harvests" them. The electronic addresses collected in this way can then be marketed. The typical result, especially with spambots, is unsolicited advertising e-mail (spam). For this reason, the formerly common practice of providing e-mail addresses on websites as mailto: links for contact is increasingly being abandoned; sometimes the addresses are made unreadable for bots by inserting spaces or words, so that a@example.com becomes a (at) example (dot) com. Most bots can nevertheless recognize such addresses. An equally popular method is embedding the e-mail address in an image. The address is then not available as a character string in the source code of the website and therefore cannot be found as text information by a bot. However, this has the disadvantage for users that they cannot transfer the address into their e-mail program conveniently by clicking on it, but must type it out. Far more serious, however, is that the page is then no longer accessible, and visually impaired people are excluded along with the bots.
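
This arms race between obfuscation and harvesting is easy to illustrate. In the following minimal sketch (the address and both patterns are made up for illustration), a naive pattern misses the (at)/(dot) spelling, while a slightly extended one still finds it:

```python
# A naive harvester regex versus the (at)/(dot) obfuscation described above.
import re

PLAIN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
OBFUSCATED = re.compile(
    r"[\w.+-]+\s*\(\s*at\s*\)\s*[\w-]+\s*\(\s*dot\s*\)\s*\w+", re.I)

page = "Contact: a (at) example (dot) com"
print(PLAIN.findall(page))       # [] -- the plain pattern misses it
print(OBFUSCATED.findall(page))  # the obfuscated form is still found
```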

Another application of web crawlers is finding copyrighted content on the Internet.
