Data scraping

The term screen scraping generally covers all methods of reading text from computer screens. At present, however, the term is used almost exclusively in reference to websites (hence web scraping). In this case, screen scraping specifically denotes the technologies that serve to obtain information by selectively extracting the required data.

Areas of application

Search Engines and Web Mining

Search engines use so-called crawlers to search the World Wide Web, to analyze websites, and to collect data such as RSS feeds or e-mail addresses. Screen-scraping techniques are also applied in web mining.

Replacement of Web Services

To make the retrieval and further processing of information from websites considerably easier for the customer, the provider of the page content (also called the content provider) can present the data not only in the form of a (human-readable) website, but additionally prepare it in a machine-readable format (such as XML). Specifically requested data could then be provided to the customer as a web service for automated further processing.

Often, however, the content provider has no interest in the mechanized retrieval of its data or in the automated use of its service (particularly with regard to special features that should be reserved exclusively for real users), or setting up a web service would involve high costs and would therefore be uneconomical. In such cases, screen scraping is frequently used to filter the desired data out of the website anyway.

Extended browsing

Screen scraping can be used to equip the browser with additional functions or to simplify previously cumbersome processes. For example, logins to forums can be automated, or services of a website can be accessed without the user having to visit the website, for instance via a browser toolbar.

Bookmarklets are a simple form of such screen scrapers.
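
A minimal sketch of such a bookmarklet, stored as a javascript: URL in a browser bookmark (written out over several lines for readability; in the bookmark it is kept on a single line). The concrete behavior, collecting all link targets of the currently displayed page, is a hypothetical example:

    javascript:(function () {
        // Collect all link targets of the current page and show them in a dialog.
        const links = document.querySelectorAll('a[href]');
        const urls = [];
        for (const a of links) { urls.push(a.href); }
        alert(urls.join('\n'));
    })();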

Remixing

Remixing is a technique in which web content from different services is combined into a new service (see also mashup). If no open programming interfaces are available, screen-scraping mechanisms must be used here as well.

Abuse

However, screen-scraping techniques can also be abused by copying content from third-party websites against the provider's will and offering it on a separate server.

Operation

Screen Scraping essentially consists of two steps:

  • Retrieving web pages
  • Extracting the relevant data

Retrieving web pages

Static websites

Ideally, the data of interest is located on a web page that can be accessed via a URL. All parameters required to retrieve the information are passed as URL parameters (query string, see GET request). In this simple case, the web page is simply downloaded and the data is extracted with a suitable mechanism.
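
A minimal sketch of this simple case using the fetch API; the URL and the query parameter names are hypothetical:

    // Download a static page whose content is fully determined by URL parameters
    // (hypothetical URL and parameter names); extraction follows as a separate step.
    async function fetchStaticPage() {
        const url = 'https://www.example.com/list?category=disks&page=1';
        const response = await fetch(url);   // plain GET request
        return response.text();              // the downloaded HTML as a string
    }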

Forms

In many cases, the parameters are specified by filling out a web form. They are often not passed in the URL but in the message body (POST request).
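
A sketch of the corresponding retrieval when the parameters have to travel in the message body of a POST request; the endpoint and the form field names are assumptions:

    // Submit a search form whose parameters are sent in the POST body
    // instead of the query string (hypothetical endpoint and field names).
    async function fetchFormResult() {
        const body = new URLSearchParams({ query: 'floppy disk', sort: 'price' });
        const response = await fetch('https://www.example.com/search', {
            method: 'POST',
            headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
            body: body.toString()
        });
        return response.text();
    }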

Personalized websites

Many web pages contain personalized information. However, the Hypertext Transfer Protocol (HTTP) has no native way to assign requests to a specific person. In order to recognize a particular person, the server application must use session concepts built on top of HTTP. A frequently used option is the transmission of session IDs in the URL or through cookies. These session concepts must be supported by a screen-scraping application.
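
As an illustration, a minimal sketch of cookie-based session handling in a scraper; the URLs and the cookie name are hypothetical, and it assumes a runtime such as Node.js whose fetch implementation exposes the Set-Cookie response header (browsers do not):

    // Obtain a session cookie from the entry page, then replay it when
    // requesting the personalized page (hypothetical URLs and cookie name).
    async function fetchPersonalizedPage() {
        const entry = await fetch('https://www.example.com/start');
        // e.g. "SESSIONID=abc123; Path=/; HttpOnly" -> keep only "SESSIONID=abc123"
        const setCookie = entry.headers.get('set-cookie') || '';
        const sessionCookie = setCookie.split(';')[0];

        const page = await fetch('https://www.example.com/account/orders', {
            headers: { 'Cookie': sessionCookie }   // replay the session ID
        });
        return page.text();
    }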

Data Extraction

A program for extracting data from web pages is also called a wrapper.

Once the website has been downloaded, the first thing that matters for extracting the data is whether the exact location of the data on the website is known (for example, second table, third column).

If this is the case, there are several options available for extracting the data. The downloaded web pages can be interpreted as character strings, and the data can be extracted, for example, using regular expressions.
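
A sketch of regular-expression-based extraction; the markup it matches, a table cell with a price class, is a hypothetical example:

    // Extract all prices from the downloaded HTML string with a regular expression,
    // assuming each price is wrapped as <td class="price">12,34</td> (hypothetical markup).
    function extractPrices(html) {
        const pattern = /<td class="price">\s*([\d.,]+)\s*<\/td>/g;
        const prices = [];
        let match;
        while ((match = pattern.exec(html)) !== null) {
            prices.push(match[1]);   // the captured numeric part
        }
        return prices;
    }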

If the web page is XHTML-compliant, the use of an XML parser is an option. There are numerous supporting technologies for accessing XML (SAX, DOM, XPath, XQuery). Often, however, websites are delivered only in (possibly faulty) HTML format, which does not comply with the XML standard. With a suitable parser, it may nevertheless be possible to produce an XML-compliant document. Alternatively, the HTML can be cleaned up with HTML Tidy before parsing. Some screen scrapers use a query language developed specifically for HTML.
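
If the page is well-formed or has been cleaned up, DOM-based access is possible. A sketch using the browser's DOMParser for the case mentioned above, where the data is known to sit in the second table, third column:

    // Parse the downloaded document and read the third column of the second table,
    // mirroring the "exact location is known" case described above.
    function extractThirdColumn(html) {
        const doc = new DOMParser().parseFromString(html, 'text/html');
        const table = doc.querySelectorAll('table')[1];      // second table
        const values = [];
        const rows = table.querySelectorAll('tr');
        for (let i = 0; i < rows.length; i++) {
            const cell = rows[i].querySelectorAll('td')[2];  // third column
            if (cell) values.push(cell.textContent.trim());
        }
        return values;
    }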

A criterion for the quality of the extraction mechanisms is their robustness against changes to the structure of the website. This requires fault-tolerant extraction algorithms.

In many cases, however, the structure of the website is unknown (for example, when using crawlers). Data structures such as price or time information must then be recognized and interpreted without any fixed specifications.

Architecture

Centralized Architecture

A screen scraper can be installed on a dedicated web server that retrieves the requested data at regular intervals or on demand and in turn offers it in processed form. However, this server-side approach can raise legal problems and can also easily be prevented by the content provider by blocking the server's IP address.

Distributed Architecture

In the distributed approach, the information is retrieved directly by the client. Depending on the application, the information is stored in a database, shared with other applications, or processed in the browser. The distributed architecture is not only much harder to block, it also scales better.

Provider-side defensive measures

Many content providers have no interest in the isolated retrieval of specific information. One reason may be that the provider is financed by advertising, which can easily be filtered out by screen scraping. In addition, the content provider might have an interest in forcing the user through a specific navigation sequence. There are various strategies for safeguarding these interests.

Control of user behavior

The server uses session IDs to force the user through a specific navigation sequence. When the entry page of the website is called, a temporarily valid session ID is generated. This is transmitted via the URL, hidden form fields, or cookies. If a user or a bot reaches the site via a deep link, it cannot present a valid session ID. The server then redirects it to the entry page. eBay, for example, uses this strategy to prevent deep links to auction lists. However, a specially programmed screen scraper can first obtain a valid session ID and then download the desired page.

The following example shows a JavaScript-based screen scraper that bypasses the strategy used by eBay. It first downloads the entry page, uses a regular expression to extract a valid URL (in this case the list of auctions in which floppy disks are offered), and opens it in the browser.

    function EbayScraper() {
        // Download the entry page (synchronous GET request for simplicity).
        var req = new XMLHttpRequest();
        req.open('GET', 'http://computer.ebay.de', false);
        req.send(null);
        // Extract a currently valid deep link to the disk auction list with a regular expression.
        var regex = new RegExp('http:\\/\\/computer\\.listings\\.ebay\\.de\\/Floppy-Zip-Streamer_Disketten_[a-zA-Z0-9]*');
        // Open the extracted URL (including its session information) in the browser.
        window.location = req.responseText.match(regex);
    }

In addition to the misappropriation of session IDs, there are other ways of checking the user's behavior:

  • Checking the referrer to ward off deep links
  • Checking whether elements embedded in the page (graphics, etc.) are downloaded promptly
  • Checking whether JavaScript elements are executed

However, all these methods involve certain problems, for example because referrer information is not mandatory, because embedded elements may be delivered by a proxy or from the cache, or because the user has simply disabled the display of graphics or the execution of JavaScript.
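
As an illustration of the first of these checks, a minimal server-side sketch in Node.js that redirects deep-link requests without a matching Referer header back to the entry page; the paths and host name are hypothetical, and, as noted above, the check is easy to defeat:

    // Minimal Node.js sketch: requests to deep links without the expected Referer
    // header are redirected to the entry page (hypothetical paths and host name;
    // not robust, since the Referer header is optional and easily forged).
    const http = require('http');

    http.createServer((req, res) => {
        const referer = req.headers['referer'] || '';
        if (req.url.startsWith('/listing/') && !referer.includes('www.example.com')) {
            res.writeHead(302, { Location: '/start' });   // force the navigation sequence
            res.end();
            return;
        }
        res.writeHead(200, { 'Content-Type': 'text/html' });
        res.end('<html><body>requested content</body></html>');
    }).listen(8080);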

Distinguishing between humans and bots

Before delivering the data, the server tries to detect whether the client is a human or a bot. A frequently used method for this is the use of CAPTCHAs. The client is given a task that is as easy as possible for humans but very difficult for a machine to solve. This can be an arithmetic problem or the typing of letters, where the difficulty for the machine often lies in recognizing the task in the first place. This can be achieved, for example, by presenting the arithmetic problem not as text but as an image.

Theoretically, bots can be developed for all such CAPTCHAs that solve these tasks on the basis of optical character recognition (extraction of the task from an image), so that this protection can be circumvented. It is also possible to pass the subtask on to a human, who then solves the CAPTCHA for the machine. Both, however, mean considerably more effort for the bot operator.

Concealment

The information is offered in a form that is difficult or impossible for machines to read, for example as graphics, Flash animations, or Java applets. However, usability often suffers as a result.

JavaScript can also be used to conceal the data. This method is used in particular against e-mail harvesters, which collect e-mail addresses for sending spam. The actual data is not transferred in the HTML code, but is only written into the web page by JavaScript. The data can also be transferred in encrypted form and only be decrypted when the page is displayed. Using an obfuscator, the program code can be obfuscated to make the development of a screen scraper more difficult.

A simple example of concealing an e-mail address with JavaScript (without encryption):

    function mail() {
        // Assemble the mailto link only at display time in the browser,
        // so that the plain e-mail address never appears in the HTML source.
        var name = "mail";
        var domain = "example.com";
        var mailto = 'mailto:' + name + '@' + domain;
        document.write(mailto);
    }

Creation of screen scrapers

Depending on the complexity of the task, a screen scraper has to be programmed from scratch. Using toolkits, however, screen scrapers can also be created without programming knowledge. There are various possibilities for the form of implementation, for example as a library, as a proxy server, or as a standalone program.

Applications

Piggy Bank is an extension for Firefox developed by the Simile project at MIT. It can be used to create links between services from different providers. It automatically detects RDF resources offered on a website. These can be stored, managed, and combined with other services (for example, geographical information with Google Maps). If the website does not offer any RDF resources, there is also the option of using JavaScript- or XSLT-based screen scrapers.

A better-known Firefox extension is Greasemonkey. It allows the user to run custom JavaScript files in the browser, which can customize the appearance and behavior of the displayed web page without requiring access to the actual website. This makes it possible, for example, to extend websites with new features, fix bugs in their presentation, include content from other websites, and perform recurring tasks automatically.
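
A minimal sketch of such a user script; the metadata block is the usual Greasemonkey header, while the target site and the modification itself are hypothetical:

    // ==UserScript==
    // @name        Example price annotator
    // @namespace   http://www.example.com/userscripts
    // @include     http://www.example.com/*
    // ==/UserScript==

    // Hypothetical sketch: append a note to every price cell of the displayed page,
    // entirely in the user's browser and without any change on the server.
    const cells = document.querySelectorAll('td.price');
    for (const cell of cells) {
        cell.textContent += ' (price shown without shipping)';
    }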

Legal problems

When scraping third-party websites, copyright must be respected, especially when the content is integrated into a separate offering. A legal gray area, on the other hand, is the offering of programs that enable client-side screen scraping. Some providers also explicitly prohibit the automatic reading of data in their terms of use.

Under certain circumstances, another problem is the hiding of information, for example advertising or legally relevant information such as disclaimers and warnings, or even the automatic confirmation of the terms of use by the screen scraper, without the user ever seeing them.
