Web archiving

Web archiving is the collection and permanent preservation of online publications, with the purpose of offering the public and researchers a glimpse into the past at some point in the future.

The largest international institution for web archiving is the Internet Archive in San Francisco (USA), which sees itself as an archive of the entire World Wide Web. State archives and libraries in many countries are making efforts to preserve the web heritage of their respective regions.

German archive legislation has defined the archiving of digital documents as a mandatory task of the state archives since 1987, but the implementation of this mandate has barely begun. In 2006 the DNBG (Act on the German National Library) was passed, which extends the mandate of the German National Library to the archiving of websites. The federal states are also planning to amend their legal deposit laws accordingly, or have already done so.

Archiving targets

Web archiving aims to capture a defined portion of the web presences available on the Internet in a systematic form. To this end, a comprehensive collection policy, a selection procedure, and the frequency of archiving must be clarified in advance.

An archived website should be preserved in the long term with all of its multimedia features (HTML, style sheets, JavaScript, images, and video). Metadata such as provenance, date of acquisition, MIME type, and data volume serve the later description, use, and preservation of the material. The metadata ensure the authenticity and integrity of the digital archival holdings.
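
As a minimal sketch, the kind of descriptive metadata mentioned above could be recorded per captured resource roughly as follows; the field names and the checksum field are illustrative assumptions, not a standard schema such as WARC metadata records.

```python
# Illustrative metadata record for one archived resource (not a standard schema).
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class ArchiveRecord:
    url: str                 # provenance: where the resource was captured from
    captured_at: datetime    # date of acquisition
    mime_type: str           # e.g. "text/html", "image/png"
    size_bytes: int          # volume of the captured data
    checksum_sha256: str     # supports later integrity checks (assumed field)

record = ArchiveRecord(
    url="https://example.org/index.html",
    captured_at=datetime.now(timezone.utc),
    mime_type="text/html",
    size_bytes=48211,
    checksum_sha256="<digest of the captured payload>",
)
```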

After acquisition, technical and legal measures must be taken to ensure permanent public accessibility.

Selection

With this selection approach, an entire domain is gradually written into an archive. Because of its large storage requirements, the method only works for smaller domains (for example netarkivet.dk).

Alternatively, a list of institutions is determined in advance. The stability of the URLs associated with these institutions must be checked regularly.

In the future, "intelligent" harvesting is conceivable, which, based on access counts, archives those parts of the web (or a selection of them) that have particularly high access rates.
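
A minimal sketch of such access-count-driven selection, assuming hypothetical access statistics are available, might look like this:

```python
# Pick the most frequently requested URLs for the next harvest.
# The statistics dictionary and its values are purely illustrative.
def select_by_access_count(access_counts: dict[str, int], top_n: int) -> list[str]:
    """Return the top_n URLs with the highest access counts."""
    return sorted(access_counts, key=access_counts.get, reverse=True)[:top_n]

stats = {
    "https://example.org/": 1200,
    "https://example.org/news": 450,
    "https://example.org/archive/1999": 3,
}
print(select_by_access_count(stats, top_n=2))
```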

Acquisition methods

Remote harvesting

The most common archiving method is the use of a web crawler. A web crawler retrieves the contents of a website just as a human user would and writes the results to an archive object.

More precisely, this means a recursive search of web pages based on the links found on them, starting from a certain starting point, which may be a single web page or a list of web pages to be searched. Because of quantitative limitations, for example of time or storage space, various restrictions on the crawl depth and on the file types to be archived are possible, as in the sketch below.
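
The following sketch illustrates recursive harvesting with a depth limit and a simple file-type filter. Real crawlers such as Heritrix add politeness delays, robots.txt handling, deduplication, and WARC output; the starting URL, depth limit, and allowed extensions here are illustrative assumptions.

```python
# Minimal recursive crawl with depth limit and file-type filter (illustrative).
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

ALLOWED_EXTENSIONS = ("", ".html", ".htm", ".css", ".js", ".png", ".jpg")

class LinkExtractor(HTMLParser):
    """Collects the href targets of all <a> tags in an HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(url: str, depth: int, archive: dict, max_depth: int = 2) -> None:
    """Fetch url, store its payload in the archive, and follow its links."""
    if depth > max_depth or url in archive:
        return
    if urlparse(url).scheme not in ("http", "https"):
        return
    parts = urlparse(url).path.rsplit(".", 1)
    ext = "." + parts[1].lower() if len(parts) == 2 else ""
    if ext not in ALLOWED_EXTENSIONS:
        return
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            body = response.read()
    except OSError:
        return
    archive[url] = body                      # the "archive object" for this URL
    parser = LinkExtractor()
    parser.feed(body.decode("utf-8", errors="replace"))
    for link in parser.links:
        crawl(urljoin(url, link), depth + 1, archive, max_depth)

captured: dict = {}
crawl("https://example.org/", depth=0, archive=captured)
```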

For larger projects, the ranking of web pages that determines the URL ordering is of particular importance. During a crawl, a very large number of web addresses may accumulate; they are then processed either in FIFO order or as a priority queue. In the latter case, the web pages can be thought of as arranged in a heap structure: each web page forms a heap of its own, and each link to another web page found on it in turn forms a sub-heap that represents an element in the heap of the preceding page. This also has the advantage that, when the URL list overflows, the entries with the lowest priority are the first to be replaced by new ones.
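
The overflow behaviour described above can be sketched with a bounded priority queue: when the frontier is full, the lowest-priority URL is evicted in favour of a new, higher-priority one. The priority values and capacity below are illustrative, and the priority function itself is application-specific.

```python
# Bounded URL frontier kept as a min-heap on priority (illustrative sketch).
import heapq

class BoundedFrontier:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._heap = []          # min-heap of (priority, url); smallest first

    def push(self, priority: float, url: str) -> None:
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, (priority, url))
        elif priority > self._heap[0][0]:
            # Replace the currently lowest-priority URL with the new one.
            heapq.heapreplace(self._heap, (priority, url))

    def pop_highest(self) -> str:
        # Retrieve the highest-priority URL (linear scan kept simple here;
        # a production frontier would organise this more efficiently).
        best = max(self._heap)
        self._heap.remove(best)
        heapq.heapify(self._heap)
        return best[1]

frontier = BoundedFrontier(capacity=3)
for prio, url in [(0.9, "https://example.org/"), (0.2, "https://example.org/a"),
                  (0.5, "https://example.org/b"), (0.7, "https://example.org/c")]:
    frontier.push(prio, url)
print(frontier.pop_highest())   # -> https://example.org/
```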

However, the original structure on the server can rarely be reproduced exactly in the archive.

Examples are:

  • Heritrix
  • HTTrack
  • Offline Explorer

Archiving of the "Hidden Web"

The Hidden Web or Deep Web refers to databases that often hold the actual contents of a website and are only delivered upon a user's request. This, too, means the web is constantly changing, and it appears as if it had an infinite size. Taking over these databases requires an interface, which is usually based on XML. The tools DeepArc (Bibliothèque nationale de France) and Xinq (National Library of Australia) have been developed for such access.
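
The following is a generic sketch of the idea, not the DeepArc or Xinq API: each query sent to an assumed XML search interface returns records, which are stored verbatim together with the query that produced them. The endpoint URL and parameter names are assumptions for illustration.

```python
# Harvest database content through a hypothetical XML query interface.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

def harvest_deep_web(endpoint: str, queries: list[str]) -> dict[str, list[str]]:
    """Run each query against the XML interface and collect the raw records."""
    results: dict[str, list[str]] = {}
    for query in queries:
        url = endpoint + "?" + urllib.parse.urlencode({"q": query, "format": "xml"})
        with urllib.request.urlopen(url, timeout=10) as response:
            tree = ET.fromstring(response.read())
        # Keep each <record> element verbatim so the archive stays faithful
        # to what the database delivered for this query.
        results[query] = [ET.tostring(rec, encoding="unicode")
                          for rec in tree.iter("record")]
    return results

# Example call (hypothetical endpoint):
# archive = harvest_deep_web("https://catalogue.example.org/search",
#                            ["goethe", "schiller"])
```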

Transactional archiving

This method is used to archive the traces left by the actual use of websites. It is important for institutions that, for legal reasons, have to keep proof of how their websites were used. The prerequisite is the installation of an additional program on the web server.
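
As a minimal sketch of such a server-side component, assuming a Python WSGI application, the following middleware appends one log entry per request/response pair; the logged fields and the log format are illustrative, not a standard.

```python
# WSGI middleware that records every transaction passing through the application.
import json
import time

class TransactionArchiver:
    def __init__(self, app, log_path: str = "transactions.log"):
        self.app = app
        self.log_path = log_path

    def __call__(self, environ, start_response):
        captured = {}

        def recording_start_response(status, headers, exc_info=None):
            captured["status"] = status
            return start_response(status, headers, exc_info)

        # Consume the response so its size can be logged, then replay it.
        body_chunks = list(self.app(environ, recording_start_response))
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({
                "timestamp": time.time(),
                "method": environ.get("REQUEST_METHOD"),
                "path": environ.get("PATH_INFO"),
                "query": environ.get("QUERY_STRING"),
                "status": captured.get("status"),
                "response_bytes": sum(len(c) for c in body_chunks),
            }) + "\n")
        return body_chunks
```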
