Internet Archive

The Internet Archive in San Francisco is a nonprofit project founded in 1996 by Brewster Kahle. It is dedicated to the long-term archiving of digital data in a freely accessible form.

It stores snapshots of web pages, Usenet articles, films, television broadcasts, sound recordings (including live concerts), books, and software. A mirror of the San Francisco data is held at the Bibliotheca Alexandrina. In October 2012, the collection reached a size of 10 petabytes.


The web archive includes the Wayback Machine ("take-me-back machine"), which lets users retrieve saved web pages in different versions. The pages to be stored are selected via the Alexa Internet service. All URLs stored there are visited regularly and archived. In November 2009, the total volume was about 150 billion pages. Pages are made publicly available only about six months after they have been indexed.
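Retrieving a particular version of a saved page can be sketched as building a snapshot address. The sketch below assumes the publicly visible address pattern `https://web.archive.org/web/<timestamp>/<original URL>` with a 14-digit `YYYYMMDDhhmmss` timestamp. The function name `wayback_url` is hypothetical and introduced only for illustration. The code makes no network request.

```python
from datetime import datetime

# Assumed public address prefix of the Wayback Machine.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(page_url: str, when: datetime) -> str:
    """Build the address of the snapshot of page_url closest to `when`.

    The timestamp is formatted as YYYYMMDDhhmmss, the form used in
    Wayback Machine snapshot addresses.
    """
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{timestamp}/{page_url}"


if __name__ == "__main__":
    # Ask for the version of example.com as of 1 November 2009.
    print(wayback_url("http://example.com/", datetime(2009, 11, 1)))
```

If no snapshot exists for exactly the requested moment, the service typically redirects to the capture closest in time, so different timestamps yield different versions of the same page.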

As part of the Million Book Project, the Internet Archive digitizes books that have entered the public domain, either because their copyright has expired (under United States copyright law) or for other reasons, and offers them for download. The digitized copies are part of the Open Library.

For this purpose it operates several scanning centers (twelve in total in 2009), for example in Richmond. Scanning is done on commission at a charge of ten cents per page (as of 2009). The clients, mostly libraries, receive the digital copy, a text file created by OCR, a persistent Internet address, and the option of hosting the digitized content on the organization's servers. In addition, there are cooperation agreements with libraries that do their own digitizing, covering individual services such as OCR and redundant hosting.

In December 2006, the Library of Congress granted six exemptions from the U.S. copyright statute, the Digital Millennium Copyright Act. Under these, the Internet Archive may store computer software or games that have become abandonware for the purpose of preservation if the original hardware, formats, or technology are obsolete. In 2013, the Internet Archive began offering classic games that can be played in the browser via MESS emulation, e.g. the Atari 2600 video game E.T. the Extra-Terrestrial.

The archive as a whole exceeded a size of 10 petabytes in October 2012. Since early May 2007, it has been officially recognized as a library by the State of California.

Criticism, legal issues, and weaknesses

Because it is officially recognized as a library in the United States, the Internet Archive is in principle entitled to collect content within the U.S. and to make it available to the public within the U.S. By contrast, the extent to which content from outside the U.S. may be collected and made publicly available outside the U.S. depends on the copyright law of the countries concerned.

The Wayback Machine respects opt-out markings on websites and removes pages from the archive at the request of the rights holder. A corresponding entry in the robots.txt file in the root directory of a domain blocks both the crawling of the website and the display of its archived content through the Wayback Machine. In many countries, and most likely under European and German copyright law, this opt-out procedure is not permissible. Moreover, only the content of web servers that still exist can be blocked, because no robots.txt file can be placed on a defunct web server. Conversely, a domain is sometimes registered later by a different owner who sets up a robots.txt and thereby unknowingly blocks access to the earlier content, even though there is no connection between the two owners.
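A minimal sketch of how such a robots.txt opt-out can be evaluated, using Python's standard `urllib.robotparser` module. The user-agent name `ia_archiver` is an assumption here (it is the name historically associated with the Alexa crawler), and the helper `is_archiving_allowed` is hypothetical. This only illustrates how the file is read and does not reproduce the Wayback Machine's actual logic.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content that opts the whole site out for one crawler.
# "ia_archiver" is an assumed user-agent name, used for illustration.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_archiving_allowed(robots_txt: str, url: str,
                         user_agent: str = "ia_archiver") -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://example.com/page.html"
    print(is_archiving_allowed(ROBOTS_TXT, page))               # archive crawler
    print(is_archiving_allowed(ROBOTS_TXT, page, "otherbot"))   # unrelated crawler
```

Because the rule is attached to the domain rather than to an owner, whoever controls the domain at the moment the file is read decides for all earlier content as well, which is the mechanism behind the accidental blocking described above.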

One problem with the Wayback Machine is that it also preserves content that its authors no longer stand behind. Illegal content, such as defamation, can likewise remain available to the public for decades. If such content was published on the author's own web server, it can later be blocked or deleted in the Wayback Machine by means of a robots.txt file. In the age of Web 2.0, however, content is often published in public forums or on social networks, where authors cannot place a robots.txt file, so there is usually little opportunity to remove it from the Wayback Machine.

Another weakness of the Wayback Machine is that data is stored with a long delay, highly irregularly, and often incompletely. Graphics, multimedia elements, and dynamic content are often stored only to a small extent or not at all. As a result, some archived websites no longer work, or content that is essential for a complete understanding is missing.

The Wayback Machine also cannot link web content that has moved between or within web servers over the years. Even the smallest change to the URL of a web page means that earlier versions of the page can be found only if the searcher knows about the move or the URL change.
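The effect of URL changes can be illustrated with a toy model of an archive that indexes snapshots strictly by their exact URL. The class `ToyArchive` and the example addresses are hypothetical, and the model is deliberately simplified. It is not the Wayback Machine's actual data structure.

```python
from collections import defaultdict


class ToyArchive:
    """Toy archive that files every snapshot under its exact URL."""

    def __init__(self):
        # Maps a URL to a list of (timestamp, content) snapshots.
        self._snapshots = defaultdict(list)

    def save(self, url, timestamp, content):
        self._snapshots[url].append((timestamp, content))

    def history(self, url):
        """Return all snapshots stored under exactly this URL, oldest first."""
        return sorted(self._snapshots.get(url, []))


if __name__ == "__main__":
    archive = ToyArchive()
    # The same article, archived before and after it moved to a new path.
    archive.save("http://example.com/article.html", "2005", "first version")
    archive.save("http://example.com/news/article.html", "2010", "second version")

    # Looking up the current URL does not reveal the older version.
    print(archive.history("http://example.com/news/article.html"))
    # The older version is found only by someone who knows the old URL.
    print(archive.history("http://example.com/article.html"))
```

In this model nothing connects the two keys, so the page's history appears to begin in 2010 unless the searcher already knows the earlier address.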