Lazy Preservation: Reconstructing Websites from the Web Infrastructure.

Research output: Thesis › Doctoral Thesis

Abstract

Backup or preservation of websites is often not considered until after a catastrophic event has occurred. In the face of complete website loss, webmasters or concerned third parties have attempted to
recover some of their websites from the Internet Archive. Still others have sought to retrieve missing
resources from the caches of commercial search engines. Inspired by these post hoc reconstruction
attempts, this dissertation introduces the concept of lazy preservation: digital preservation performed as a result of the normal operations of the Web Infrastructure (web archives, search engines,
and caches). First, the Web Infrastructure (WI) is characterized by its preservation capacity and
behavior. Methods for reconstructing websites from the WI are then investigated, and a new type
of crawler is introduced: the web-repository crawler. Several experiments are used to measure and
evaluate the effectiveness of lazy preservation for a variety of websites, and various web-repository
crawler strategies are introduced and evaluated. The implementation of the web-repository crawler
Warrick is presented, and real usage data from the public is analyzed. Finally, a novel technique for
recovering the generative functionality (e.g., CGI programs and databases) of websites is presented,
and its effectiveness is demonstrated by recovering an entire EPrints digital library from the WI.
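The core operation of a web-repository crawler is asking each repository for its best-archived copy of a missing URL. As a minimal illustration only (Warrick itself queried several repositories and predates this particular endpoint), the sketch below builds a query against the Internet Archive's present-day Wayback Machine availability API and parses its JSON response shape; the sample response is a hypothetical example, not real recovered data:

```python
import json
from urllib.parse import urlencode

WAYBACK_AVAILABLE = "https://archive.org/wayback/available"

def availability_query(url, timestamp=None):
    """Build an availability-API query for a lost URL.

    timestamp (optional, YYYYMMDDhhmmss) asks for the snapshot
    closest to that moment.
    """
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return "%s?%s" % (WAYBACK_AVAILABLE, urlencode(params))

def closest_snapshot(response_json):
    """Return the URL of the closest archived copy, or None if the
    Web Infrastructure holds no recoverable copy of the resource."""
    data = json.loads(response_json)
    snap = data.get("archived_snapshots", {}).get("closest")
    if snap and snap.get("available"):
        return snap["url"]
    return None

# Hypothetical (abridged) response in the availability API's format.
sample = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20071201000000/http://example.com/",
            "timestamp": "20071201000000",
            "status": "200",
        }
    }
})
```

A full crawler would repeat this lookup for every URL discovered on the lost site, preferring the repository whose copy is freshest or most canonical.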
Original language: American English
Qualification: Ph.D.
Awarding Institution
  • Computer Science
Supervisors/Advisors
  • Nelson, Michael L., Advisor, External person
State: Published - Dec 2007

Disciplines

  • Databases and Information Systems
