Web Archiving at the School of Information


Published: December 16, 2014 by Alyce Scott


“Imagine that this Internet that we take for granted…is no longer available in 50 years. Imagine a researcher in 50 years time wanting to study our time to see what we experience, how we live….You can’t study life in our time without the Internet. So we must preserve it.”

René Voorburg, National Library of the Netherlands [1]

This semester I taught a new course: LIBR 284: Tools, Services, and Methods for Digital Curation. Throughout the semester my students were immersed in ways to manage data and digital objects over the course of their lifecycle, to ensure they remain understandable, accessible, and useable over time. Web archiving is one way to do exactly that, but why is it necessary? Here’s a sobering reason, from Brewster Kahle (founder of the Internet Archive): “The Web was not designed to be preserved. The average life of a Web page is about 100 days.” [2]

If you are not familiar with this topic, in a nutshell, web archiving is the process of harvesting data from the World Wide Web; storing it in a repository; preserving it; and making it available for future research. It sounds like a fairly straightforward process, doesn’t it? Make no mistake: it is complicated and full of challenges, as my students can attest.

In August 2014, the School of Information began an academic partnership with Archive-It (built at the Internet Archive). Archive-It is “The leading web archiving service for collecting and accessing cultural heritage on the web.” [3] Excellent training and user support for Archive-It were provided by Jefferson Bailey, Lori Donovan, and the Archive-It team. Through this partnership, my students were able to use the open-source Heritrix crawler to harvest sites for the final assignment, a web archiving project. The assignment in brief: create a collection of 7-10 thematically related “seed” sites, then harvest and preserve them. In addition to selecting the objects for preservation (scoping), the students also created extensive collection-level and “seed site” metadata.
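For readers who have not met the idea of scoping before, here is a minimal sketch in Python of what an "in scope / out of scope" decision can look like. It is only an illustration under simple assumptions: a URL counts as in scope when it sits on the same host as a seed and under that seed's directory. This is not how Archive-It or Heritrix actually decide scope; the function name `in_scope` and the example.org addresses are hypothetical placeholders.

```python
from urllib.parse import urlsplit


def in_scope(url: str, seeds: list[str]) -> bool:
    """Illustrative scope rule (an assumption, not Archive-It's real logic):
    a URL is in scope if it shares a host with a seed and lives under
    that seed's directory."""
    target = urlsplit(url)
    for seed in seeds:
        s = urlsplit(seed)
        # Different host: this seed cannot bring the URL into scope.
        if target.hostname != s.hostname:
            continue
        # Use the seed's directory as the scope prefix, e.g.
        # "/ebola/index.html" -> "/ebola/" and "" -> "/".
        if s.path.endswith("/"):
            prefix = s.path
        else:
            prefix = s.path.rsplit("/", 1)[0] + "/"
        if target.path.startswith(prefix) or target.path + "/" == prefix:
            return True
    return False


# Hypothetical seed list for a small thematic collection.
seeds = [
    "https://example.org/ebola/",
    "https://wind.example.net/east/index.html",
]

print(in_scope("https://example.org/ebola/news.html", seeds))
print(in_scope("https://example.org/about.html", seeds))
```

Even a rule this simple shows why scoping takes thought: a page that a seed site links to, but that lives on another host or outside the seed's directory, is left out unless the rule is widened.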

The resulting collections are built around wide-ranging themes such as the Ebola outbreak, wind energy resources of the Eastern United States, and graffiti art of the Berlin Wall. Nearly 4 million documents (212 GB of data) were captured. The collections can be easily accessed in the following ways:

Was the web archiving perfectly successful? In a word, no – but that is to be expected. Some seed sites were captured more completely than others, but in spite of the many challenges and problems the students encountered, the overall results were very good! More importantly, this was an invaluable practical learning experience for my students, and I’d like to share some of the observations they made in their final papers and presentations about the assignment:

  • “Biggest metadata challenge: how do you describe a website about bouncing cats?”
  • “In scope? Out of scope? Another issue…was understanding what would actually be captured.”
  • “…the most challenging part of the process was analyzing our results and troubleshooting seeds that were not archiving the needed content.”
  • “Currently, archiving the internet sometimes feels like archiving a mess rather than creating a coherent arrangement of information.”

In a recent article in D-Lib Magazine, Jinfang Niu (University of South Florida) wrote, “Creating a web archive presents many challenges, and library and information schools should ensure that instruction in web archiving methods and skills is made part of their curricula, to help future practitioners meet those challenges.” [4] The SJSU School of Information has taken a step toward meeting this need.


[1] http://libraries.universityofcalifornia.edu/groups/files/coul/docs/meeting_docs_2013-14/WAS_brochure_LowRes.pdf

[2] http://www.nextpoint.com/blog/archive-website-content/

[3] https://www.archive-it.org/

[4] http://dlib.org/dlib/march12/niu/03niu1.html