Archiving the Web for Scholars

As the Internet becomes an increasingly important source of material for academic research, librarians try to preserve "ephemera of the Web."
May 6, 2011

NEW YORK CITY — Many scholars, while struggling to find and patch together the surviving fragments of historical documents, have probably longed for a time machine. In the era of Internet research, they might finally get their wish. Sort of.

The Internet Archive, a nonprofit founded in 1996, has provided libraries and other institutions with the tools to preserve “the ephemera of the Web” — websites and their various documents, images, videos, and links — not just by caching a snapshot of the “landing page,” but by copying and preserving entire domains that researchers can navigate just as they would have at any point in the site’s history — even if the site moves, changes, or disappears.

Many libraries are beginning to use the Internet Archive, and its popular Wayback Machine, to develop scholar-friendly archives of websites. The organization currently hosts collections of archived websites for more than 60 colleges and universities.

The idea is essentially to preserve websites the way libraries have long preserved newspapers via microform. As the Internet has increasingly become society’s medium of record, it has become common for the authors of scholarly papers to cite Web content that has no corresponding print documents. (Several academic style guides recently added guidelines for citing Twitter and Facebook content.)

“In many ways this is just a continuation of what libraries have always done,” says Robert Wolven, an associate university librarian for bibliographic services and collection development at Columbia University.

But while archiving newspapers — flat, homogeneous, serialized — was relatively straightforward, websites present a more slippery challenge. They evolve more fluidly: new content is added, and other content disappears without a trace. One can see exactly what The New York Times looked like on any given day in its history; not so with most websites. Nor are websites as sturdy as other media: the Internet Archive estimates that the average lifespan of a website is between 44 and 75 days.

Websites also tend to contain pages, files, and embedded objects or media that might not work if the site is simply scanned or printed out. Archivists can capture the content of a newspaper on a given day in its history simply by photographing its pages and storing them on microform or as digital files. Copying websites is a much more painstaking process. In some cases, it can take weeks. The librarians here at Columbia, which has been one of the Internet Archive’s more active partners, use an open-source “crawling” tool, called Heritrix, to copy certain websites once every three months.

Periodically copying and archiving functioning replicas of every website that might have scholarly value is, of course, too great a task for any one university, least of all for a handful of Columbia library staffers working on a $716,000 grant from the Andrew W. Mellon Foundation. Columbia has therefore bitten off a piece small enough to chew: 491 websites dedicated to the documentation of human rights efforts abroad. Human rights is a topic area that stands to benefit from website archiving, say the Columbia archivists. Many non-governmental organizations (NGOs) that document human rights abuses in other countries publish reports on the Web that never make it into print, and their sites are more liable to vanish than those of other publishers of original reports, taking their contributions to the historical record with them.

“Smaller NGOs are going to have more funding issues on a regular basis than larger, more established organizations,” says Wolven. “So if they’re depending on grant money or on any sort of periodic funding to keep their website up, as soon as that funding dries up the organization itself could disappear.” Such websites might also become targets for suppression by adversarial governments, Wolven says.

Field dispatches, commission findings, annual reports, press releases, and other types of content that some NGOs might only publish on their websites are potentially valuable to scholars in a number of fields, including history, international affairs, sociology, law, political science, and social work, says Wolven. And yet most libraries so far have not tried to collect this so-called “gray literature” and make it available to researchers “in a meaningful way,” he says.

Columbia and others collect and archive papers that NGOs publish and distribute, but the Web-only stuff has been largely ignored: at a recent conference presentation, the Columbia librarians noted that of 40 documents published on the website of Refugees International — a group that regularly publishes papers and field reports on displaced populations in 27 countries — none had been archived by Columbia’s Center for Human Rights Documentation and Research. Only 10 were listed by the Online Computer Library Center (OCLC), an international library cooperative. No library in the OCLC network held more than three of those 40 documents in its collection.

At Columbia the goal is to build a search interface that allows researchers to search its print catalog and its website archive at the same time, says Wolven.

Apart from preservation and discovery, website archiving is also crucial to making sure scholars can trace the Web-based evidence cited in scholarly papers back to its original sources, says Wolven. Websites are harder to pin down than journal-bound articles: Web addresses cited in footnotes sometimes point to a website that has expired, changed, or moved. A 2004 University of Illinois study examined website citations in three top online journals and found that about half of the URLs cited in their articles no longer pointed to the authors’ source material.

Web addresses have become so unreliable that the Modern Language Association recently stopped requiring scholars to include URLs when citing websites, instructing them instead to include information that might help readers hunt down the site with search engines. It would be simpler, of course, if they could just cite a library archive where the relevant version of the website is preserved in suspended animation, Wolven says.

Columbia is not the only institution that has teamed up with the Internet Archive to create such repositories. More than 60 colleges and universities store their website collections with the Internet Archive, which sells storage space to subscribers through a service called Archive-It. The service also provides the technology to render sites and their content so that researchers can browse them as they would a live site.

The American University in Cairo, for example, has a collection called “2011 Egyptian Revolution,” which contains blogs, Twitter feeds, photos, videos, and online news coverage of the political tumult that engulfed the Egyptian capital this spring. George Washington University has periodically archived websites devoted to Russian parliamentary elections. The University of Texas at Austin, like Columbia, has begun compiling a record of content published on human rights websites. Other universities have used the Internet Archive to store versions of their own institutional sites.

Still, persuading resource-strapped libraries to invest in Web archiving projects might be a tough sell. As with stemming global warming or reducing the national debt, the immediate benefits of archiving Web-based ephemera are not as compelling as the long-term ones. Since Columbia started archiving human rights sites in 2008, a number of the sites have retired some of their content; some have gone offline for unknown reasons, only to reappear later, sometimes at a different address; but only one has disappeared entirely. Most can still be found on the Web, albeit only in their current incarnations.

So while the Columbia librarians are quick to point out the advantages the archive already has over Google, they admit that this type of work will really start showing its value decades down the line, when it offers access to archived versions of extinct websites that otherwise would have been lost to history.

In the meantime, the Columbia librarians and their peers are in the ironic position of being as impermanent as many of the sites they have resolved to archive. “No one is funded to do this,” says Stephen Davis, director of the Libraries Digital Program at Columbia. “We’re highly reliant on grants and special efforts.”

In Columbia’s case, the assistance comes not only from foundations like Mellon but also from the Internet Archive itself. In any other context, it might seem odd for a university library to entrust a relatively young organization that has no formal affiliation with the university to house and protect some of its special collections, says Davis. “We’re trusting the Internet Archive to do this for us,” he says, “and that’s a big limb that we’re out on.”


