CDNLAO


CDNLAO Newsletter

No. 58, March 2007

Special topic: Archiving and Preservation of Online Publications

Web Archiving at the National Library of Australia
PANDORA: Australia's Web Archive

by National Library of Australia

Web archiving at the National Library goes back more than 10 years, to the establishment of the PANDORA Archive in 1996. From the outset the Library's approach was selective, pragmatic and practical, and as a result the PANDORA Archive includes content collected from a decade ago to the present.

The Library has strongly supported a collaborative approach to building the content of the PANDORA Archive by engaging the participation of Australian state and territory libraries and other major collecting institutions. Currently, there are 10 participant libraries selecting and archiving content for PANDORA including all the Australian mainland state libraries, the Northern Territory Library, the National Film and Sound Archive, the Australian War Memorial and the Australian Institute for Aboriginal and Torres Strait Islander Studies (AIATSIS).

The collaborative effort in collecting web content for the Archive has been made possible by the Library's development of a web archiving workflow management system called PANDAS (PANDORA Digital Archiving System). This web-based system allows curatorial staff in participating libraries to archive web resources through an infrastructure maintained at the National Library. PANDAS was first implemented as a production system in June 2001, with a second, enhanced version released in July 2002. A completely reengineered and enhanced third version of PANDAS is due for release in early 2007.

An important consideration for the Library is the provision of access to the archived web resources. By taking a selective approach to web archiving it has been possible to work at a scale that allows for the necessary permissions to be sought from the publishers prior to archiving and for MARC records to be created for the selected resources. Access to the archival content is by means of the PANDORA website portal which provides a full text Lucene search engine as well as subject and title listings. The archiving management system (PANDAS) also includes a restrictions module which allows curators to easily apply access restrictions to archived titles when required.

At the end of 2006 the figures for the size of the PANDORA Archive were:

 Number of archived titles: 13,719
 Number of archived instances: 27,933
 Number of files: 34,541,963
 Data size in terabytes: 1.3

Archiving the Australian web domain

The selective approach taken for the PANDORA Archive has produced an archive built with attention to selection, quality assessment of the archiving process, prior publisher permission and access to the content. These are among the advantages of selective archiving, but the resources the approach requires limit the scale of archiving that is possible. For this reason the National Library, in collaboration with the Internet Archive, began large-scale domain harvesting in 2005 as an adjunct to the selective archiving for PANDORA. To date two large harvest crawls have been completed, the first in June-July 2005 and the second in August-September 2006. The objective of both harvests was to crawl, as broadly and deeply as possible, the .au top level domain as well as non-.au content located on Australian host servers that could be identified by an automated geoIP lookup mechanism used during the crawl.
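The scope rule described above (anything under .au, plus non-.au hosts that a geoIP lookup places in Australia) can be sketched as a simple predicate. This is only an illustration under stated assumptions, not the crawler configuration actually used for the harvests: the function name in_harvest_scope, the country_of_host callback and the FAKE_GEOIP table are all hypothetical, and a real crawl would call an actual geoIP service.

```python
from typing import Callable, Optional
from urllib.parse import urlsplit


def in_harvest_scope(url: str,
                     country_of_host: Callable[[str], Optional[str]]) -> bool:
    """Return True if url is in scope for an Australian domain harvest.

    country_of_host is a hypothetical geoIP callback: it takes a host name
    and returns an ISO country code such as "AU", or None if unknown.
    """
    host = urlsplit(url).hostname  # already lower-cased; None if no host
    if host is None:
        return False
    host = host.rstrip(".")

    # Rule 1: everything under the .au top level domain is in scope.
    if host == "au" or host.endswith(".au"):
        return True

    # Rule 2: non-.au content is in scope only when the geoIP lookup
    # places the host server in Australia.
    return country_of_host(host) == "AU"


# Hypothetical stand-in for a geoIP service, for illustration only.
FAKE_GEOIP = {"example.com": "AU", "example.org": "US"}


def fake_lookup(host: str) -> Optional[str]:
    return FAKE_GEOIP.get(host)


if __name__ == "__main__":
    for candidate in ("http://www.nla.gov.au/",
                      "http://example.com/page",
                      "http://example.org/"):
        print(candidate, in_harvest_scope(candidate, fake_lookup))
```

Passing the lookup in as a callback keeps the .au rule testable on its own and leaves the choice of geoIP source open.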

Providing access to the content of the Australian domain crawls is a priority for the Library. While the content is not publicly accessible at this time, it is hoped that recent and proposed changes to relevant legislation will better support the Library's collecting and preservation responsibilities. Amendments to the Copyright Act in 2006 provide some recognition of and support for the needs of the Library's digital collections preservation activities. It is also hoped that in the near future Legal Deposit at the Commonwealth level in Australia will be extended to digital materials in such a way as to better support the efficient collection of web resources and the provision of access to such archival content.

Statistics for the two Australian web domain harvests include:

 Domain Harvest             2005           2006
 Unique files collected:    185,549,662    596,280,285
 Hosts crawled:             811,523        1,046,038
 Data size in terabytes:    6.69           19.04
 Duration of crawl:         4 weeks        5 weeks
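To put the two harvests side by side, the short sketch below derives growth ratios and files per host from the figures in the table. The dictionary layout and the helper names growth and files_per_host are illustrative assumptions. Only the input numbers come from the table.

```python
# Figures copied from the domain harvest table above.
HARVESTS = {
    2005: {"files": 185_549_662, "hosts": 811_523, "terabytes": 6.69},
    2006: {"files": 596_280_285, "hosts": 1_046_038, "terabytes": 19.04},
}


def growth(metric: str) -> float:
    """Ratio of the 2006 value to the 2005 value for one metric."""
    return HARVESTS[2006][metric] / HARVESTS[2005][metric]


def files_per_host(year: int) -> float:
    """Average number of unique files collected per host crawled."""
    return HARVESTS[year]["files"] / HARVESTS[year]["hosts"]


if __name__ == "__main__":
    for metric in ("files", "hosts", "terabytes"):
        print(f"{metric}: x{growth(metric):.2f}")
    for year in (2005, 2006):
        print(f"{year}: {files_per_host(year):.0f} files per host")
```

On these figures the 2006 harvest collected roughly 3.2 times as many unique files and 2.8 times as much data as the 2005 harvest, from about 1.3 times as many hosts.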

Strategies and collaborative alliances

With 10 years' experience in selective web archiving and two years of domain harvesting, the National Library is now in a strong position to review its future strategies for archiving Australian web resources. Such strategies will focus on collecting resources efficiently in a way that better addresses the ever-increasing scale of web publishing, while also recognising the value of timeliness and quality in the archiving process, the need for useful and practical description and resource discovery pathways, and the importance of preservation processes to sustain access to content over time.

In December 2006 a new Branch within the Library's Collections Management Division was formed to bring the Library's web archiving and digital preservation activities closer together. The new Web Archiving and Digital Preservation Branch, headed by Colin Webb, is strategically placed to better align the functions of collecting and describing web resources and developing and applying digital preservation management to the archival data.

The Library also remains strongly committed to collaborative alliances that further the work of web archiving and preservation through the development of standards, strategies and tools. At the international level, the Library was a founder member in 2003 of the International Internet Preservation Consortium (IIPC) and continues its association as an IIPC steering committee member. At the national level, the Library is currently involved with the Australian Partnership for Sustainable Repositories (APSR) and within this collaboration has focused on developing a PREMIS and METS based functional framework for preservation metadata requirements and an Automated Obsolescence Notification System (AONS).

Contact Details

Questions or comments regarding web archiving at the National Library of Australia may be addressed to:


Paul Koerbin
Manager Web Archiving
Web Archiving and Digital Preservation Branch
National Library of Australia
CANBERRA  ACT  2600
61-2-62621411
pkoerbin@nla.gov.au


Copyright (C)2007 National Library of Australia


Webmaster:

Branch Libraries and Cooperation Division, Administrative Department, National Diet Library
1-10-1 Nagata-cho, Chiyoda-ku, Tokyo 100-8924 Japan
Tel: +81-3-3581-2331 / Fax: +81-3-3508-2934 / E-mail: kokusai@ndl.go.jp
(The National Diet Library is responsible for the maintenance of the CDNLAO website)