


CDNLAO Newsletter

No. 66, November 2009

Special topic: Web archiving

Web Archiving Programme at National Library Singapore

By Sharmini Chellapandi and Siow Lian San, National Library Board Singapore

Introduction

The National Library Board (NLB) was established in 1995 with the passing of The National Library Board Act (1995). One of the functions of the Board is to provide a repository for library materials published in Singapore. This is effected through the Legal Deposit framework.

In this paper, "library materials" refers to:

a. any printed book, periodical, newspaper, pamphlet, musical score, map, chart, plan, picture, photograph, print and any other printed matter; and

b. any film (including microfilm and microfiche), film negative, tape, disc, sound track and any other device in which one or more visual images, sounds or other data are embodied so as to be capable (with or without the aid of some other equipment) of being reproduced therefrom [as per the National Library Board Act (Cap. 197), enacted in 1995]

In addition, the National Library Singapore (NLS) has the statutory responsibility for the maintenance and preservation of library materials deposited with the Board. This has been achieved through providing an appropriate storage environment for physical publications and reformatting them into microfilm or digital images for preservation and access.

Collection of Singapore Websites

The Internet is increasingly being used as the preferred medium for sharing content and conducting transactions. Many publications which used to be published in print are now only available online. Examples include the annual reports of NLB's parent ministry, the Ministry of Information, Communications and the Arts (MICA). Similarly, the Singapore Department of Statistics has been publishing its various statistical reports online on the official Statistics Singapore website since 2007. The Internet is also becoming a popular tool for social communication and interaction.

There is therefore a need for NLS to acquire, for archiving [1], Singapore-related online materials that represent the intellectual and creative output of Singapore publishing. By archiving Singapore-related websites [2] and making the information available for future access, NLS aims to build a sense of community, national identity and rootedness among Singaporeans. Social scientists and researchers of the future will then be able to depend on the National Library for documentation of the changes that have taken place, and to learn and write the story of Singapore's past and present from these documentary records.

In 2006, NLS embarked on its web-archiving programme, Web Archive Singapore (WAS), to harvest Singapore-related web content from the Internet. The comprehensive collection of websites follows a three-pronged approach: selective archiving, whole domain archiving and thematic archiving.

i) Selective Archiving

NLS started Phase One of WAS as a test-bed project. The goal was to create an initial archive of 1,000 selected websites reflecting various aspects of Singapore life and heritage.

Websites deemed to be of national or historical significance, or containing authoritative and reliable content, were selected for archiving. They included the websites of government agencies and national campaigns, schools, trade unions, co-operatives, registered societies and clan associations.

NLS took three months to crawl the 1,000 sites using three instances of the web crawler Heritrix [3]. On average, about 100 sites were archived per month per instance.

Today, these websites are still being archived at least three times a year. The list is reviewed regularly and updated as new government programmes or campaigns are launched.

ii) Whole Domain Archiving

Whole domain archiving of registered ".SG" websites started in early 2007. As of May 2009, there were close to 108,000 registered domains, as reflected on the Singapore Network Information Centre (SGNIC) [4] website. SGNIC is the local domain registrar. In 2009, NLB signed a Memorandum of Understanding with SGNIC to facilitate access to the list of registered domains on a bi-annual basis.

Web archiving allows NLS to cast its net over the Singapore portion of the World Wide Web to capture a snapshot of Singapore-related content for archiving. By capturing different snapshots at different points in time at regular intervals, NLS aims to feature useful information about the past for the benefit of researchers and future generations. NLS targets two snapshots a year if resources are available.

NLB's current in-house capacity for web archiving is about 500 sites per month per server, with five instances of Heritrix running on each server. To complete one domain crawl within a reasonable timeframe of three months per cycle, NLB would need to purchase 40 additional similar servers and make heavy use of computing, bandwidth and storage resources to crawl and index the huge number of websites involved. This model may not be practical because of the upfront capital costs required.
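The kind of capacity estimate described above can be sketched as a simple calculation. This is only an illustrative back-of-envelope sketch, not NLB's actual planning model: the function name `servers_needed` is hypothetical, and the domain count used is the May 2009 figure quoted earlier, taken purely as an example input. The inputs behind the estimate of 40 additional servers are not detailed in this paper, so the result below is not expected to match that figure.

```python
import math


def servers_needed(total_sites: int, months: int,
                   sites_per_server_per_month: int) -> int:
    """Smallest number of servers able to crawl total_sites within months."""
    capacity_per_server = months * sites_per_server_per_month
    return math.ceil(total_sites / capacity_per_server)


# Figures quoted in this paper.
SITES_PER_SERVER_PER_MONTH = 500  # five Heritrix instances on one server
CYCLE_MONTHS = 3                  # target duration of one domain crawl

# Illustrative input only: registered ".SG" domains as of May 2009.
registered_domains = 108_000

total = servers_needed(registered_domains, CYCLE_MONTHS,
                       SITES_PER_SERVER_PER_MONTH)
print(total)  # 72
```

Changing the domain count or the cycle length shows how quickly the hardware requirement grows, which is the reason a purchase-based model was considered impractical.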

Together with the Infocomm Development Authority (IDA) of Singapore, NLS explored the feasibility of using grid computing services to archive ".SG" domains. NLB completed the first cycle within five months using IDA's grid infrastructure, with one Heritrix instance installed per server. Today, NLB uses private operators of grid computing services regulated by IDA for whole domain archiving and indexing.

iii) Thematic Archiving

Thematic archiving is usually event-driven. The idea behind this strategy is to collect web pages from sites that are dedicated to an event and might disappear once the event is over. Examples are the "4 Million Smiles" website (http://www.smiles2006.com/), set up when Singapore hosted the IMF-World Bank conference in 2006, and the SARS website (http://www.sars.gov.sg). Neither is live today.

An event can be defined as something that

  • is of interest to a community of people
  • affects the social and/or cultural life of the people
  • creates debate among the population and is expected to be of importance to Singapore history or to have an impact on the development of society
  • generates new websites devoted to the event whose content is worth preserving.

Harvesting event-driven sites may require more intensive identification of the websites and web pages relating to the event, and involves more frequent harvesting. An event may cover a few or many websites. For example, during the 2006 Singapore Elections, political party sites and news websites were archived more often because they were updated frequently. An event could also be driven by a campaign, usually spearheaded by government agencies.

Events can be predictable (like elections) or unpredictable (such as disasters). Unpredictable events are captured 'on the fly' as they happen and unfold.

Access [5]

Archived materials are not made publicly accessible by default. Selected archived sites are made accessible on this website: http://was.nl.sg

Web Archive Singapore (WAS) Website (http://was.nl.sg)

Archiving and Preservation of Online Publications

NLS distinguishes between digital archiving and digital preservation. Downloading and storing materials in a safe place is a digital archiving strategy, which is short-term in nature. NLS is putting in place a digital preservation framework and system (infrastructure) to ensure that digital resources identified as having lasting value and significance are safely preserved for long-term access and use. The infrastructure will enable NLS to preserve the authenticity and integrity of the digital content through the generations of transformation that technological developments will bring.

The digital preservation infrastructure will comply with two core digital preservation standards: the ISO standard for the Open Archival Information System (OAIS, ISO 14721:2003) and the Trusted Digital Repository (TDR) requirements. Targeted to be ready by early 2011, it is a critical component of the E-content Lifecycle Management Programme, which, when completed, will form the foundation for other Digital Library programmes (such as the Knowledge Services infrastructure).

In the meantime, it is important that NLS archives websites and online publications in its collection before they are inadvertently changed or deleted.

Updates to NLB Act

Currently, NLS sends a notification letter or e-mail about archiving and access to the owners of selected websites.

Over the past year, NLB has been reviewing and updating the NLB Act and Regulations to allow for the legal deposit of digital materials and the archiving of selected Singapore websites. When the proposed legislative amendments are enacted by the Singapore Parliament and the new laws receive the President's assent, NLB will be empowered to undertake web-archiving activities within specified legal parameters and, in some cases, on the basis of consent given by the original owners of copyrighted materials.

Conclusion

The Internet is increasingly becoming a popular and prevalent space for formal and social communication. Archiving the web will become a more important task for libraries as more information is published online only. The areas that NLS is looking at to assist and improve its web archiving programme include amending the NLB Act; exploring suitable technologies to improve and automate some processes; and aligning its policies and procedures closely with international standards and best practices.

  1. The term 'Archiving' is used to refer to the act of downloading websites and web resources from the Internet and storing them on the Library's server or some form of offline storage.
  2. The term 'Websites' is used to refer to a collection of linked documents, mostly with the same basic Internet address (internal links), although there are often links to documents on other sites (external links). A URL that serves as the top-level address of a website will be said to point to that website's home page. The term includes other formats such as weblogs, discussion forums, virtual groups, private and commercial websites.
  3. Heritrix is one of the web crawling tools recommended by the IIPC for archiving websites. The IIPC (International Internet Preservation Consortium) consists largely of national libraries involved in web archiving. NLB is a member of the IIPC.
  4. SGNIC: http://www.nic.net.sg/
  5. Access policies are jointly decided between the website owners and NLB. Some websites are permitted access on-site at NLB premises. Some archived websites may remain closed for on-line and on-site public access as they may contain sensitive information.

Copyright (C) 2009 National Library Board Singapore


Webmaster:

Branch Libraries and Cooperation Division, Administrative Department, National Diet Library
1-10-1 Nagata-cho, Chiyoda-ku, Tokyo 100-8924 Japan
Tel: +81-3-3581-2331 / Fax: +81-3-3508-2934 / E-mail: kokusai@ndl.go.jp
(The National Diet Library is responsible for the maintenance of the CDNLAO website)