
National Diet Library Newsletter

No. 125, June 2002

Report: Web Resources as Cultural Heritage:
International Symposium on Web Archiving

This is a translation of the article of the same title in the NDL Monthly Bulletin No. 494.




Introduction

Information resources on the WWW continue to increase every day. How should we collect, archive, and preserve them, and how should we provide permanent access to them? With these questions in mind, the International Symposium on Web Archiving was held on January 30, 2002, in the auditorium of the Annex of the National Diet Library, under the general theme "Information on the World Wide Web as cultural heritage". There were 216 participants from inside and outside Japan. This paper is a brief report; the full proceedings of the symposium are due to be published later.
 

Public Access to Digital Materials: Roles, Rights, and Responsibilities of Libraries by Dr. Brewster Kahle (Director, The Internet Archive)

Dr. Brewster Kahle

The combination of a deep understanding of and care for cultural heritage with the best of today's technologies will make Japan a very strong player in building a digital library. In an information age when preservation and communication technologies progress by leaps and bounds, some people think that libraries are no longer needed, that their time is over. But this is surely wrong. Libraries have a significant role and responsibility in preserving our cultural heritage and providing universal access to digital materials. This may not be profitable, but it is still important.

The Internet Archive 1) has been collecting the WWW in collaboration with other companies and groups. Alexa Internet has been donating a copy of the WWW to the Internet Archive for five years now. The collection is over 100 terabytes (TB) in size, comprising over 16 million different websites and over 10 billion pages. Alexa has also attempted to create metadata automatically and then to do subject indexing of websites.
As for offering material under copyright, after we talked informally (that is, off the record) with the head of the Copyright Office in the United States about one such collection, the following approach was adopted: we would make the collection available at first, and if people asked us to take it down, we would take it down and explain about "robot exclusions" 2) 3). So far, there has been no trouble. On rights issues, we first crawl websites, and if a site owner contacts us we explain what it is that we are doing. The lesson for us was to try to make the material available and see what happens.
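
The robot exclusion mechanism mentioned here (see notes 2 and 3) is a convention that a crawler consults before fetching pages. As a minimal sketch, assuming a hypothetical site and crawler name, the following Python code shows how a harvester might check a site's robots.txt before collecting a page.

    # Minimal sketch: consulting a site's robots exclusion rules before harvesting.
    # The site URL and user-agent name below are illustrative placeholders.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://www.example.org/robots.txt")   # hypothetical site
    rp.read()                                         # fetch and parse robots.txt

    user_agent = "example-archiver"                   # placeholder crawler name
    page = "http://www.example.org/some/page.html"

    if rp.can_fetch(user_agent, page):
        print("Allowed to harvest:", page)
    else:
        print("Excluded by robots.txt, skipping:", page)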

In addition, there is also a collection of television news. From September 11 to 18, 2001, for one week, 24 hours a day, we recorded the news from 20 channels around the world. It has been used to some extent by scholars and historians, and has been very positively received.
Today, I would like to present a gift to the National Diet Library from the Internet Archive: a small sample of the Japanese websites that the Internet Archive has collected over the years. The collection holds 20 million web pages from Japan, from 1996, 1997 and 1998. We hope this collection will be useful in starting the Web Collection of the National Diet Library.


MINERVA: Mapping the Internet Electronic Resources Virtual Archive by Ms. Cassy Ammen (Reference Librarian, Library of Congress)

Ms. Cassy Ammen

The Library of Congress's mission is to make its resources available and useful to the Congress and the American people and to sustain and preserve a universal collection of knowledge and creativity for future generations. 
The MINERVA Web Preservation Project 4) was established to initiate a broad program to collect and preserve materials on the World Wide Web. The project team has developed a prototype system that provides access to an archive of web materials, and created representative collections of websites related to the 2000 U.S. national election and the September 11 terrorist attacks.
In our first pilot projects, we planned to take an open access approach and to select web materials using both a selective collection approach and a bulk collection approach 5). In the former, library staff select, collect, and catalog websites and then build a prototype access system to test and develop procedures for a production system. The latter was a project with the Internet Archive, and was a very productive collaboration. We are now in a new phase of the pilot, working with a newly formed group known as Web Archivist 6).
With regard to copyright, work is under way to interpret the statutory authority the Library has already been granted, so as to carry over what we have done with analog formats into the digital context. Interpretations should be consistent with the Library's established practices for non-digital materials, including our regular safeguards for rights holders' interests.
Next, the procedure for selection and collection is described. First, a small group of officers in the Library of Congress recommends the websites to be collected, based on various criteria. The website is then downloaded using a mirroring program 2). A snapshot is stored in an archive, and additional snapshots are made at selected time intervals.
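
The talk does not describe the mirroring software itself; as a rough sketch of what a depth-limited harvest involves (see note 2), the following Python code fetches a start page and follows links within the same site down to a fixed depth, keeping each page as a snapshot in memory. The start URL, depth, and function names are assumptions for illustration, not the Library of Congress's actual tools.

    # Rough sketch of a depth-limited harvest. The start URL and depth are
    # illustrative; a production mirroring program would also store images,
    # style sheets, and other embedded files.
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    class LinkCollector(HTMLParser):
        """Collects the href targets of <a> tags on a page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def harvest(start_url, max_depth=2):
        """Fetch pages reachable from start_url on the same host, up to max_depth."""
        host = urlparse(start_url).netloc
        seen, queue, pages = set(), [(start_url, 0)], {}
        while queue:
            url, depth = queue.pop(0)
            if url in seen or depth > max_depth:
                continue
            seen.add(url)
            try:
                html = urlopen(url).read().decode("utf-8", errors="replace")
            except OSError:
                continue                      # skip unreachable pages
            pages[url] = html                 # keep the snapshot in memory
            parser = LinkCollector()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)
                if urlparse(absolute).netloc == host:
                    queue.append((absolute, depth + 1))
        return pages

    # Example (hypothetical site): snapshot = harvest("http://www.example.org/", max_depth=1)
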
For cataloging, we are using a system known as CORC (Cooperative Online Resource Catalog), which is part of the OCLC (Online Computer Library Center) cataloging system. MARC records generated through the CORC interface are then imported into the cataloging module of the Library of Congress's OPAC system, where we add subject headings, persistent identifiers 7), and so on. For access, we developed a web-based prototype system modeled on the PANDORA Project in Australia 8). You can search for websites by title, by subject, and by URL.
For preservation, the Library of Congress has been appointed the lead agency among several government agencies to develop the National Digital Information Infrastructure and Preservation Program (NDIIPP). Another activity that the Library is beginning to participate in is a joint project with OCLC and the Government Printing Office (GPO).
As for future explorations, we have to continue to study copyright and access issues, the legal deposit system, bulk collection, and the establishment of selection criteria.
In the area of international cooperation, there are some interesting projects. For example, the Electronic Resource Preservation and Access Network (ERPANET) is a European Union endeavor that is trying to preserve scientific information. UNESCO will be discussing, in its next budgetary year, the preservation of our digital heritage, so that it might become a UNESCO program 9). Next fall in Rome, the European Conference on Digital Libraries (ECDL) 10) will be held.
If we are effectively to preserve for future generations the portion of this rapidly expanding corpus of information in digital form, we need to commit ourselves technically, legally, economically and organizationally to the full dimensions of the task. Failure to look for trusted means and methods of digital preservation will certainly exact a stiff, long-term cultural penalty.


Archiving the Web: The National Collection of Australian Online Publications by Ms. Margaret E. Phillips (Manager, Digital Archiving, National Library of Australia)

Ms. Margaret E. Phillips

Recognizing that online publications are an intrinsic part of the documentary heritage, the National Library of Australia, together with a number of partners, is building the National Collection of Australian Online Publications. 
Ensuring long-term access to online publications can be seen as a two-step process. First, the materials have to be identified, collected and made accessible in their current, or native, formats. That is the archiving process. Second, the materials have to be managed in such a way that they remain accessible as technology changes. This is the preservation process. The first step received most attention when we set up the PANDORA Archive in 1996. The preservation process has, however, been the growing focus of the Library's efforts as the archive has moved beyond the proof of concept stage to the operational National Collection. 
The National Collection of Australian Online Publications is an extremely selective one, containing to date only 2,000 websites. Nevertheless, it already constitutes a representative sample of Australian web publishing by academic, government and commercial publishers, as well as community organizations. A number of the websites captured in the archive, including the official website for the Sydney Olympic Games, have already disappeared from the live Internet. Moreover, about one-third of the sites have been captured on multiple occasions 11). The Collection now comprises almost 11 million files, and uses 320 gigabytes of storage. It is growing at about 500 new titles each year. 
The National Library of Australia has deliberately pursued the selective approach to archiving for its advantages of quality control and of obtaining permission from publishers to archive and provide access. It has also developed a set of selection guidelines. In the future, however, we would like to supplement this selective collection with periodic snapshots of the entire Australian domain, working together with other agencies such as the Internet Archive.
The National Library of Australia and the National Film and Sound Archive, in a joint submission to the Copyright Law Review Committee, recommended amending the legal deposit system to include non-print publications.
In addition, the National Library has been working with the Australian Publishers Association to develop a Code of Practice so as not to jeopardize publishers' commercial interests. In 2000, the Library engaged a consultant to provide advice on the direction it should take in relation to persistent identification of digital objects, and has implemented the resulting guidelines. There are also a number of other authenticity issues 12), as well as problems related to the archiving of databases 13).
In 2001, a very small trial migration 14) of 127 files was successfully undertaken. In 2000, RLG and OCLC invited the Library and a number of other institutions to join an international working group charged with proposing a draft international standard for preservation metadata. The working group has taken as its starting point the approaches to preservation metadata of NEDLIB 15) and Cedars 16). The recommendations for the standard are expected in early 2002.
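
Neither the trial migration nor the draft metadata standard is described in detail here. As a minimal sketch of the general idea of migration (see note 14), assuming hypothetical directory names, the following Python code transfers a set of files to a new storage location while recording simple preservation metadata (file name, size, and checksum) so that the copies can later be verified. It is not the National Library of Australia's procedure or the RLG/OCLC working group's standard.

    # Illustrative sketch only: copy files to a new location and record basic
    # preservation metadata. Paths and field names are invented for illustration.
    import hashlib
    import json
    import shutil
    from pathlib import Path

    def migrate(source_dir, target_dir):
        source, target = Path(source_dir), Path(target_dir)
        target.mkdir(parents=True, exist_ok=True)
        records = []
        for path in source.iterdir():
            if not path.is_file():
                continue
            data = path.read_bytes()
            shutil.copy2(path, target / path.name)        # transfer the file
            records.append({
                "file": path.name,
                "size": len(data),
                "md5": hashlib.md5(data).hexdigest(),     # fixity value for later checks
            })
        (target / "preservation_metadata.json").write_text(json.dumps(records, indent=2))

    # Example (hypothetical paths): migrate("archive/old_store", "archive/new_store")
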
After five years, we have learned our lessons. In the digital environment, it is advantageous to liaise and cooperate with a wide range of parties. Taking a practical approach and learning by doing has worked well for us, rather than considering the issues only theoretically. A team-based approach to devising and implementing policy and procedures enables us to draw on the expertise of a wide range of staff within the Library, and helps to motivate their commitment to the work. As a national deposit library, the National Library of Australia has clear responsibilities for collecting and preserving the documentary heritage of Australia in all its forms, and is ready to meet future challenges.


Danish Legal Deposit on the Internet by Ms. Birgit N. Henriksen (Head of Digitization and Web Department, The Royal Library, Denmark)

Ms. Birgit N. Henriksen

In 1997, the Danish legislation on legal deposit was modernized and updated, enabling the Royal Library to collect "static" (as opposed to "dynamic") works on Danish websites.
As for the process of legal deposit of static works on the web: first, the publisher goes to the legal deposit website and fills in the metadata in a form. The metadata contains information such as the name, phone number and data format, whether any special program is needed, and whether a user ID and password are required.
The library staff of the Danish Department determine whether the law covers the publication and, if it does, download all files belonging to the work using our own harvester, verify that all items have been received, and check that all hyperlinks are valid. Next, the work is catalogued and classified, and finally it is transferred to the archival server. However, we are not allowed to give access to deposited digital works over the net, and this part of our system is therefore not public. The archived net publications can only be viewed in the reading rooms of the two legal deposit libraries.
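
The verification step is not specified further in the talk. As a minimal sketch, assuming a hypothetical directory layout, the following Python code checks that every relative link in the downloaded HTML files of a work points to a file that was actually received; it is only an illustration, not the Royal Library's verification tool.

    # Illustrative only: report relative links that do not resolve to received files.
    from html.parser import HTMLParser
    from pathlib import Path
    from urllib.parse import urlparse

    class HrefCollector(HTMLParser):
        """Collects href and src attribute values found in a page."""
        def __init__(self):
            super().__init__()
            self.hrefs = []
        def handle_starttag(self, tag, attrs):
            for name, value in attrs:
                if name in ("href", "src") and value:
                    self.hrefs.append(value)

    def verify_work(work_dir):
        """Return (file, link) pairs whose targets are missing from work_dir."""
        root = Path(work_dir).resolve()
        received = {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}
        missing = []
        for html_file in root.rglob("*.htm*"):
            parser = HrefCollector()
            parser.feed(html_file.read_text(errors="replace"))
            for href in parser.hrefs:
                path = urlparse(href).path
                if urlparse(href).scheme or not path:   # skip external links and bare fragments
                    continue
                target = (html_file.parent / path).resolve()
                try:
                    rel = target.relative_to(root).as_posix()
                except ValueError:
                    continue                            # link leads outside the work; ignored here
                if rel not in received:
                    missing.append((html_file.name, href))
        return missing

    # Example (hypothetical path): problems = verify_work("deposits/work_0001")
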
In four years, we have collected 10,000 net publications, consisting of nearly 700,000 files and a total volume of only 23 gigabytes. Two-thirds of the publications are from the public sector such as the government or universities. The material collected mainly consists of working papers, reports, scientific reports, guides, periodicals and newsletters.
For this selective web archiving, we do not need much hardware, so nearly all the costs are manpower. In order to decrease the amount of work, we have skipped cataloging except for periodicals.
Bulk collection, on the other hand, requires techniques such as harvesting the entire Danish web space. It is necessary to harvest not only the ".dk" domain but also publications placed in the ".com" domain. In addition, some material that is available to a human user is simply not available to a harvester, for instance streaming content 17) and material built with Flash applications 18). In order to archive dynamic publications, contracts or agreements had to be signed with the publishers. We sent proposals for agreements to publishers, but regrettably only a few responded.
We wish to archive a broad range of the types of material to be collected while minimizing the cost involved. Harvesting the entire domain is suitable for this purpose and should be used to gather net material. But harvesting cannot solve all problems; it is still necessary to collect selectively for various purposes and to use different archiving methods. We hope that in the future it will be possible to find a solution in which the materials can be freely accessed from the Internet.


Collecting and Archiving Web Resources and the National Diet Library by Ms. Machiko Nakai (Director, Electronic Library Development Office, National Diet Library)

Ms. Machiko Nakai

At present, we plan to collect information resources on the network selectively, not through a legal deposit system. This is because the amount of digital resources is huge, and it is difficult to require publishers to convert them into a physical format and deposit them with us.
As a standard for bibliographic description, we designed the National Diet Library Metadata Element Set in March 2001. It is based on the Dublin Core Metadata Element Set, and we adopted some original qualifiers that enable mapping to the JAPAN/MARC format.
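
As a rough sketch of what such a Dublin Core-based record might look like, assuming invented values and qualifiers (the actual NDL Metadata Element Set and its mapping to JAPAN/MARC are not reproduced here), a record for a harvested website could be expressed as follows.

    # Illustrative only: a Dublin Core-style record for a harvested website.
    # The element names are standard Dublin Core; the qualifiers and values
    # are invented and do not reproduce the NDL Metadata Element Set.
    record = {
        "title": "Example Online Bulletin",
        "creator": "Example Publishing Office",
        "publisher": "Example Publishing Office",
        "date.issued": "2001-10-01",                     # qualified element (hypothetical)
        "identifier.uri": "http://www.example.go.jp/bulletin/",
        "format": "text/html",
        "language": "jpn",
        "type": "text",
    }

    # Each element could later be mapped to the corresponding field of a
    # JAPAN/MARC record when the resource is registered in the catalog.
    for element, value in record.items():
        print(f"{element}: {value}")
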
Furthermore, we are developing WARP (Web Archiving Program: provisional name), a total system for the acquisition, archiving, maintenance, cataloguing, and provision of online resources. We designed a prototype for metadata entry by March 2000, and are now developing a prototype of the acquisition function.
The workflow is as follows: selecting the resources to be acquired; examining the structure of the websites; negotiating and contracting with publishers for acquisition; specifying the unit of the information resources to be collected; creating metadata; setting harvesting and re-harvesting conditions; trimming to remove non-essential parts of the information; and registering the individual objects. The main method of acquisition is assumed to be a web robot.
Web resources have the following characteristics: the unit of a resource is difficult to define; they are easily updated, changed and deleted; and they have no hierarchical structure. The task is how WARP should deal with these characteristics. We are trying to set standards, such as harvesting conditions and the time interval for re-harvesting, through repeated experiments, as sketched below. As for web-based databases, we plan to construct a navigation service based on metadata.
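
As an illustration of how harvesting conditions and a re-harvesting interval might be recorded for each resource, assuming hypothetical field names and values (the actual WARP design is not described at this level of detail), the following Python sketch defines a per-resource record and computes the date of the next scheduled snapshot.

    # Hypothetical sketch only: a per-resource record of harvesting conditions
    # and the re-harvesting interval. Field names and values are invented.
    from dataclasses import dataclass
    from datetime import date, timedelta

    @dataclass
    class HarvestTarget:
        url: str                  # unit of the information resource to collect
        max_depth: int            # harvesting condition: how deep to follow links
        reharvest_days: int       # time interval between snapshots
        last_harvested: date

        def next_harvest(self) -> date:
            """Date on which the next snapshot should be taken."""
            return self.last_harvested + timedelta(days=self.reharvest_days)

    target = HarvestTarget(
        url="http://www.example.go.jp/report/",   # hypothetical publisher site
        max_depth=3,
        reharvest_days=90,                        # e.g. re-harvest quarterly
        last_harvested=date(2002, 1, 30),
    )
    print(target.next_harvest())                  # -> 2002-04-30
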
In early 2002, the Legal Deposit System Council will begin to discuss a legal deposit system for online electronic resources, including the definition of "publishing" and copyright issues. The results of its deliberations will also be reflected in the WARP project.

Questions & Answers

After the lectures, a question-and-answer session was held, moderated by Mr. Hiroyuki Taya (Director, Foreign Materials Division, Acquisitions Department). The issues discussed included the reliability of collected information, long-term preservation, formats for preservation, and security management.
An impressive comment was made on the reliability of collected information: materials on the WWW are short-lived and there is no time to weigh their reliability, so the first thing to do is to collect and preserve them before they disappear.

Round-Table Discussion

The next day, a round-table discussion was held at the working level on institutional and technical problems. Most participants shared the opinion that both bulk collection and selective collection are necessary, and that metadata should be assigned automatically. It was also suggested that it is important to collaborate with the parties concerned, such as publishers, at the experimental stage.

Conclusion

This symposium on the ambitious theme of "Preservation of the Internet" was a success. It is probably the first time that this theme has been taken up publicly in Japan. We hope it offered a glimpse of the library of the 21st century.

Notes

1) The Internet Archive is a nonprofit organization founded with the purpose of collecting and preserving information on the Web around the world. Its vast resources have been made available to the public through an interface called the "Wayback Machine" since October 2001. For details, see http://www.archive.org 
2) "Robots" are software programs that collect Websites automatically. Some are called "Web Robot", "Harvester", "Web-Crawler", or "Mirroring Program". Once you fix a URL as the starting point, the robot harvests a website as far as the depth designed, traversing hyperlinks recurrently. Collecting by robot is often called "harvesting".
3) "Robots exclusion" is a method that allows Web site administrators who do not want their site to be registered in search engines, to indicate their wish to exclude robots. The Internet Archive does not collect the sites set up for robot exclusion, while the National Library of Finland, for example, includes such sites.
4) http://www.rlg.org/preserv/diginews/diginews5-2.html#feature1 
5) The "bulk collection" approach is to harvest the information on the web by Web Robot in a wide range such as a whole country.
6) The Web Archivist is a research project to facilitate the archiving of specialized collections of web materials. For details, see http://www.webarchivist.org 
7) "Persistent identifier" is a name assigned to information resource for assuring permanent access, which will remain the same regardless of where the resource is located. For example, the URN (Uniform Resource Name) is a kind of persistent identifier designed for improving the URL. The JP number and the ISSN are also authorized as its namespace.
8)http://pandora.nla.gov.au 
9) http://unesdoc.unesco.org/images/0012/001255/125523e.pdf 
10) The official name is the "European Conference on Research and Advanced Technology for Digital Libraries". In 2001, this conference was held in Darmstadt, Germany. Web archiving was one of the main subjects. 
For details, see http://www.bnf.fr/pages/infopro/dli_ECDL2001.htm
About ECDL2002, see http://www.ecdl2002.org 
11) Most of the information on the Web is revised frequently, so it is necessary to re-collect it regularly.
12) A document in electronic form is easier to tamper with than a conventional paper document, so the problem is how to ensure reliability, authenticity, and admissibility as evidence.
13) Dynamic websites such as databases, which create and display their contents on the fly, are called the "deep web" and cannot be harvested by a web robot. Methods of collecting the deep web also need to be explored.
14) "Migration" is the transfer of digital data from one system or format to another.
15) The official name is the "Networked European Deposit Library". The NEDLIB is a collaborative project of European national deposit libraries, and an attempt to design metadata for long-term preservation or as an original harvester tool.
For details, see http://www.kb.nl/coop/nedlib/ 
16) The official name is the "CURL Exemplars in Digital Archives". The Cedars is the digital preservation project promoted by the CURL (Consortium of University Research Libraries), UK, and other organizations.
For details, see http://www.leeds.ac.uk/cedars/index.htm 
17) "Streaming" is a method by which a sound or movie file is sent from the server and executed at the same time.
18) "Flash" is a software that allows you to create web-contents including sound, graphics, and animation. 
* Last access to referenced URLs above: May 10, 2002.
 

(Digital Library Division, Projects Department, 
Kansai-kan of the National Diet Library)
