Introduction
Information resources on the WWW continue to increase
every day. How should we collect, archive, and preserve them, and how should
we provide permanent access to them? With these questions in mind, the
International Symposium on Web Archiving was held on January 30, 2002, at
the auditorium in the Annex of the National Diet Library, under the general
theme of "Information on the World Wide Web as cultural heritage". There
were 216 participants from inside and outside Japan. This paper is a brief
report on the symposium; the full proceedings are due to be published later.
Public
Access to Digital Materials: Roles, Rights, and Responsibilities of Libraries
by Dr. Brewster Kahle (Director, The Internet Archive)

The combination of a deep understanding and carrying
forward of cultural heritage with the most advanced technologies will make
Japan a very strong player in building a digital library. In an information
age in which preservation and communication technologies progress by leaps
and bounds, some people think that libraries are no longer needed, that their
time is over. This is surely wrong. Libraries have a significant role and
responsibility in preserving our cultural heritage and providing universal
access to digital materials. This may not be profitable, but it is still
important.
The Internet Archive 1) has
been collecting the WWW in collaboration with other companies and groups.
Alexa Internet has been donating a copy of the WWW to the Internet Archive
for five years now. The collection is over 100 terabytes (TB) in size,
covering over 16 million different websites and over 10 billion pages.
Alexa has also attempted to create metadata automatically and to do subject
indexing of websites.
As to offering material under copyright: after
we talked informally (that is, off the record) with the head of the Copyright
Office in the United States about one such collection, the following approach
was adopted. We make the collection available first, and if people ask us
to take material down, we take it down and explain about "robot
exclusions" 2) 3). So far, there has been no trouble.
On rights issues, we first crawl websites, and if a site owner contacts
us we explain what it is that we are doing. The lesson for us was to try
to make the material available and see what happens.
In addition, there is also a collection of
television news. For one week, from September 11 to 18, 2001, we recorded
the news 24 hours a day from 20 channels around the world. It has been used
to some extent by scholars and historians, and has been very positively
received.
Today, I would like to present a gift from the
Internet Archive to the National Diet Library: a small sample of the
Japanese websites that the Internet Archive has collected over the years.
It contains 20 million web pages from Japan, gathered in 1996, 1997 and
1998. We hope this collection will be useful in starting the Web Collection
of the National Diet Library.
MINERVA:
Mapping the Internet Electronic Resources Virtual Archive by Ms. Cassy
Ammen (Reference Librarian, Library of Congress)

The Library of Congress's mission is to make its
resources available and useful to the Congress and the American people
and to sustain and preserve a universal collection of knowledge and creativity
for future generations.
The MINERVA Web Preservation Project 4)
was established to initiate a broad program to collect and preserve materials
on the World Wide Web. The project team has developed a prototype system
that provides access to an archive of web materials, and has created
representative collections of websites related to the 2000 U.S. national
election and the September 11 terrorist attacks.
In our first pilot projects, we planned to take
an open access approach and to select web materials under both a selective
collection approach and a bulk collection approach 5). In the
former, library staff select, collect, and catalog websites and then
build a prototype access system to test and develop procedures for a production
system. The latter was a project with the Internet Archive, and was a very
productive collaboration. We are now in a new phase of the pilot, working
with a newly formed group known as WebArchivist 6).
With regard to copyright, work is under way to
interpret the authority the Library has already been granted by statute,
so as to carry what we have done with analog materials into the digital
context. Interpretations should be consistent with the Library's established
practices for non-digital materials, including our regular safeguards for
rights holders' interests.
Next, the procedure for selection and collection
is described. First, a small group of officers in the Library of Congress
recommends the websites we should collect, based on various criteria. Each
website is then downloaded using a mirroring program 2).
A snapshot is stored in the archive, and additional snapshots are made at
selected time intervals (a rough sketch of this step follows).
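The mirroring program used by the Library is not specified in this report, so the following is only a minimal sketch, assuming the common wget tool, of how a recommended site might be mirrored into a dated snapshot directory; the URL and directory layout are hypothetical.

```python
# Minimal sketch (not the Library's actual tooling): mirror one recommended
# website into archive/<host>/<YYYY-MM-DD>/ using wget.
import subprocess
from datetime import date
from pathlib import Path

def take_snapshot(url: str, archive_root: str = "archive") -> Path:
    host = url.split("//", 1)[-1].split("/", 1)[0]
    snapshot_dir = Path(archive_root) / host / date.today().isoformat()
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["wget", "--mirror", "--page-requisites", "--convert-links",
         "--no-parent", "--directory-prefix", str(snapshot_dir), url],
        check=True,
    )
    return snapshot_dir

# Additional snapshots at selected intervals are simply repeated calls,
# each landing in a new dated directory.
# take_snapshot("http://www.example.gov/")
```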
For our cataloging, we are using a system known
as CORC (Cooperative Online Resource Catalog), which is part of the OCLC
(Online Computer Library Center) cataloging system. MARC records generated
through the CORC interface are then imported into the cataloging module
of the Library of Congress's OPAC system, where we add subject headings,
a persistent identifier 7), and so on. For access, we developed a web-based
prototype system modeled on the PANDORA Project in Australia
8). You can search for websites by title, by subject,
and by URL.
For preservation, the Library of Congress has
been appointed the lead agency among several government agencies to develop
the National Digital Information Infrastructure and Preservation Program
(NDIIPP). Another activity that the Library is beginning to participate
in is a joint project with OCLC and the Government Printing Office (GPO).
As for future explorations, we have to continue
to study copyright and access issues, the legal deposit system, bulk
collection, and the establishment of selection criteria.
In the area of international cooperation, there
are some interesting projects. For example, the Electronic Resource
Preservation and Access Network (ERPANET) is a European Union endeavor
that is trying to preserve scientific information. UNESCO will be discussing,
in its next budgetary year, the preservation of our digital heritage, so
that it might become a UNESCO program 9). Next fall in Rome,
the European Conference on Digital Libraries (ECDL) 10)
will be held.
If we are effectively to preserve for future
generations the portion of this rapidly expanding corpus of information
in digital form, we need to commit ourselves technically, legally, economically
and organizationally to the full dimensions of the task. Failure to look
for trusted means and methods of digital preservation will certainly exact
a stiff, long-term cultural penalty.
Archiving
the Web: The National Collection of Australian Online Publications by Ms.
Margaret E. Phillips (Manager, Digital Archiving, National Library of Australia)

Recognizing that online publications are an intrinsic
part of the documentary heritage, the National Library of Australia, together
with a number of partners, is building the National Collection of Australian
Online Publications.
Ensuring long-term access to online publications
can be seen as a two-step process. First, the materials have to be identified,
collected and made accessible in their current, or native, formats. That
is the archiving process. Second, the materials have to be managed in such
a way that they remain accessible as technology changes. This is the preservation
process. The first step received most attention when we set up the PANDORA
Archive in 1996. The preservation process has, however, been the growing
focus of the Library's efforts as the archive has moved beyond the proof
of concept stage to the operational National Collection.
The National Collection of Australian Online
Publications is an extremely selective one, containing to date only 2,000
websites. Nevertheless, it already constitutes a representative sample
of Australian web publishing by academic, government and commercial publishers,
as well as community organizations. A number of the websites captured in
the archive, including the official website for the Sydney Olympic Games,
have already disappeared from the live Internet. Moreover, about one-third
of the sites have been captured on multiple occasions 11).
The Collection now comprises almost 11 million files, and uses 320 gigabytes
of storage. It is growing at about 500 new titles each year.
The National Library of Australia has deliberately
pursued the selective approach to archiving for its advantages of quality
control and of obtaining permission from publishers to archive and provide
access. It has also developed a set of selection guidelines. In the future,
however, we would like to supplement this selective collection with periodic
snapshots of the entire Australian domain, working together with other
agencies such as the Internet Archive.
The National Library of Australia and the National
Film and Sound Archive, in a joint submission to the Copyright Law Review
Committee, recommended amending the legal deposit system to include
non-print-based publications.
In addition, the National Library has been working
with the Australian Publishers Association to develop a Code of Practice
so as not to jeopardize publishers' commercial interests. In the year 2000,
the Library engaged a consultant to provide advice on the direction it
should take in relation to persistent identification of digital objects,
and has implemented the resulting guidelines. There are also a number of
other authenticity issues 12), as well as problems related to the archiving
of databases 13).
In 2001, a very small trial migration 14)
of 127 files was successfully undertaken. In the year 2000, RLG and OCLC
invited the Library and a number of others to join an international working
group charged with proposing a draft international standard for preservation
metadata. The working group has taken as its starting point the approaches
to preservation metadata of NEDLIB 15) and Cedars 16). Recommendations
for the standard are expected in early 2002.
After five years, we have learned some lessons. In
the digital environment, it is advantageous to liaise and cooperate with
a wide range of parties. Taking a practical approach and learning by doing,
rather than considering the issues only theoretically, has worked well for
us. A team-based approach to devising and implementing policy and procedures
enables us to draw on the expertise of a wide range of staff within the
Library, and helps to motivate their commitment to the work. As a national
deposit library, the National Library of Australia has clear responsibilities
for collecting and preserving the documentary heritage of Australia in
all its forms, and is ready to meet future challenges.
Danish
Legal Deposit on the Internet by Ms. Birgit N. Henriksen (Head of Digitization
and Web Department, The Royal Library, Denmark)

In 1997, the Danish legislation on legal deposit
was modernized and updated, enabling the Royal Library to collect "static"
(as opposed to "dynamic") works on Danish websites.
As for the process of legal deposit of static
works, a publisher first has to go to the legal deposit website and fill
in a metadata form. The metadata includes the name, phone number, data
format, whether any special program is required, and whether a user ID
and password are needed.
The library staff at the Danish Department determine
whether the law covers the publication and, if it does, download all files
belonging to the work using our own harvester, verify that all items
have been received, and check that all hyperlinks are valid. Next, the work
is catalogued and classified, and finally it is transferred to the archival
server. However, we are not allowed to give access over the net to deposited
digital works, and this part of our system is therefore not public. The
archived net publications can only be viewed in the reading rooms of the
two legal deposit libraries.
In four years, we have collected 10,000 net publications,
consisting of nearly 700,000 files and a total volume of only 23 gigabytes.
Two-thirds of the publications are from the public sector such as the government
or universities. The material collected mainly consists of working papers,
reports, scientific reports, guides, periodicals and newsletters.
For this selective web archiving, we do not need
much hardware, so nearly all the cost is manpower. In order to decrease
the amount of work, we have skipped cataloging except for periodicals.
On the other hand, for bulk collection we have
to use techniques such as harvesting the entire Danish web space. It is
necessary to harvest not only the ".dk" domain but also publications
placed in the ".com" domain. In addition, some material that is available
to us as users is simply not available to a harvester, for instance
streaming content 17) and materials with Flash applications 18).
In order to archive dynamic publications, contracts or agreements had to
be signed with the publishers. We sent proposals for agreements to publishers,
but regrettably only a few responded to these proposals.
We wish to archive a broad range of the types
of material to be collected while minimizing the cost involved. Harvesting
the entire national web space is suitable for this purpose and should
be used to gather net material. But harvesting cannot solve all problems.
It is still necessary to collect selectively for various purposes, and
to use different archiving methods. We hope that in the future it will
be possible to find a solution whereby materials can be freely accessible
from the Internet.
Collecting
and Archiving Web Resources and the National Diet Library by Ms. Machiko
Nakai (Director, Electronic Library Development Office, National Diet Library)

At present, we plan to collect information resources
on the network selectively, not through the legal deposit system. This is
because the amount of digital resources is huge, and it is difficult to
require publishers to convert them into physical format and deposit them
with us.
As the standard for bibliographic description,
we designed the National Diet Library Metadata Element Set in March 2001.
It is based on the Dublin Core Metadata Element Set, with some original
qualifiers added to enable mapping to the JAPAN/MARC format (an illustrative
sketch of such a record follows).
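The actual element set and its qualifiers are not reproduced in this report, so the following is only a rough sketch of what a Dublin Core-based record for a harvested website might look like; the values and the qualifier comments are hypothetical illustrations, not the NDL specification.

```python
# A hypothetical Dublin Core-based record for one harvested website.
# Element names are the standard Dublin Core elements; the values and any
# qualifiers implied in the comments are illustrative only.
record = {
    "dc:title":      "Example Agency Annual Report 2001",
    "dc:creator":    "Example Agency",
    "dc:publisher":  "Example Agency",
    "dc:date":       "2001-12",
    "dc:identifier": "http://www.example.go.jp/report/2001/",  # plus a persistent identifier
    "dc:language":   "jpn",
    "dc:subject":    "Public administration",  # could carry an NDL-specific scheme qualifier
    "dc:format":     "text/html",
}
# Mapping each element to the corresponding JAPAN/MARC field would then be
# driven by a conversion table defined alongside the element set.
```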
Furthermore, we are developing WARP (Web
Archiving Program: provisional name), a total system for the acquisition,
archiving, maintenance, cataloguing, and provision of online resources.
We designed a prototype for metadata entry by March 2000, and are now
developing a prototype of the acquisition function.
The workflow is as follows: selecting the resources
to be acquired; examining the structure of the websites; negotiating and
contracting with publishers for acquisition; specifying the unit of the
information resources to be collected; creating metadata; setting harvesting
and re-harvesting conditions; trimming to remove non-essential parts of
the information; and registering the individual objects. Acquisition is
expected to be carried out mainly by web robot (a rough sketch of the
harvesting conditions appears below).
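WARP's own data structures are not described in this report; the following is only a hypothetical sketch of how harvesting and re-harvesting conditions for one registered resource might be represented. All field names and values here are illustrative assumptions.

```python
# Hypothetical per-resource harvesting conditions (not WARP's actual schema).
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List, Optional

@dataclass
class HarvestConditions:
    start_url: str
    max_depth: int = 3                  # how far the robot may follow links
    reharvest_interval_days: int = 30   # interval between repeated harvests
    exclude_patterns: List[str] = field(default_factory=list)  # parts to trim
    last_harvested: Optional[date] = None

    def due_for_reharvest(self, today: date) -> bool:
        """True if the resource should be harvested again."""
        if self.last_harvested is None:
            return True
        return today - self.last_harvested >= timedelta(days=self.reharvest_interval_days)

conditions = HarvestConditions(
    start_url="http://www.example.go.jp/",
    max_depth=2,
    reharvest_interval_days=14,
    exclude_patterns=["/banner/", "/ads/"],  # trimming of non-essential parts
)
```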
Web resources have the following characteristics:
it is difficult to define the unit of a resource; they are easily updated,
changed and deleted; and they have no hierarchical structure. The task is
how WARP should deal with these characteristics. We are trying to set
standards, such as harvesting conditions and the interval between re-harvests,
through repeated experiments. As for Web-based databases, we plan to
construct a navigation service based on metadata.
In early 2002, the Legal Deposit System Council
will begin to discuss a legal deposit system for online electronic resources,
including the definition of "publishing" and copyright issues.
The deliberations will also be reflected in the WARP project.
Questions
& Answers
After the lectures, a question-and-answer session was held, moderated
by Mr. Hiroyuki Taya (Director, Foreign Materials Division, Acquisitions
Department). The issues raised included the reliability of collected
information, long-term preservation, formats for preservation, and security
management.
An impressive comment was made on the reliability of collected
information: materials on the WWW are short-lived and there is no time to
weigh their reliability, so the first thing to do is to collect and preserve
them before they disappear.
Round-Table
Discussion
The next day, a round-table discussion was held
at the working level on systematic and technical problems. Most participants
shared the opinion that both bulk collection and selective collection are
necessary, and that metadata should be assigned automatically. It was also
suggested that it is important to collaborate with the parties concerned,
such as publishers, at the experimental stage.
Conclusion
This symposium on the ambitious theme, "Preservation
of the Internet", was a success. It is probably the first time that this
theme was publicly taken up in Japan. We hope that you can catch a glimpse
of the future library of the 21st century.
Notes
1) The Internet Archive is a nonprofit organization founded with the
purpose of collecting and preserving information on the Web around the
world. Its vast resources have been made available to the public through
an interface called "The Wayback Machine" since October 2001. For details,
see http://www.archive.org
2)
"Robots" are software programs that collect websites automatically. They
are also called "web robots", "harvesters", "web crawlers", or "mirroring
programs". Once a URL is fixed as the starting point, the robot harvests
the website down to the designated depth, traversing hyperlinks recursively.
Collection by robot is often called "harvesting". A minimal illustration
follows.
3)
"Robots exclusion" is a convention that allows website administrators
who do not want their sites to be collected or registered in search engines
to indicate their wish to exclude robots. The Internet Archive does not
collect sites set up for robots exclusion, while the National Library of
Finland, for example, includes such sites. A small example follows.
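In practice, the exclusion is expressed in a /robots.txt file, and a harvester that honors the convention checks it before fetching. A minimal illustration, assuming a hypothetical site and user agent name, using Python's standard library robot parser:

```python
# A robots.txt such as the following excludes all robots from the whole site:
#
#   User-agent: *
#   Disallow: /
#
# A well-behaved harvester checks the file before collecting anything.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.example.org/robots.txt")
rp.read()

if rp.can_fetch("MyHarvester", "http://www.example.org/some/page.html"):
    print("allowed to harvest this page")
else:
    print("excluded by robots.txt; skip this site")
```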
4)
http://www.rlg.org/preserv/diginews/diginews5-2.html#feature1
5)
The "bulk collection" approach is to harvest information on the web with
a web robot over a wide range, such as the whole web space of a country.
6)
WebArchivist is a research project to facilitate the archiving of
specialized collections of web materials. For details, see http://www.webarchivist.org
7)
A "persistent identifier" is a name assigned to an information resource
to assure permanent access; it remains the same regardless of where the
resource is located. For example, the URN (Uniform Resource Name) is a kind
of persistent identifier designed to overcome the shortcomings of the URL;
an ISSN, for instance, can be expressed as a URN of the form urn:ISSN:1234-5678.
The JP number and the ISSN are also authorized as URN namespaces.
8)http://pandora.nla.gov.au
9)
http://unesdoc.unesco.org/images/0012/001255/125523e.pdf
10)
The official name is the "European Conference on Research and Advanced
Technology for Digital Libraries". In 2001, the conference was held in
Darmstadt, Germany, and web archiving was one of its main subjects.
For details, see http://www.bnf.fr/pages/infopro/dli_ECDL2001.htm
For ECDL 2002, see http://www.ecdl2002.org
11)
Most of the information on the Web is revised frequently, so it is necessary
to re-collect sites regularly.
12)
A document in electronic form is easier to tamper with than one in
conventional paper form, so the problem is how to ensure reliability,
authenticity, and admissibility as evidence.
13)
Dynamic websites such as databases, which create and display their contents
dynamically and therefore cannot be harvested by web robot, are called the
"deep web". We also need to explore methods of collecting the deep web.
14)
"Migration" is the transfer of digital data from one system or format to
another.
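A trivial illustration, under the definition in this note, of migrating one file's content from an older encoding into a newer one while checking that nothing is lost; the file names and encodings are hypothetical.

```python
# Toy migration: carry the same text from a Latin-1 file into a UTF-8 file.
from pathlib import Path

source = Path("report_1997.txt")        # file in the old encoding (hypothetical)
target = Path("report_1997.utf8.txt")   # migrated copy

text = source.read_text(encoding="latin-1")
target.write_text(text, encoding="utf-8")

# Verify that the migrated copy preserves the original content.
assert target.read_text(encoding="utf-8") == text
```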
15)
The official name is the "Networked European Deposit Library". NEDLIB is
a collaborative project of European national deposit libraries; among other
things, it has worked on metadata for long-term preservation and on its
own harvester tool.
For details, see http://www.kb.nl/coop/nedlib/
16)
The official name is "CURL Exemplars in Digital Archives". Cedars is a
digital preservation project promoted by CURL (the Consortium of University
Research Libraries) in the UK together with other organizations.
For details, see http://www.leeds.ac.uk/cedars/index.htm
17)
"Streaming" is a method by which a sound or video file is played back as
it is being sent from the server.
18)
"Flash" is software for creating web content that includes sound, graphics,
and animation.
*
Last access to referenced URLs above: May 10, 2002.
(Digital Library Division, Projects Department,
Kansai-kan of the National Diet Library)