National Diet Library Newsletter
No. 160, April 2008
Lecture meeting and discussion
“Present and future prospects of web archiving
- for international partnerships”
This is an abridged translation of the article of the same title
in the NDL Monthly Bulletin No. 565 (April 2008).

On January 23, 2008, a lecture meeting and discussion titled “Present and future prospects of web archiving – for international partnerships” was held in the National Diet Library (NDL).
Today, while vast amounts of useful digital information are distributed rapidly, the difficulty of long-term preservation and the instability of such information are becoming issues. In particular, web information is ephemeral, and it is said that the average life span of a web site is only 44 to 75 days.
In this lecture meeting, experts who are actively and internationally involved in this issue were invited to talk about digital archiving, focusing on web archiving including its significance and the latest international trends.
Lecture by Mr. Julien Masanès, Director, European Archive

Building an archive of the Internet: Challenges ahead of us
Unlike paper publications, anyone can publish his/her work on the Internet. It is said that the amount of information on the Web is a million times what is published in a year in the world. As libraries are leaders and administrators in information communities, they should archive web information.
Web archiving has its limits, however. Taking customized sites for instance, we can capture only a single moment of dynamically created pages. In addition, it is hard to archive the deep Web, and we can capture the surface Web only. These are some of the challenges in web archiving.
Accessibility of online content
The appearance of the Web enabled us to search a variety of content via search engines. Now we have two new sources of content on the Web. One is web archives, which now preserve more data than is on the existing Web. The other is massive digitization projects.
However, these two new resources include archives which are not supposed to be open to the public and exclusive digitized collections which block search engines. These run the risk of destroying the openness of web archives. The European Commission is currently addressing the need for public domain content to remain in the public domain while it is being digitized. The Web spread because of its openness. We must not lose this advantage.
Examples of projects to move forward
Currently in Europe, there is a movement to establish a shared platform for web archives. The purpose is to remove the technical difficulties of web archiving for repository institutions.
Web content is distributed beyond national borders. International coordination and cooperation are therefore needed to build web archives that can be used mutually. In the future, technologies to link archives in various countries will be needed, for example, one which makes use of dispersed indexes.
Lecture by Ms. Kris Carpenter, Director, Web Archive, Internet Archive

The Internet Archive (IA)
The Internet Archive (IA) is a non-profit organization founded in 1996 by Dr. Brewster Kahle, as an ‘Internet library’, to provide universal and permanent access to digital information for the people.
The IA contains approximately three petabytes (three million gigabytes) of data, including web pages, films and videos, music and spoken word (audio books), books and texts, software, and images.
The International Internet Preservation Consortium (IIPC)
The International Internet Preservation Consortium (IIPC) is a group of thirty-five or more libraries, archives, cultural institutions, and research centers from around the globe that are helping to collect and preserve a rich body of Internet content. It was founded by the IA and eleven national libraries.
We think that the NDL’s participation in the IIPC from this year is very important because it can take an administrative role. We think the NDL has the ideal conditions for taking such a role because the number of web pages in the JP domain is large and the NDL has already launched web archiving via WARP.
Web Archiving
Web archiving is the process of collecting, storing, preserving, and ensuring access to resources that are published online. Why do we need it? Because the Web is part of our society and culture; in other words, it is ourselves. Unlike analog content, web content that is here today may be gone tomorrow.
I will tell you why partnership is needed for web archiving. Selecting collections to archive, crawling, monitoring the crawls, managing collections, mining and analyzing data sets, ensuring access: we cannot address those tasks alone. Sharing the burden among institutions can lead to greater results. We need to work closely together to solve future issues such as harvesting the deep Web and coping with new web technologies.
I would like to talk about what we have learned from our projects. First of all, users count. What is not visible to your users is not important to them either. We need to spread information actively to justify the cost, time and resources spent on web archiving, and to obtain the budget needed to continue archiving. Results also matter. Just building up a collection is not sufficient. The quality of the collection has to be maintained. Active feedback is also important. Do not wait to get started. What is there today might be gone tomorrow. Partnerships can multiply your efforts, resources and results, and let you share knowledge and tools.
Lecture by Dr. Masaru Kitsuregawa, Professor, Institute of Industrial Science, University of Tokyo

Prior to the panel discussion, Dr. Kitsuregawa introduced his research “Socio Sense.”
Things that happen in the real world are immediately reflected in the Web. We can say that the Web is a device for perceiving society; in short, a sensor for society. Socio Sense is a study which uses archived web data, and I use data captured since 1999.
We can get an overview of the Web space by extracting and analyzing links. For instance, on the websites of banks which went through a merger, we can see that links to the merger partner tend to increase over time before the merger is announced. We can also see changes in young people’s language and the usage of new words. Thus web archiving enables us to have a time-series panoramic view of society.
Search for web archives should be able to trace the development of web pages from past to present, which is impossible for existing search engines. Beyond just archiving, it is also important to capture dynamically updated pages and content over and over again and to develop technologies for analyzing the data.
Panel discussion

In the panel discussion with Mr. Masanès, Ms. Carpenter and Dr. Kitsuregawa, the following issues were discussed.
The importance of web archiving in Asia
It is said that the web content of China, Japan and South Korea accounts for one third of the world’s web content, so it is important to conduct web archiving in Asia. As the principles of processing these languages are similar, we expect that these countries will hold talks to ensure connectivity and mutual search.
Challenges in continuing web archiving
Full automation of crawling is impossible. Human skills are necessary for monitoring and responding to claims.
Google and IBM seem to regard cooperation with archives as worthwhile for public relations. It is also important for Asian countries to gain funding by approaching the private sector.
Efforts in universities and libraries
Coordination and cooperation are needed to succeed in web archiving. In the United States, there is a program called the National Digital Information Infrastructure and Preservation Program (NDIIPP), led by the Library of Congress.
We can see a sea change in the Web, for instance, the explosive increase in highly interactive content, including so-called Web 2.0 content and video content. In order to cope with these new technologies, budget and support are essential.
The European Union (EU) is being funded by the EC (European Commission) Framework 7 Programme (FP7). Six million euros were provided for a three-year plan to cope with spam and traps, and to conduct research on the time series variation of the Web.
Web archiving has a mission to conserve human cultural heritage. We should provide a wealth of information as an open library and promote research while ensuring transparency of national funding.
Copyright issues
In web archiving, it is important to cooperate with web administrators. We cannot avoid copyright issues in web archiving. However, digital information is lost unless we archive it. Strengthening copyright protection leads to a great loss of web content, which is cultural heritage.
* Slides of lectures are available here.
