Survey on Comprehensive Collection, Storage, and Archiving of Japanese Web Sites: Outline
1. Introduction
From October 2004 to March 2005, a survey of web data in Japan was conducted for the purpose of studying the feasibility of and methodology for collecting, storing and archiving Japanese web sites. According to the survey, the total amount of web data in Japan as of March 2005 was estimated at 18.4 TB, and the total number of files at 450 million. These results are presented below, along with the results of studies on web archiving requirements.
2. Survey Overview
2.1. Web Crawling Survey
A web crawling survey was carried out to ascertain the amount of Web data in Japan, and how much of it contains metadata usable in archiving management (e.g., title, author name, keywords). The specifications and requirements for this survey are shown in Table 1.
| No. | Category | Item | Specifications/Requirements |
|---|---|---|---|
| 1 | Survey specifications | Length of survey | Approx. 60 days |
| 2 | | Average file collection rate | 6 million files/day |
| 3 | | Surveyed web servers | Web servers in the JP domain and those with IP addresses under JPNIC (Japan Network Information Center) jurisdiction |
| 4 | | Robot exclusion | Use of robots.txt and meta tags for robot exclusion |
| 5 | | Surveyed file types | All |
| 6 | | Extracted links | HTML, JavaScript, PDF, cascading style sheets, etc., excluding links that require access authorization or filling out forms |
| 7 | Crawling specifications | Content management database | Oracle 10g on a cluster of 4 PC servers; 350 million records |
| 8 | | Internet connection | 60 Mbps (max. 100 Mbps) |
| 9 | | Crawler operation | Distributed processing by multiple threads/processes |
| 10 | | Number of crawlers | 13 PC servers |
2.2. Web Crawling Survey Results
2.2.1. Surveyed Data Volume and Total Estimated Data
The time actually spent on the crawling survey, excluding time for handling inquiries about the survey and other procedures, was 23 days. The volume of web data surveyed during that period was 4.9 TB, comprising 120 million files. Taking the overlapping URL discovery rate during the survey period into account, the total web data and number of files in Japan are estimated at approximately 18.4 TB and 450 million files, respectively. These results are shown in Table 2.
| Item | Amount |
|---|---|
| Web data surveyed | 4.9 TB |
| Files surveyed | 120 million files |
| Estimated total Web data | 18.4 TB |
| Estimated total files | 450 million files |
2.2.2. Surveyed Web Data Volume and Files by File Type
Web data volume and number of files obtained in this survey are shown in Table 3, classified by file type.
| Type | Data Volume (GB) | Files | Average File Size (KB) |
|---|---|---|---|
| Static HTML | 332.3 | 29,898,744 | 11.1 |
| Dynamic HTML | 182.5 | 13,812,351 | 13.2 |
| Images | 1,105.9 | 55,128,641 | 20.1 |
| Text | 981.5 | 3,186,376 | 308.0 |
| Moving image | 888.4 | 333,751 | 2,661.9 |
| Sound | 114.1 | 178,399 | 639.6 |
| Data | 859.0 | 590,268 | 1,455.2 |
| Streaming metafiles | 1.5 | 176,254 | 8.8 |
| Other | 438.2 | 18,698,988 | 23.4 |
| Total | 4,903.5 | 122,003,772 | 40.2 |
2.2.3. Number of Hosts per Domain
A tally of unique hosts for each domain, derived from the host name portion of the surveyed URLs, is shown in Figure 1. In the "jp" domain as a whole, some 180,000 hosts were surveyed, of which approximately a third were web hosts in the "co.jp" domain. The survey also covered approximately 130,000 hosts outside the "jp" domain, more than half of which were in the "com" domain.
2.2.4. Web Data Volume and Files per Host
The volume of web data and the number of files on each host were tallied in this survey. The results are shown in Table 4 and Table 5. The data show that 96 % of web hosts have no more than 2,000 files. The results also show that, while few in number, there are web sites storing a very large volume of web data (102 GB at the largest) and web hosts storing very large numbers of files (72,773 files at the most).
| Web Data (MB) | Hosts | Rate | Cumulative Rate |
|---|---|---|---|
| 20 | 282,298 | 0.9082 | 0.9082 |
| 40 | 11,195 | 0.0360 | 0.9443 |
| 60 | 4,804 | 0.0155 | 0.9597 |
| 80 | 2,754 | 0.0089 | 0.9686 |
| 100 | 1,811 | 0.0058 | 0.9744 |
| 120 | 1,294 | 0.0042 | 0.9786 |
| 140 | 945 | 0.0030 | 0.9816 |
| 160 | 735 | 0.0024 | 0.9840 |
| 180 | 609 | 0.0020 | 0.9859 |
| 200 | 460 | 0.0015 | 0.9874 |
| 220 or more | 3,911 | 0.0126 | 1.0000 |
| Total | 310,816 | 1.0000 | --- |
| Files | Hosts | Rate | Cumulative Rate |
|---|---|---|---|
| 1,000 | 284,028 | 0.9138 | 0.9138 |
| 2,000 | 14,556 | 0.0468 | 0.9606 |
| 3,000 | 4,880 | 0.0157 | 0.9763 |
| 4,000 | 2,274 | 0.0073 | 0.9837 |
| 5,000 | 1,250 | 0.0040 | 0.9877 |
| 6,000 | 875 | 0.0028 | 0.9905 |
| 7,000 | 582 | 0.0019 | 0.9924 |
| 8,000 | 423 | 0.0014 | 0.9937 |
| 9,000 | 291 | 0.0009 | 0.9947 |
| 10,000 | 232 | 0.0007 | 0.9954 |
| 11,000 or more | 1,425 | 0.0046 | 1.0000 |
| Total | 310,816 | 1.0000 | --- |
2.2.5. Number and Rate of Metadata Settings
Table 6 summarizes the rates at which metadata such as titles were set. Whereas "title" was set in 95 % of HTML files, "author" was set in only 5 %, "description" in 14 %, and "keywords" in 14 % of the surveyed files. Moreover, metadata based on the Dublin Core Metadata Element Set (e.g., DC.title set in <meta> tags and DC.creator set in <link> tags) was extremely rare (less than 0.3 %). Given such low rates of metadata use, it is not realistic to classify HTML files by metadata alone. A minimal sketch of extracting these items appears after the table.
| No. | Type | Item | Number of files (thousands) | Rate | Total length (MB) | Max. length (bytes) | Avg. length (bytes) |
|---|---|---|---|---|---|---|---|
| 1 | Title | Set in <TITLE> tag or as <META name="title"> | 47,717 | 95.03% | 805 | 5,073 | 17.7 |
| 2 | Author | Set as <META name="author"> | 2,523 | 5.02% | 33 | 592 | 13.9 |
| 3 | Description | Set as <META name="description"> | 7,058 | 14.06% | 389 | 4,958 | 57.8 |
| 4 | Keywords | Set as <META name="keywords"> | 7,064 | 14.07% | 531 | 5,537 | 78.9 |
| 5 | Language | Set as <HTML lang="XXX"> | 8,991 | 17.91% | 19 | 95 | 2.2 |
| 6 | Character encoding | Set in <META> tag or as character code setting in HTTP header | 39,308 | 78.28% | --- | --- | --- |
| 7 | Description | RDF files | 13 | 0.03% | --- | --- | --- |
| 8 | | RSS files | 14 | 0.03% | --- | --- | --- |
| 9 | Cookies | Set-Cookie header | 11,480 | 22.86% | 762 | 3,860 | 69.6 |
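As a rough illustration of how the items in Table 6 could be tallied, the following Python sketch extracts the title, meta elements, and language attribute from a single HTML file using only the standard library; the class and function names (MetadataExtractor, extract_metadata) are illustrative, not part of the surveyed system.

```python
# Minimal sketch: extracting the metadata items tallied in Table 6 from one
# HTML document with Python's standard library. Only the HTML tag and
# attribute names come from the survey; everything else is illustrative.
from html.parser import HTMLParser


class MetadataExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.metadata = {}      # e.g. {"title": ..., "keywords": ...}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "html" and "lang" in attrs:
            self.metadata["language"] = attrs["lang"]
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("title", "author", "description", "keywords",
                        "dc.title", "dc.creator"):
                self.metadata[name] = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.metadata["title"] = self.metadata.get("title", "") + data


def extract_metadata(html_text):
    parser = MetadataExtractor()
    parser.feed(html_text)
    return parser.metadata


if __name__ == "__main__":
    sample = ('<html lang="ja"><head><title>Example</title>'
              '<meta name="keywords" content="archive,web"></head></html>')
    print(extract_metadata(sample))
    # -> {'language': 'ja', 'title': 'Example', 'keywords': 'archive,web'}
```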
3. Web Archiving Requirements
Summarized here are the requirements for web archiving revealed by this survey, with respect to three functions: data collection, storing/archiving, and browsing (viewing).
3.1 Collecting Web Data
3.1.1 Prior Notice and Handling of Inquiries
a. Putting up a web page describing the data gathering methods and conditions, and maintaining a FAQ
It is especially important to indicate the means by which data providers can refuse collection (robot exclusion). It is also necessary to publicize the means of specifying the types of usage restrictions to apply to archived information.
b. Deciding web sites for priority collection and asking for cooperation
After determining the web sites (and domains) for priority collection, the number of files and the data volume on those sites must be confirmed in advance, and the conditions needed to complete collection within the predetermined time frame must be worked out. Permission from site managers for high-frequency collection, or their agreement to submit files on removable media, may be necessary.
c. Deciding conditions for data gathering from general sites (request intervals, maximum transfer rate, etc.)
Web data must be collected without placing excessive load on the sites. Judging by the inquiries and comments received from site managers and others during this survey, the collection conditions used here (requests sent at 30-second intervals, transfer rates capped at 1 Mbps, and file sizes limited to 60 MB) can serve as rough guidelines for appropriate conditions.
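As a rough illustration of these conditions, the following Python sketch spaces requests to the same host 30 seconds apart, approximates a 1 Mbps transfer cap, and stops reading a file at 60 MB. The function and variable names are illustrative assumptions, not part of the survey system.

```python
# Minimal sketch of the per-site collection conditions quoted above:
# 30-second request intervals, a rough 1 Mbps transfer cap, and a 60 MB
# per-file limit. Names are illustrative only.
import time
import urllib.parse
import urllib.request

REQUEST_INTERVAL = 30                  # seconds between requests to one host
MAX_FILE_SIZE = 60 * 1024 * 1024       # 60 MB per-file limit
MAX_RATE = 1_000_000 / 8               # ~1 Mbps, expressed in bytes/second

_last_request = {}                     # host -> time of the previous request


def polite_fetch(url):
    host = urllib.parse.urlsplit(url).netloc

    # Enforce the 30-second interval per host.
    elapsed = time.time() - _last_request.get(host, 0.0)
    if elapsed < REQUEST_INTERVAL:
        time.sleep(REQUEST_INTERVAL - elapsed)
    _last_request[host] = time.time()

    body = b""
    with urllib.request.urlopen(url) as response:
        while len(body) < MAX_FILE_SIZE:
            start = time.time()
            chunk = response.read(64 * 1024)
            if not chunk:
                break
            body += chunk
            # Crude rate cap: sleep so the average stays near 1 Mbps.
            expected = len(chunk) / MAX_RATE
            spent = time.time() - start
            if spent < expected:
                time.sleep(expected - spent)
    return body
```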
d. Seeking cooperation of the Internet service providers, etc.
Providers must be asked to report content that they have re-edited or deleted at their own discretion for human-rights protection or other reasons, so that the same actions can be taken on the corresponding files in the web archive.
e. Collecting starting-point URLs
Before web data gathering begins, a sufficient number of URLs must be collected to serve as starting points for crawling. Ideally, starting-point URLs should continue to be collected on an ongoing basis and be sorted to enable efficient data gathering.
3.1.2 Collection Scope and Volume of Gathered Data
a. Collection scope
Initially the scope is the same as for this survey. However, it will be reviewed periodically, based on trends in protocol and data type use.
b. Estimated data volume to be collected
During FY2006, approximately 540 million files shall be collected, totaling 22 TB. If current trends hold, the volume is expected to grow by 20 to 30 percent annually after that, and it will need to be reviewed each year.
3.1.3 Collection Conditions
a. Request intervals and transfer rate
For consecutive accesses to sites from which advance permission has not been received, pauses of 30 seconds will be inserted between requests, and the transfer rate will be capped at 1 Mbps. If collection from a site cannot be completed within the period under these conditions, only partial collection will be made. The number of files expected to be collected, while dependent on the file size distribution of a site, averages about 60,000 files per site over a 60-day collection period.
b. Per-site collection conditions
Collection condition settings must be changeable for each site in order to allow for high-frequency/high-speed transfer at sites for priority collection, and to allow for specifying the dates and times of collection at certain sites.
c. Observance of robot exclusion settings
The user agent names of all crawlers used by the National Diet Library, such as that of WARP (user agent name: ndl-japan-warp-0.1), must be made recognizable to site managers. Proprietary extensions to the robot exclusion specification, such as those supported by the crawlers of popular search services like Google and Yahoo!, must also be supported.
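A minimal sketch of checking robot exclusion before collection is shown below, assuming the WARP user agent name quoted above; Python's standard urllib.robotparser also understands the non-standard Crawl-delay extension used by some sites. The function name and example URLs are illustrative.

```python
# Minimal sketch: checking robots.txt (including the non-standard
# Crawl-delay extension) before fetching a URL. In-page robot exclusion via
# <META name="robots"> must be checked separately after download.
import urllib.robotparser

USER_AGENT = "ndl-japan-warp-0.1"


def allowed_to_fetch(url, robots_url):
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(robots_url)
    parser.read()                            # download and parse robots.txt
    allowed = parser.can_fetch(USER_AGENT, url)
    delay = parser.crawl_delay(USER_AGENT)   # None if no Crawl-delay is given
    return allowed, delay


if __name__ == "__main__":
    ok, delay = allowed_to_fetch("https://example.jp/page.html",
                                 "https://example.jp/robots.txt")
    print(ok, delay)
```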
3.1.4 Collection Performance
a. General sites
Using 60 days per year as the collection period, the required collection performance for FY2006 is approximately 10 million URLs per day on average. System scalability must be ensured so that performance can be increased by at least 20 to 30 percent each year.
b. Priority sites
A shorter request interval is necessary for web sites with large numbers of files. For example, a 60-day collection period contains roughly 5.2 million seconds, so collecting a site with 300,000 files within it leaves at most about 17 seconds per file including transfer time; in practice the request interval must be around 10 seconds. Moreover, for sites requiring high link consistency to be maintained, the ability to make one or more requests per second is necessary.
3.1.5 Collection Functions
a. Link extraction
The challenge is to achieve high link extraction accuracy, especially when links are created dynamically in the browser, for example with JavaScript. Technically, however, raising this accuracy is difficult and requires further study. In some cases, URLs are generated from data entered by the user; a crawler used for collecting websites cannot generate such URLs.
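For reference, the following sketch shows the static side of link extraction (href/src attributes in HTML, resolved against the page URL); links generated at run time by JavaScript, as discussed above, cannot be recovered this way. The class and function names are illustrative.

```python
# Minimal sketch of static link extraction from HTML with the standard
# library. Dynamically generated links (JavaScript, user input) are out of
# reach of this approach.
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    LINK_ATTRS = {"a": "href", "link": "href", "img": "src",
                  "script": "src", "frame": "src", "iframe": "src"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        wanted = self.LINK_ATTRS.get(tag)
        if wanted:
            for name, value in attrs:
                if name == wanted and value:
                    # Resolve relative links against the page URL.
                    self.links.add(urljoin(self.base_url, value))


def extract_links(html_text, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html_text)
    return parser.links
```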
b. Assuring security
Care must be taken not to access dangerous links, such as links that overwrite pages, links that are generated almost endlessly (as in shared calendars), or links to malicious attack scripts. This will require ongoing technical development, including efforts to improve security by analyzing content along with link information.
3.1.6 User Confirmation of Collection Status
When there is a time lag between collection and making the collected sites available, a function must be provided enabling users to confirm whether a website has been collected or not by designating a URL.
3.2. Storing and Archiving Web Data
Storing and archiving of Web data are outlined here with regard to the information to be archived, the archiving format, and handling of requests to delete archived data.
3.2.1 Information to Be Archived
a. Content-related information
Collected files will vary depending on the conditions at the time of a request, such as the crawler's IP address and request information. It will therefore be necessary to archive not only the collected files but also the related information such as data exchanged at the TCP/IP layer and above, and the date and time of collection.
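A minimal sketch of such a record is shown below: the response body is stored together with the response headers, status, and the date and time of collection. The field and function names are illustrative assumptions, not a prescribed archive schema.

```python
# Minimal sketch of archiving a collected file together with its related
# information (response headers, status, collection date and time).
import datetime
import urllib.request
from dataclasses import dataclass, field


@dataclass
class ArchivedRecord:
    url: str
    fetched_at: str                               # collection date/time (UTC)
    status: int
    response_headers: dict = field(default_factory=dict)
    body: bytes = b""


def fetch_record(url):
    with urllib.request.urlopen(url) as response:
        return ArchivedRecord(
            url=url,
            fetched_at=datetime.datetime.utcnow().isoformat() + "Z",
            status=response.status,
            response_headers=dict(response.headers.items()),
            body=response.read(),
        )
```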
b. Reason for exclusion
When content is excluded from automated collection by a META tag in a page, the page itself is not retained, so the META tag information is likewise not retained. The reason for exclusion must therefore be saved separately, including in cases where collection is refused in writing or by some other means.
c. Log of deletions and edits
A log of deletions and edits made after collection will be kept.
d. Information indicating usage restrictions
When the viewing of archived information is restricted according to the conditions presented by the copyright holders and others, the information indicating such restrictions must be saved.
3.2.2 Archiving Format
Since the archiving format should be based on open standards, the ARC format and its extensions are preferable. However, there is a possibility that existing tools (open-source browsing software, etc.) will become obsolete. The format, along with the collection coverage, should therefore be decided after careful consideration.
Details of the proper format should be worked out after a survey of the formats used in similar systems such as WARP and the Internet Archive.
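As a rough sketch only, the following writes one ARC-style (version 1) record using the record line documented by the Internet Archive (URL, IP address, archive date, content type, length); the exact format and its extensions should be confirmed in the format survey recommended above, and the function name and file name are illustrative.

```python
# Rough sketch of writing a single ARC-style (version 1) record; the record
# line format is an assumption based on the Internet Archive's documentation
# and should be verified before implementation.
import datetime


def write_arc_record(out, url, ip_address, content_type, body):
    # ARC v1 archive dates use the form YYYYMMDDhhmmss.
    archive_date = datetime.datetime.utcnow().strftime("%Y%m%d%H%M%S")
    header = " ".join([url, ip_address, archive_date,
                       content_type, str(len(body))])
    out.write(header.encode("utf-8") + b"\n")
    out.write(body)
    out.write(b"\n")                  # records are separated by a blank line


if __name__ == "__main__":
    with open("sample.arc", "wb") as f:
        write_arc_record(f, "http://example.jp/", "192.0.2.1",
                         "text/html", b"<html>...</html>")
```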
3.2.3 Handling of Requests to Delete or Edit the Archived Data
a. Method of request acceptance
If possible, requests should be accepted automatically, for example through web forms. Identity can be verified by checking whether a robot exclusion setting has been made: a requester who is able to set robot exclusion can reasonably be recognized as the site manager. The reception process can therefore be automated by having the requester make this robot exclusion setting as part of the web form submission and deleting the material once the setting is confirmed.
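A minimal sketch of this automated check is shown below, assuming the WARP user agent name mentioned in section 3.1.3 c: the request is accepted once robots.txt on the requester's site is confirmed to exclude the archive's crawler from the URL in question. The function name is illustrative.

```python
# Minimal sketch of verifying a deletion request by checking that the
# requester's site now excludes the archive's crawler from the given URL.
import urllib.parse
import urllib.robotparser

USER_AGENT = "ndl-japan-warp-0.1"


def deletion_request_verified(requested_url):
    parts = urllib.parse.urlsplit(requested_url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(robots_url)
    parser.read()
    # If the crawler is now excluded from this URL, the requester evidently
    # controls the site, so the deletion request can be honoured.
    return not parser.can_fetch(USER_AGENT, requested_url)
```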
b. Units of deletion and editing
Data archived in ARC format will be deleted or edited in file units. In some cases, however, editing may not be possible, for example when it is prohibited within document files and the like.
3.3. Browsing Archived Web Data
Browsing of archived data is outlined here, focusing on the important browsing functions specific to the web archive.
3.3.1 Browsing Functions
a. URL rewriting function
Links in collected files need to be rewritten so that they point to locations inside the web archive site; all links must therefore be extracted and rewritten. The problems of link extraction in this case are the same as those described in section 3.1.5 a. For PDF and other document files where rewriting links would ruin the layout, or where settings prevent editing, links will not be rewritten.
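A minimal sketch of this rewriting for HTML is shown below; the archive URL scheme (a base path plus the collection timestamp) and all names are assumptions for illustration only.

```python
# Minimal sketch: rewriting href/src links in archived HTML so that they
# point inside the web archive site. The archive base URL and path scheme
# are illustrative placeholders.
import re
from urllib.parse import urljoin

ARCHIVE_BASE = "https://warp.example.ndl.go.jp/archive"   # illustrative

HREF_SRC = re.compile(r'(href|src)\s*=\s*"([^"]+)"', re.IGNORECASE)


def rewrite_links(html_text, page_url, timestamp):
    def replace(match):
        attr, link = match.group(1), match.group(2)
        absolute = urljoin(page_url, link)        # resolve relative links
        return f'{attr}="{ARCHIVE_BASE}/{timestamp}/{absolute}"'

    return HREF_SRC.sub(replace, html_text)


# rewrite_links('<a href="/about.html">', "http://example.jp/", "20050301")
# -> '<a href="https://warp.example.ndl.go.jp/archive/20050301/http://example.jp/about.html">'
```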
b. Access control and file provision restriction functions
To handle cases in which viewing is possible only inside the Library, or in which archiving is allowed but viewing is not, usage restrictions must be settable at the file level. Moreover, since the provision of content may be restricted by file type, it must be possible to designate different conditions for different file types and domains (hosts).
c. Navigation functions
In addition to the requirements applying to ordinary website navigation, archived websites require measures to prevent viewers from being led outside the archive by links without realizing it. However, the web archive system cannot control links that are generated while browsing (e.g., by JavaScript), and the URL ends up being whatever is generated at the time of browsing. As a result, a viewer may stray outside the archive without noticing. Further study is needed to solve this problem.
3.3.2 Search Functions
Basically, archived sites will be searchable by designating the time of collection and the URL. Search by metadata such as title, author, and keywords is possible only for HTML files in which these are described. Since metadata other than the title are described in only about 10 percent of web pages, search coverage will be low. It will therefore be necessary to consider providing full-text search, or providing classification information generated with automated classification support based on advanced content analysis.
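A minimal sketch of the basic URL-and-time search is shown below: given a URL and a desired collection time, the snapshot taken at or before that time is returned. The in-memory index is an illustrative stand-in for the real catalogue.

```python
# Minimal sketch: looking up the archived snapshot of a URL closest to a
# requested collection time. The index contents are illustrative.
import bisect

# url -> sorted list of collection timestamps ("YYYYMMDDhhmmss")
snapshot_index = {
    "http://example.jp/": ["20041015093000", "20050301120000"],
}


def find_snapshot(url, requested_time):
    timestamps = snapshot_index.get(url, [])
    if not timestamps:
        return None
    # Pick the latest snapshot taken at or before the requested time,
    # falling back to the earliest snapshot otherwise.
    pos = bisect.bisect_right(timestamps, requested_time)
    return timestamps[pos - 1] if pos > 0 else timestamps[0]


# find_snapshot("http://example.jp/", "20050101000000") -> "20041015093000"
```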
3.3.3 Handling Requests for Usage Restrictions
As described in section 3.2.3, Handling of Requests to Delete or Edit the Archived Data, the automation of this process needs to be considered.
3.3.4 Interface for Linking with Other Systems
The value of the web archive will be enhanced in various ways by linking up with portal sites and other digital archive systems, and by providing means for automatic access from users' computers. It is recommended to develop standard interfaces used in web services, as well as interfaces based on metadata harvesting protocols such as OAI-PMH.
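As a rough illustration, the following sketch issues an OAI-PMH ListRecords request with the oai_dc metadata prefix, which is how a harvester would pull archive metadata over such an interface; the endpoint URL is a placeholder, not an actual service.

```python
# Minimal sketch of harvesting metadata over OAI-PMH with a ListRecords
# request. The endpoint URL is an illustrative placeholder.
import urllib.parse
import urllib.request

OAI_ENDPOINT = "https://warp.example.ndl.go.jp/oai"   # illustrative


def list_records(metadata_prefix="oai_dc", resumption_token=None):
    params = {"verb": "ListRecords"}
    if resumption_token:
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = metadata_prefix
    url = OAI_ENDPOINT + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")        # OAI-PMH XML response
```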
