Web Cataloguing Through Cache Exploitation and Steps Toward Consistency Maintenance


Chris Dodge
Computer Centre
The Alfred Wegener Institute for Polar and Marine Research
Am Handelshafen 12
27570 Bremerhaven
Germany
cdodge@awi-bremerhaven.de

Beate Marx
Computer Centre
The Alfred Wegener Institute for Polar and Marine Research
Am Handelshafen 12
27570 Bremerhaven
Germany
bmarx@awi-bremerhaven.de

Hans Pfeiffenberger
Computer Centre
The Alfred Wegener Institute for Polar and Marine Research
Am Handelshafen 12
27570 Bremerhaven
Germany
pfeiff@awi-bremerhaven.de


Abstract
This paper presents a new Web cataloguing strategy based upon the automatic analysis of documents stored in a proxy server cache. This can be an elegant method of Web cataloguing as it creates no extra network load and runs completely automatically. Naturally, such a mechanism will only reach a subset of Web documents, but at an institute such as the Alfred Wegener Institute, where the scientists themselves tend to make quite good search engines, the cache usually contains large numbers of documents related to polar and marine research. Details of a database for polar, marine and global change research, based upon a cache-scanning mechanism, are given, and it is shown that it is becoming an increasingly useful resource.

A problem with any collection of information about Web documents is that it quickly becomes out of date. Strategies have been developed to maintain the consistency of the database with respect to changes on the Web while keeping the network load to a minimum. These have been found to provide a better quality of response and appear to keep the information in the database current. Such strategies are of interest to anyone attempting to create and maintain a Web document location resource.

Keywords
WWW, cataloguing, indexing, database, consistency maintenance, resource location.


Introduction

When browsing the Web, requests are usually sent directly from the browser to the Web server of interest, wherever in the world that may be. If, however, the browser is configured to use a proxy server, then requests are directed through the proxy server as shown in figure 1. The returned document arrives first at the proxy server, which then passes it on to the browser and, if so configured, also caches the document.


Although developed primarily to allow WWW access to users behind a firewall [Luotonen94], a proxy/caching server also helps to reduce network load and speeds up document retrieval with every cache hit. A side effect of running such a server is that the cache fills with a selection of WWW documents which, for those interested in Web cataloguing and indexing, is a good supply of information available without incurring extra network costs. The usefulness of the information held in the cache naturally depends on those browsing the Web, but at a research facility such as the Alfred Wegener Institute (AWI), which undertakes polar and marine research, the cache tends to contain many documents related to this field. Cache analysis, if effective, would provide a very elegant method of cataloguing the Web; it places no extra load on network resources, unlike automatic Web scanning mechanisms such as the WebCrawler [Pinkerton94] and the WWW Worm [McBryan94], and it does not rely upon the cooperation of server administrators or Web authors to create index files, as with ALIWEB [Koster94], or to enter their documents manually into some database.

The main question arising from such a proposed scheme is whether enough documents are collected in the proxy server cache to build a useful catalogue. As an experiment, a cache analysis program has been developed which places information about selected documents from the cache into a database; the database can then be queried by internal or external users via a form on the AWI WWW server. It was hoped that, once started, the database would provide a means of distributing information about discovered Web resources among AWI scientists, encouraging them to search further; the results of these searches would then be found in the cache and, after the next cache scan, also placed in the database. This positive feedback mechanism should lead to good database growth.

One of the main problems with any database of Web resources is that it can quickly become out of date as documents are changed, edited or moved, and servers change name or port number or are even shut down. Strategies have been developed to maintain the consistency of the database with respect to changes on the Web while generating as little extra network load as possible. The implemented algorithm is therefore a compromise between database usefulness and additional data retrieval.


Cache Analysis

Naturally the proxy server cache contains many documents which are not of scientific interest to polar and marine scientists, so some document evaluation is needed. Figure 2 shows diagrammatically the basic steps involved in the cache analysis mechanism.


The first step is to index the HTML documents in the cache, which is done with the help of the ICE indexing package [Neuss94]. This produces a list of all words found in each document and the number of times each occurs (ignoring any images), producing a so-called `natural language index' of the document [Rowley87]. The second step is to feed the ICE output through a simple keyword filter in an attempt to extract documents of interest. The filter simply checks whether certain words appear in the document; if so, the indexing information for the document is placed in the database. A typical entry in the keyword filter is Arctic or Antarctic, which results in every document containing one of these words being placed in the database. Users of the database can add keywords for their particular areas of interest using a form, which has proved to be a fairly successful mechanism for creating a wide-ranging document filter.
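
As an illustration only, the keyword filter step can be pictured as in the following sketch; this is not the ICE-based program used at AWI, the keyword set is an example, and `word_counts' is assumed to be the word/occurrence mapping produced by the indexing step:

    # Illustrative sketch of the keyword filter step; not the ICE-based
    # program used at AWI. word_counts maps each word of a document to the
    # number of times it occurs, as produced by the indexing step.
    FILTER_KEYWORDS = {"arctic", "antarctic"}   # users may add further terms via a form

    def document_is_of_interest(word_counts):
        """Return True if any filter keyword appears in the document."""
        return any(word.lower() in FILTER_KEYWORDS for word in word_counts)

    # A document containing the word `Antarctic' passes the filter, so its
    # indexing information would be written to the database.
    example = {"Antarctic": 4, "krill": 2, "expedition": 1}
    assert document_is_of_interest(example)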

A simple keyword filter mechanism such as this is naturally not very selective, but it was quick to implement and performs the required task satisfactorily. A rudimentary analysis of the document URLs and titles in the database shows that, in the author's judgement, at least 56% of the documents are relevant to polar, marine and global change research, at least 27% are not relevant at all, and the relevance of the remaining 17% could not be determined by this simple assessment. The use of a more intelligent document analysis system is a potential development point.


The Database

When a document is written into the database, what is actually written is the indexing information, that is, nearly every word that appears in the document together with its number of occurrences, as recommended by Pinkerton [Pinkerton94]. The document in its original form is not stored. For clarity, the information held in the database for one document will be referred to as a database entry. For each database entry there exist a number of extra fields containing administrative information. The aim of these fields is to help with database administration, to allow the removal of obsolete database entries where possible and, where this is not possible, to give an indication of the age and accuracy of the entry. These extra fields are introduced below with a short description of the information they contain; their use is more fully explained in the next section.

DBentry-Last-Modified
This is the last-modified date of the cached copy of the document from which this database entry was created.

Last-Checked-Date
The date on which this database entry was last checked. When a new document is found in the cache and used to create a new database entry, the Last-Checked-Date field is set to the current date.

WWWDoc-Last-Modified
The HTTP HEAD request (see below) is used to determine whether a document on the Web has changed since its database entry was made. If the Web document has changed, and the database entry is therefore old, this field is set to the last-modified date of the document on the Web.

Status
This is set to reflect the current status of the database entry. It can take the following values:

   0   Database entry is current.
   1   Database entry is old.
   2   Status Unknown.
   3   Document unreachable, for example WWW server or machine
       is not running.
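
For illustration only, a database entry together with these administrative fields might be pictured as a record of the following shape; this is a sketch in which the field names follow the descriptions above, not the storage format of the actual database:

    # Sketch of the information held for one catalogued document, including
    # the administrative fields described above; not the real storage format.
    from dataclasses import dataclass, field
    from datetime import date
    from typing import Dict, Optional

    @dataclass
    class DatabaseEntry:
        url: str
        word_counts: Dict[str, int]                  # nearly every word with its occurrence count
        dbentry_last_modified: date                  # last-modified date of the cached copy
        last_checked_date: date = field(default_factory=date.today)
        wwwdoc_last_modified: Optional[date] = None  # set once a HEAD shows the Web copy is newer
        status: int = 0                              # 0 current, 1 old, 2 unknown, 3 unreachable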


Database Consistency Maintenance

The Web is dynamic; documents and servers are continually appearing, moving or being removed. Lists, catalogues or indexes can very quickly become out of date if no effort is made to maintain their consistency. Some current techniques, such as that used by ALIWEB [Koster94], rely on manually maintained index lists; others, such as the WWW Worm [McBryan94], delete part of the database and re-read the documents. Ideally, the removal of obsolete information from a database (which we have termed weeding) should be automatic, but should place the minimum possible load on the Web.

In the polar and marine research database, weeding is carried out by a program based on the CERN WWW library (libWWW) [Frystyk94], which uses the HTTP HEAD request to retrieve information about a subset of the documents in the database.

A HEAD request asks the WWW server to return the HTTP header information for the requested document; the document itself is not returned. A typical document head appears as:

    HTTP/1.0 200 OK
    Date: Monday, 13-Feb-95 14:29:21 GMT
    Server: NCSA/1.3
    MIME-version: 1.0
    Content-type: text/html
    Last-modified: Tuesday, 15-Mar-94 15:02:00 GMT
    Content-length: 2958
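
The weeding program itself is built on libWWW; purely as an illustration, a HEAD request could be issued and its Last-modified value read as in the following sketch (the URL in the example is hypothetical):

    # Illustrative sketch of issuing a HEAD request; the actual weeding
    # program uses the CERN libWWW rather than this code.
    import http.client
    from urllib.parse import urlparse

    def fetch_head(url, timeout=30):
        """Issue a HEAD request; return (status code, Last-modified value or None)."""
        parts = urlparse(url)
        conn = http.client.HTTPConnection(parts.hostname, parts.port or 80, timeout=timeout)
        try:
            conn.request("HEAD", parts.path or "/")
            response = conn.getresponse()
            return response.status, response.getheader("Last-Modified")
        finally:
            conn.close()

    # Hypothetical example:
    # status, last_modified = fetch_head("http://www.example.org/polar/report.html")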
 
The weeding program runs once a night and retrieves the heads of the 60 documents with the oldest Last-Checked-Date. With this method it is possible to check whether the database entry for a document is out of date. The simplest course of action would then be to delete any entries from the database that are old and automatically retrieve the whole new document to replace each of these entries. This, however, was considered an inefficient strategy in terms of Web load, given the following assumptions:

1) Editing of a Web document is very unlikely to change its subject matter.

2) Editing of a Web document will usually not significantly change its contents.

Although the second assumption is somewhat weaker than the first, and although we have not yet empirically validated either assumption, both are based on our experience of managing a Web server. We have therefore adopted the following strategy: when a database entry is found to be old, the entry is retained, as it is still considered useful, but it is marked as old so that database users are aware that the information may not be current.

Figure 3 shows in more detail the actions that take place during the weeding operation. When a valid head is returned and its Last-modified field is newer than DBentry-Last-Modified, the status field is set to mark the entry as old (Status = 1). If the Last-modified field is identical to DBentry-Last-Modified, the Web document has not been edited since the database entry was created, so the status remains current (Status = 0).


If no head is returned, libWWW returns an error message instead, which is evaluated by the weeding program. The cause of the problem may be temporary, for example the WWW server machine is down or there has been a network timeout; in this case the status flag is set to reflect this (Status = 3). If the problem is permanent, for example the given URL no longer exists, then the entry is deleted from the database.

Whenever the weeding program retrieves the head of a document, the Last-Checked-Date field of the database entry is updated. In this way, by always checking the entries with the oldest Last-Checked-Date, the daily run of the weeding program gradually works its way through the entire contents of the database.

FTP and Gopher servers do not return head information, so the status of the database entries for these documents is set to `unknown' (Status = 2).
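
Taken together, the decision made for each checked entry can be summarised in the following sketch; this is illustrative only, fetch_head, parse_http_date and delete_entry are assumed helpers, and the status codes are those listed earlier:

    # Sketch of the per-entry weeding decision described above; illustrative only.
    from datetime import date

    STATUS_CURRENT, STATUS_OLD, STATUS_UNKNOWN, STATUS_UNREACHABLE = 0, 1, 2, 3

    def weed_entry(entry, fetch_head, parse_http_date, delete_entry):
        if not entry.url.startswith("http:"):
            # FTP and Gopher servers return no head information.
            entry.status = STATUS_UNKNOWN
        else:
            try:
                code, last_modified = fetch_head(entry.url)
                if code == 404:                       # URL no longer exists
                    delete_entry(entry)               # permanent problem: remove the entry
                    return
                remote = parse_http_date(last_modified)
                if remote > entry.dbentry_last_modified:
                    entry.wwwdoc_last_modified = remote
                    entry.status = STATUS_OLD         # keep the entry, but mark it as old
                else:
                    entry.status = STATUS_CURRENT     # Web document unchanged
            except OSError:
                entry.status = STATUS_UNREACHABLE     # e.g. server down or network timeout
        entry.last_checked_date = date.today()        # advance the check date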

Use of the weeding program alone would lead to an increasing number of database entries with the `old' status, but through constant cache turnover newer versions of documents appear from time to time. When the database feeding program encounters a document in the cache whose database entry is marked as `old', the entry is deleted and the newer version is used to create a replacement entry.
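
A sketch of this replacement step might look as follows; this is illustrative only, `database' is assumed to be a mapping from URLs to entries, and is_of_interest and make_entry are assumed helpers corresponding to the keyword filter and entry creation steps described earlier:

    # Sketch of how the feeding program refreshes entries marked `old';
    # database, is_of_interest and make_entry are assumed helpers.
    STATUS_OLD = 1

    def feed_cached_document(database, url, word_counts, cached_last_modified,
                             is_of_interest, make_entry):
        """Refresh or create the database entry for a document found in the cache."""
        existing = database.get(url)
        if existing is not None and existing.status == STATUS_OLD:
            del database[url]                  # discard the outdated entry
            existing = None
        if existing is None and is_of_interest(word_counts):
            database[url] = make_entry(url, word_counts, cached_last_modified)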

After 177 days of database existence, 59% of the database entries were current, 36% were old, 3% were unknown and 2% were unusable. This means that 98% of the database contents (the current, old and unknown entries) are available for queries. Since its introduction, the weeding program has removed 75 entries from the database (about 4.5% of the current database size); these documents were found either to no longer exist or to reside on servers that no longer exist.


Conclusion

Figure 4 shows the growth in the number of database entries over the first 177 days of the database's existence.

This is with an average of about 30 Web users per day at AWI and a cache size of 60 MBytes, of which between 30 and 40 MBytes are usually in use.

The number of documents is not large compared with other databases, such as the WebCrawler; however, it does constitute a reasonable number of documents in this particular subject area, even if less than 60% of them are relevant. Just after day 90 the graph appears to flatten, which is a result of the introduction of the weeding program. At around day 170 the number of database entries starts rising more steeply, as a result of greater Web browsing activity within the institute.

The institute itself has about 400 scientists, the majority of whom have a computer of some sort, so the current number of active users is small compared with the potential user base. As Web usage gradually gains greater acceptance within the institute, it is anticipated that the number of documents will continue to increase.

This experiment has shown that a useful Web resource can be created simply by using documents collected in the cache of a proxy server. The weeding mechanism ensures that the majority of the database contents remain useable through automatic means that place a minimal load on the network.

Improvements and further work on the database can take place on two main fronts:

An improved document selection mechanism.

As has already been stated, about one third of the database entries are known to be `noise'. A more intelligent document selection program would, however, probably require a disproportionate increase in complexity for the reduction it would bring in the number of false positives.

On the other hand, at present we have no estimate of how many relevant or interesting documents found in the cache fall through the keyword filter. Some experimentation, perhaps with a more systematically generated keyword list or with the use of a thesaurus, may reduce the number of false negatives, that is, the number of interesting documents in the cache that are not selected by the keyword filter.

Combining Different Resource Collection Mechanisms

Currently the regular Web user base at AWI is quite small, so as Web usage continues to develop it is hoped that the size and usefulness of the database will gradually increase.

Nevertheless, it is doubtful whether one type of resource cataloguing mechanism is sufficient when applied to a system as diverse and dynamic as the Web. Combining cache analysis with some other mechanism, such as directed, automatic document retrieval, may create a more complete resource.


References

[Frystyk94]
Frystyk, H., Lie, H. W. Towards a Uniform Library of Common Code. The Second International WWW Conference '94: Mosaic and the Web. http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/DDay/frystyk/LibraryPaper.html

[Koster94]
Koster, M. ALIWEB: Archie-Like Indexing in the Web. Proceedings of the First International World-Wide Web Conference. http://www1.cern.ch/PapersWWW94/aliweb.ps

[Luotonen94]
Luotonen, A., Altis, K. World-Wide Web Proxies. Proceedings of the First International World-Wide Web Conference. http://www1.cern.ch/PapersWWW94/luotonen.ps

[McBryan94]
McBryan, O., GENVL and WWWW: Tools for Taming the Web. http://www1.cern.ch/PapersWWW94/mcbryan.ps

[Neuss94]
Neuss, C., Höfling, S. Lost in Hyperspace? Free Text Searches in the Web. Proceedings of the First International World-Wide Web Conference. http://www1.cern.ch/PapersWWW94/neuss.ps

[Pinkerton94]
Pinkerton, B. Finding What People Want: Experiences with the WebCrawler. The Second International WWW Conference '94: Mosaic and the Web. http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/pinkerton/WebCrawler.html

[Rowley87]
Rowley, J.E. Organising Knowledge. Gower Publishing Company Limited. 1987.