cdodge@awi-bremerhaven.de
Beate Marx
bmarx@awi-bremerhaven.de
Hans Pfeiffenberger
pfeiff@awi-bremerhaven.de
A problem with any collection of information about Web documents is that it quickly becomes outdated. Strategies have been developed to maintain the consistency of the database with respect to changes on the Web while keeping the additional network load to a minimum. These strategies have been found to improve the quality of responses and appear to keep the information in the database current. They should be of interest to anyone attempting to create and maintain a Web document location resource.
Although developed primarily to allow WWW access to users behind a firewall [Luotonen94], a proxy/caching server also helps to reduce network load and increases document retrieval speed with every cache hit. A side effect of running such a server is that the cache fills with a selection of WWW documents which, for those interested in Web cataloguing and indexing, can be a good supply of information available without incurring extra network costs. The usefulness of the information held in the cache naturally depends on the browsing habits of the users, but at a research facility such as the Alfred Wegener Institute (AWI), which undertakes polar and marine research, the cache tends to contain many documents related to this field. Cache analysis, if effective, would provide a very elegant method of cataloguing the Web: it places no extra load on network resources, unlike automatic Web-scanning mechanisms such as the Web Crawler [Pinkerton94] and the WWW Worm [McBryan94], and it does not rely on the cooperation of server administrators or Web authors to create index files, as with ALIWEB [Koster94], or to enter their documents manually into some database.
The main question arising from such a scheme is whether enough documents are collected in the proxy server cache to build a useful catalogue. As an experiment, a cache analysis program has been developed which places information about selected documents from the cache into a database, which can then be queried by internal or external users through a form on the AWI WWW server. It was hoped that, once started, the database would provide a means of distributing information about discovered Web resources among AWI scientists, encouraging them to search further; the results of these searches would then be found in the cache and, after the next cache scan, also be placed in the database. This positive feedback mechanism would then produce good database growth.
One of the main problems with any database of Web resources is that it can quickly become out of date as documents are changed, edited or moved, and as servers change name or port number or are shut down altogether. Strategies have been developed to maintain the consistency of the database with respect to changes on the Web; however, these should generate as little extra network load as possible. The implemented algorithm is therefore a compromise between database usefulness and the amount of additional data retrieval.
Cache Analysis

Naturally the proxy server cache contains many documents which are not of scientific interest to polar and marine scientists, so some document evaluation is needed. Figure 2 shows diagrammatically the basic steps involved in the cache analysis mechanism.

The first step is to index the HTML documents in the cache, which is done with the help of the ICE indexing package [Neuss94]. This produces a list of all words found in each document and the number of times each occurs (ignoring any images), producing a so-called `natural language index' of the document [Rowley87]. The second step is to feed the ICE output through a simple keyword filter in an attempt to extract documents of interest. The filter simply checks whether certain words appear in the document and, if so, the indexing information for the document is placed in the database. A typical entry in the keyword filter is Arctic or Antarctic, which results in every document containing one of these words being placed in the database. Users of the database can add keywords for their particular areas of interest using a form, which has proved to be a fairly successful mechanism for creating a wide-ranging document filter.

A simple keyword filter such as this is naturally not very selective, but it was quick to implement and performs the required task satisfactorily. In a rudimentary analysis of the document URLs and titles in the database, the author judged at least 56% of the documents to be relevant to polar, marine and global change research and at least 27% to be not relevant at all; the relevance of the remaining 17% could not be determined by this simple assessment. The use of a more intelligent document analysis system is a potential development point.
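As an illustration of the filtering step, the following minimal sketch (in Python) assumes that the ICE output for each cached document has already been parsed into a mapping of words to occurrence counts; the function names and the exact contents of the keyword list are illustrative rather than details of the actual implementation.

# Minimal sketch of the keyword-filter step (hypothetical names).
FILTER_KEYWORDS = {"arctic", "antarctic"}          # user-extensible via the Web form

def document_matches(word_counts: dict[str, int]) -> bool:
    """Return True if any filter keyword occurs in the document's word index."""
    return any(word.lower() in FILTER_KEYWORDS for word in word_counts)

def feed_database(cache_index: dict[str, dict[str, int]],
                  database: dict[str, dict[str, int]]) -> None:
    """Place the indexing information of matching cache documents into the database."""
    for url, word_counts in cache_index.items():
        if document_matches(word_counts):
            database[url] = word_counts            # the index is stored, not the document itself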
The Database
When a document is written into the database, what is actually stored is the indexing information, that is, nearly every word that appears in the document together with its number of occurrences, as recommended by Pinkerton [Pinkerton94]. The document itself is not stored in its original form. For clarity, the information held in the database for one document will be referred to as a database entry. Each database entry carries a number of extra fields containing administrative information. These fields help with database administration: they allow obsolete database entries to be removed where possible and, where not, give an indication of the age and accuracy of an entry. These extra fields are introduced below, with a short description of the information they contain; their use is explained more fully in the next section.
Status. The current state of the database entry:
  0  Database entry is current.
  1  Database entry is old.
  2  Status unknown.
  3  Document unreachable, for example the WWW server or its host machine is not running.

Last-Checked-Date. The date on which the weeding program last checked the entry (see the next section).

DBentry-Last-Modified. The Last-modified date reported for the document when the database entry was created.
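For concreteness, a database entry might be represented as sketched below; the field set and names are assumptions based on the fields mentioned in this paper, not the actual record layout.

from dataclasses import dataclass
from datetime import datetime
from enum import IntEnum

class Status(IntEnum):
    CURRENT = 0       # database entry is current
    OLD = 1           # document has been edited since the entry was created
    UNKNOWN = 2       # no status information available (e.g. FTP or Gopher documents)
    UNREACHABLE = 3   # WWW server or machine is not running

@dataclass
class DatabaseEntry:
    url: str
    word_counts: dict[str, int]        # the natural-language index of the document
    dbentry_last_modified: datetime    # Last-modified reported when the entry was created
    last_checked_date: datetime        # when the weeding program last examined this entry
    status: Status = Status.CURRENT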
Database Consistency Maintenance
The Web is dynamic; documents and servers are continually appearing, moving or being removed. Lists, catalogues and indexes can very quickly become out of date if no method is employed to maintain their consistency. Some current techniques, such as that used by ALIWEB [Koster94], rely on manually maintained index lists, or on deleting part of the database and re-reading the documents, as with the WWW Worm [McBryan94]. Ideally, the removal of obsolete information from a database (which we have termed weeding) should be automatic, but should place the minimum possible load on the Web.
In the polar and marine research database, weeding is carried out by a program based on the CERN WWW library (libWWW) [Frystyk94], which uses the HEAD request to retrieve information on a subset of documents within the database.
A HEAD request asks the WWW server to return the HTTP header information for the requested document; the document itself is not returned. A typical document head appears as:
HTTP/1.0 200 OK
Date: Monday, 13-Feb-95 14:29:21 GMT
Server: NCSA/1.3
MIME-version: 1.0
Content-type: text/html
Last-modified: Tuesday, 15-Mar-94 15:02:00 GMT
Content-length: 2958

The weeding program runs once a night and retrieves the heads of the 60 documents with the oldest Last-Checked-Date. With this method it is possible to check whether the database entry for a document is out of date or not. The simplest course of action would then be to delete any old entries from the database and automatically retrieve each whole document again to replace the corresponding database entry. This, however, was considered to be an inefficient strategy in terms of Web load when working with the following assumptions:
1) Editing of a Web document is very unlikely to change its subject matter.
2) Editing of a Web document will usually not significantly change its contents.
Although the second assumption is somewhat weaker than the first, and although we have not yet empirically validated (or refuted) either assumption, they are based on our experience of managing a Web server. We have therefore adopted the following strategy: when a database entry is found to be old, we retain the entry, as it is still considered useful, but mark it as old so that database users are aware that the information may not be current.
Figure 3 shows in more detail the actions that take place during the weeding operation. When a valid head is returned and its Last-modified field is newer than the DBentry-Last-Modified field, the status field is set to mark the entry as old (Status = 1). If the head's Last-modified field is identical to the DBentry-Last-Modified field, the Web document has not been edited since the database entry was created, and the status remains current (Status = 0).
If no head is returned, libWWW returns an error message instead, which is evaluated by the weeding program. The cause of the problem may be temporary, for example the WWW server machine is down or there has been a network timeout; in this case the status flag is set to reflect this (Status = 3). If the problem is permanent, for example the given URL no longer exists, the entry is deleted from the database.
Whenever the weeding program retrieves the head of a document, the Last-Checked-Date field of the database entry is updated. In this way, the daily runs of the weeding program gradually work through the entire contents of the database, always checking the entries with the oldest Last-Checked-Date first.
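A minimal sketch of one nightly weeding pass, reusing the DatabaseEntry and Status types sketched earlier, might look as follows. The use of Python's urllib, the 30-second timeout, the treatment of HTTP 404/410 as permanent errors and the assumption that dates are stored as timezone-aware datetimes are illustrative choices, not details of the libWWW-based implementation.

import urllib.request
import urllib.error
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

ENTRIES_PER_NIGHT = 60   # heads retrieved per run, as described above

def weed(database: dict[str, DatabaseEntry]) -> None:
    """One nightly pass over the entries with the oldest Last-Checked-Date."""
    oldest_first = sorted(database.values(), key=lambda e: e.last_checked_date)
    for entry in oldest_first[:ENTRIES_PER_NIGHT]:
        try:
            head = urllib.request.urlopen(
                urllib.request.Request(entry.url, method="HEAD"), timeout=30)
            last_modified = head.headers.get("Last-Modified")
            if last_modified is None:
                entry.status = Status.UNKNOWN          # no modification date available
            elif parsedate_to_datetime(last_modified) > entry.dbentry_last_modified:
                entry.status = Status.OLD              # document edited since entry was created
            else:
                entry.status = Status.CURRENT          # entry still reflects the document
        except urllib.error.HTTPError as error:
            if error.code in (404, 410):               # permanent problem: URL no longer exists
                del database[entry.url]
                continue
            entry.status = Status.UNREACHABLE          # other server-side problem
        except OSError:                                # temporary problem: host down, timeout
            entry.status = Status.UNREACHABLE
        entry.last_checked_date = datetime.now(timezone.utc)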
FTP and Gopher servers do not return head information, so the status of the database entries for these documents is set to `unknown' (Status = 2).
Use of the weeding program alone would lead to an increasing number of database entries with the `old' status, but through constant cache turnover newer versions of documents appear from time to time. When the database feeding program encounters a document in the cache whose database entry is marked as `old', the entry is deleted and the newer version is used to create a replacement database entry.
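Continuing the earlier sketches, this feedback between cache and database could look as follows, assuming (hypothetically) that the cache scan also yields the Last-modified date of each cached copy:

def refresh_from_cache(cache_index: dict[str, tuple[dict[str, int], datetime]],
                       database: dict[str, DatabaseEntry]) -> None:
    """Replace `old' database entries when a newer copy of the document is in the cache."""
    for url, (word_counts, last_modified) in cache_index.items():
        entry = database.get(url)
        if entry is not None and entry.status == Status.OLD:
            # Delete the outdated entry and rebuild it from the newer cached copy.
            database[url] = DatabaseEntry(url, word_counts, last_modified,
                                          datetime.now(timezone.utc), Status.CURRENT)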
After 177 days of database existence, 59% of the database entries were current, 36% were old, 3% unknown and 2% unusable. This means that 98% of the database contents (the current, old and unknown entries) are available for queries. Since its introduction, the weeding program has removed 75 entries from the database (about 4.5% of the current database size); these documents were found to no longer exist, or to reside on servers that no longer exist.

These figures correspond to an average daily Web user base of about 30 users at AWI and a cache size of 60 MBytes, which is usually between 30 and 40 MBytes full.
Figure 4 shows the growth in the number of database entries over the first 177 days of the database's existence.
The number of documents is not large when compared with other databases, such as the WebCrawler; however, it constitutes a reasonable number of documents in this particular subject area, even if fewer than 60% of them are relevant. Just after day 90 the graph appears to flatten, which is a result of the introduction of the weeding program. At around day 170 the number of database entries starts to rise more steeply as a result of greater Web browsing activity within the institute.
The institute itself has about 400 scientists, the majority of whom have a computer of some sort, so the actual number of active users is currently small. As Web usage gradually gains greater acceptance within the institute, it is anticipated that the number of documents will continue to increase.
Conclusion

This experiment has shown that a useful Web resource can be created simply by using documents collected in the cache of a proxy server. The weeding mechanism ensures that the majority of the database contents remain usable, through automatic means that place a minimal load on the network.

Currently the regular Web user base at AWI is quite small, so as Web usage continues to develop it is hoped that the size and usefulness of the database will gradually increase.

Improvements and further work on the database can take place on two main fronts:

An Improved Document Selection Mechanism

As has already been stated, about one third of the database entries are known to be `noise'. A more intelligent document selection program would probably require a disproportionate increase in complexity for a reduction in the number of false positives.

On the other hand, at present we do not have any estimate of how many relevant or interesting documents found in the cache fall through the keyword filter mechanism. Some experimentation, perhaps with a more systematically generated keyword list or with the use of a thesaurus, may reduce the number of false negatives, that is, the number of interesting documents in the cache that are not selected by the keyword filter.

Combining Different Resource Collection Mechanisms

Nevertheless, it is doubtful whether one type of resource cataloguing mechanism is sufficient when applied to a system as diverse and dynamic as the Web. The combination of cache analysis with some other system, such as directed, automatic document retrieval, may create a more complete resource.
References