WebTracker is a Web service designed to address these needs. It allows users to express interest in a set of URLs, or in a set of patterns for URLs. When such URLs change or are created, the user is notified.
In particular, the Web provides no built-in support for monitoring or tracking such documents, whereby a user is notified whenever a document changes in an interesting way and can easily see that change. A few attempts have been made to address this gap, as discussed in the Related Work section, but none were powerful enough for our needs. This paper presents our own attempt, a system we call WebTracker.
WebTracker allows users to monitor a set of web pages (URLs), with the following features:
When the user clicks on a URL title, they are presented with more detailed information about that URL: its size, the set of other users interested in that URL, when it was last downloaded, and so forth. By clicking on a button, the user subscribes to that URL: whenever the URL changes, the user is sent e-mail notifying them of the change and letting them easily see the difference.
For example, the figure below shows the information presented to a user who is interested in tracking the URL for the WWW6 conference: http://www6conf.slac.edu. The name 'fishkin' is in bold, indicating that this is the name of the current user; if 'fishkin' were not presently subscribed to this URL, the 'unsubscribe myself' button would instead read 'subscribe myself'.
The user can interact with the system at three levels. At the top level, the user sees all or some of the tracked URLs. At the second level, the user sees the information associated with a particular URL, as shown in the previous figure. At the third level, the user sees 'sub-URLs' obtained by pattern search on the contents of a particular URL, as will be described later. The user interface uses 'progressive disclosure', or holophrasting, to accomplish this: at each level, the user is shown the surrounding context of the higher levels, while progressively more detailed information is shown for their current level.
Presently, a change is shown by comparing the two most recent versions of the URL contents with the Unix diff utility and presenting the output to the user.
For example, the figure below shows the information the user receives when checking a URL which contains scheduling information for a conference room: the 'diff' output shows the user, without having to parse the entire file, that a talk by Hadar Shemtov is now tentatively scheduled for 3/11/97.
The difference between the last two versions of URL http://parcweb.parc.xerox.com/project/istl/whistle/schedule.text

14c14
< 3/11/97
---
> 3/11/97 Hadar Shemtov: Tentative

click here to return to the application.
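The comparison step can be sketched in a few lines. This is only an illustration, not WebTracker's code: it uses Python's `difflib` in place of the Unix diff utility, so the output is in unified format rather than the classic `14c14` format shown above. The function name `describe_change` and all sample schedule lines other than the 3/11/97 entry are assumptions made for the example.

```python
import difflib


def describe_change(old_text: str, new_text: str, name: str = "schedule.text") -> str:
    """Return a unified diff between two cached versions of a page.

    An empty string means the two versions are identical.
    """
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    diff = difflib.unified_diff(
        old_lines,
        new_lines,
        fromfile=name + " (previous)",
        tofile=name + " (latest)",
        n=0,  # no context lines: show only the lines that changed
    )
    return "".join(diff)


# Hypothetical cached versions; only the 3/11/97 line comes from the example above.
previous = "3/10/97\n3/11/97\n3/12/97\n"
latest = "3/10/97\n3/11/97 Hadar Shemtov: Tentative\n3/12/97\n"

report = describe_change(previous, latest)
print(report)
```

As with the diff output in the figure, the reader sees only the changed line and does not have to scan the whole schedule.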
Often, a user isn't interested in a URL per se, but rather in some of the URLs referenced by that URL. For example, consider a 'download' page listing the most recent drivers, patches, etc. for users to download. When this page changes, the user is not interested in the change to the page itself, but rather in the fact that a new downloadable document has been posted.
Accordingly, WebTracker allows the user to specify a set of regular expressions (filters) to associate with a particular URL. When a new embedded URL is found in the host URL, if that URL fits the given filter, the embedded URL is automatically downloaded and made available to the user.
For example, in the figure below, WebTracker is tracking a URL used by Creative Labs, makers of PC sound cards, to present drivers and patches for their cards. There are two 'filters' that users have added to this: any reference within this URL to another URL which ends with '.exe', or one which ends in '.zip', will be automatically downloaded. At present, 9 such .exe files have been found and downloaded, and 0 .zip files.
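A minimal sketch of how such filters might be applied follows. It is not the WebTracker implementation: the class `LinkCollector`, the function `matching_sub_urls`, the `example.com` parent URL, and the sample HTML are all assumptions made for illustration. It collects the links embedded in a page and keeps those that match each regular-expression filter.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the absolute URLs of all <a href="..."> links in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value:
                    # Resolve relative links against the parent URL.
                    self.links.append(urljoin(self.base_url, value))


def matching_sub_urls(page_html, base_url, filters):
    """Map each filter (a regular expression) to the embedded URLs it matches."""
    collector = LinkCollector(base_url)
    collector.feed(page_html)
    return {
        pattern: [url for url in collector.links if re.search(pattern, url)]
        for pattern in filters
    }


# Hypothetical parent page with one driver file and one ordinary link.
page = '<a href="3db/3dpb73p1.exe">driver</a> <a href="readme.html">notes</a>'
found = matching_sub_urls(
    page,
    "http://example.com/drivers/index.html",
    [r"\.exe$", r"\.zip$"],
)
print(found)
```

Here the two patterns play the role of the '.exe' and '.zip' filters in the example above; a URL matched by a filter would then be handed to whatever code downloads and caches it.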
Just as a URL is downloaded when it first appears via a filter, when such a URL is no longer found via the filter (i.e. it is no longer referenced by the parent URL), the sub-URL and its associated local storage are deleted. For example, one of the URLs we track is the download page for the latest Norton anti-virus files. When the anti-virus files for April, for example, are posted, the anti-virus file for March will no longer be referenced by the parent URL, and accordingly will be deleted locally.
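The bookkeeping described here amounts to two set differences, which the following sketch illustrates. The function name `plan_sync` and the `example.com` file names are hypothetical, and the actual download and deletion steps are left out.

```python
def plan_sync(cached_urls, current_urls):
    """Decide which sub-URLs to fetch and which cached copies to delete.

    cached_urls:  sub-URLs for which a local copy currently exists
    current_urls: sub-URLs the filter finds in the parent page right now
    """
    cached = set(cached_urls)
    current = set(current_urls)
    to_download = sorted(current - cached)  # newly referenced by the parent
    to_delete = sorted(cached - current)    # no longer referenced by the parent
    return to_download, to_delete


# Hypothetical anti-virus download page: April's file replaces March's.
to_download, to_delete = plan_sync(
    cached_urls=["http://example.com/av/march.zip"],
    current_urls=["http://example.com/av/april.zip"],
)
print(to_download)
print(to_delete)
```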
Greetings from WebTracker ( http://girweb/cgi-bin/webtracker )

Some of the URLs which I am tracking for you have changed. In particular:

1) URL http://www.creaf.com/creative/drivers/3db/3dpb73p1.exe ,
   obtained via the '*.exe' filter of the 'drivers for Creative Labs (PC sound cards)'' nugget.
   has been downloaded for the first time.
   a local cached version has been stored in /project/webtracker/data/3dpb73p1.exe
   a URL for this is http://parcweb/project/webtracker/data/3dpb73p1.exe
   WebTracker's page with information about this nugget is http://girweb/cgi-bin/webtracker?op_16_14_*.exe=x

2) URL /www6 titled 'WWW6 Details' has changed.
   a local cached version has been stored in /project/webtracker/data/url_b1957.html
   a URL for this is http://parcweb/project/webtracker/data/url_b1957.html
   a URL to see what changed is: http://girweb/cgi-bin/webtracker?op_10_11_X=X
   WebTracker's page with information about this nugget is http://girweb/cgi-bin/webtracker?focus=11#L11

Yours,
WebTracker ( http://girweb/cgi-bin/webtracker )
Finally, the Grassroots architecture from Stanford incorporates such URL tracking within its system for document management. However, Grassroots is an architecture, not an implementation of a service.
The WebTracker system has been in use amongst the employees of the Information Sciences and Technology (ISTL) group at Xerox PARC, a group of roughly 40 people, for a bit over a month. Currently, WebTracker is tracking 56 URLs (25 specified directly, and 31 through pattern-matching), and is notifying 13 users.
The system is implemented in GNU G++, and runs on a Sun SPARCstation 10 under SunOS 4.1.3. URL monitoring is performed twice nightly: once at 1 AM, and again at 3 AM. This twice-nightly monitoring is done to reduce the odds that a particular server will be down when the monitoring is performed. There is no particular architectural limitation on the frequency of the monitoring.
To support collaborative use of the system, semaphores are used to ensure that multiple users can safely access and change the WebTracker database at the same time.
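The idea can be illustrated with a binary semaphore guarding a shared table of subscriptions. This is a sketch under stated assumptions, not the WebTracker code: threads within one Python process stand in for concurrent users, and the class `TrackerDatabase` with its in-memory dictionary is a hypothetical substitute for the real database.

```python
import threading


class TrackerDatabase:
    """Toy subscription table whose updates are serialized by a semaphore."""

    def __init__(self):
        # A semaphore with a count of 1 admits one reader or writer at a time.
        self._guard = threading.Semaphore(1)
        self._subscribers = {}

    def subscribe(self, url, user):
        with self._guard:
            self._subscribers.setdefault(url, set()).add(user)

    def unsubscribe(self, url, user):
        with self._guard:
            self._subscribers.get(url, set()).discard(user)

    def subscribers(self, url):
        with self._guard:
            return sorted(self._subscribers.get(url, set()))


db = TrackerDatabase()
url = "http://www6conf.slac.edu"

# Twenty hypothetical users subscribe to the same URL concurrently.
workers = [
    threading.Thread(target=db.subscribe, args=(url, f"user{i}"))
    for i in range(20)
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()

print(len(db.subscribers(url)))
```

Because every read and write acquires the semaphore first, no update is lost even when many users modify the same entry at once.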