4. Document Repository

The URL translation mechanism described in the previous section enhances file system access to WWW documents that exist in a locally mounted wide-area file system. This section describes a natural extension that uses the file system to mirror documents that are not otherwise available through the wide-area file system. A server constructs a document repository that clients access through the file system interface. The technique is similar to several recent projects that construct an intermediate document cache to improve performance [6].

4.1 Building a Repository

The document repository is created and maintained by a repository server. As shown in Figure 1, the repository server periodically reads a list of URLs to cache, retrieves the documents, stores them in the file system, and updates a file of URL translations. Currently, the list of URLs to cache in the repository is constructed by hand from history files, hot lists, and personal requests. We are developing a tool that will enable each user to maintain a list of URLs that should be cached. Several heuristics are used to remove from the list the URLs of documents that change frequently, such as queries, weather maps, and "what's new" pages.

Since documents can change over time, a refresh interval is associated with each URL. The refresh interval determines how often the repository server retrieves the document. A document signature is computed for each file that is transferred. If the signature is identical to the signature computed the last time the document was retrieved (i.e., the document has not changed since then), the refresh interval is doubled, up to a maximum of 32 days. If the document has changed, the refresh interval is reset to the default value, currently one day. If the interval is already at the default when a change is detected, the URL is removed from the repository list.
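The refresh policy above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the repository server's actual code: the names Entry, signature_of, and refresh are hypothetical, SHA-256 stands in for the unspecified signature function, and the handling of the very first retrieval (no earlier signature to compare against) is a guess.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

DEFAULT_INTERVAL_DAYS = 1   # default refresh interval given in the text
MAX_INTERVAL_DAYS = 32      # upper bound on the interval given in the text


@dataclass
class Entry:
    """Hypothetical record for one URL on the repository list."""
    url: str
    interval_days: int = DEFAULT_INTERVAL_DAYS
    signature: Optional[str] = None


def signature_of(content: bytes) -> str:
    # Assumption: the paper does not name the signature function;
    # SHA-256 is used here purely as a stand-in.
    return hashlib.sha256(content).hexdigest()


def refresh(entry: Entry, content: bytes) -> bool:
    """Update an entry after a retrieval.

    Returns False when the URL should be removed from the repository list.
    """
    new_signature = signature_of(content)
    if entry.signature is None:
        # Assumption: on the first retrieval there is nothing to compare,
        # so the entry simply keeps the default interval.
        entry.signature = new_signature
        return True
    if new_signature == entry.signature:
        # Unchanged document: double the interval, capped at the maximum.
        entry.interval_days = min(entry.interval_days * 2, MAX_INTERVAL_DAYS)
        return True
    # Changed document.
    entry.signature = new_signature
    if entry.interval_days == DEFAULT_INTERVAL_DAYS:
        # Changed again while already at the default interval: drop the URL.
        return False
    entry.interval_days = DEFAULT_INTERVAL_DAYS
    return True
```

Under this reading, a stable document is fetched after 1, 2, 4, 8, 16, and then every 32 days, while a document that changes on two consecutive daily retrievals leaves the list.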

To communicate the contents of the repository to the client, a URL translation file is created for the repository. The translation file contains a one-to-one translation for each document in the repository. For example, the following translation is for a document available through the server running at NCSA.

		http://www.ncsa.uiuc.edu/    file:/afs/transarc.com/Cache/rep0.html

Clients include the repository translation file when the service is initialized. Any access to a URL in the repository is translated into a request for the corresponding file.
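As a rough illustration of how a client might load such a file, the sketch below parses lines of the form shown in the example (a URL, whitespace, then a file: target) into a lookup table. The exact file syntax is not specified in this section, so the two-field format and the function names parse_translations and translate are assumptions.

```python
from typing import Dict


def parse_translations(text: str) -> Dict[str, str]:
    """Parse translation lines of the assumed form '<url> <file-url>'."""
    table: Dict[str, str] = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        url, target = fields
        table[url] = target
    return table


def translate(url: str, table: Dict[str, str]) -> str:
    """Return the file-system target for a URL, or the URL itself if none."""
    return table.get(url, url)


table = parse_translations(
    "http://www.ncsa.uiuc.edu/    file:/afs/transarc.com/Cache/rep0.html\n"
)
print(translate("http://www.ncsa.uiuc.edu/", table))
```

A URL with no entry in the table is returned unchanged, so the client would fetch it from its original server as usual.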

If several repositories are available in the file system, the client includes repository translation files in the preferred order of access: local repositories first, remote repositories last. When accessing a document, the client attempts translations in the order in which they are included. Thus, the client first attempts to access a document through local repositories. Only if the document is unavailable in a local repository will the client attempt to load it from a remote repository.
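The ordered lookup can be sketched as follows. This is a minimal illustration, assuming each repository's translations are already held as a table mapping URLs to file: targets; the function names and the availability check are hypothetical and are not taken from the client described here.

```python
import os
from typing import Callable, Dict, Sequence


def file_url_exists(file_url: str) -> bool:
    """Check availability by testing the path after the 'file:' prefix."""
    return os.path.exists(file_url[len("file:"):])


def resolve(
    url: str,
    tables: Sequence[Dict[str, str]],
    available: Callable[[str], bool] = file_url_exists,
) -> str:
    """Try each repository's translations in order of preference.

    `tables` is ordered local-first, remote-last. The first translation
    whose target is available wins; otherwise the original URL is returned
    so the document can be fetched from its origin server.
    """
    for table in tables:
        target = table.get(url)
        if target is not None and available(target):
            return target
    return url
```

Passing the availability check as a parameter keeps the ordering logic separate from the file system, which makes the fallback behavior easy to exercise without a mounted repository.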


Figure 1: A file system-based repository server.


4.2 Performance Impact

A document repository attempts to mirror in the file system the hot set of documents that are accessed by an individual or an organization. Local access through the file system is significantly faster than remote access through protocols such as HTTP or FTP. In addition, documents in the repository benefit from file system facilities for client caching and server replication as mentioned in Section 2.

The repository is not a cache. A cache saves a document only when it is accessed (i.e., a cache is strictly reactive), and it saves every accessed file. A repository instead mirrors a fixed set of documents, an approximation of the hot set shown to exist in Section 2, whether or not they have been accessed recently. In other words, the repository is predictive. Given the large percentage of documents that are accessed only once, we believe that this approach makes better use of disk space and network bandwidth. We are studying the performance impact of the repository.


Mirjana Spasojevic, C. Mic Bowman, and Alfred Spector.