Caching schemes have been proposed for WWW clients to reduce network traffic and improve document access latency. Collaborative caching of web documents at the client end has been shown to be an effective technique for reducing web traffic and improving access latencies. This paper proposes a distributed scheme for collaborative caching, in which the proxy server does not cache actual documents but maintains an index of the local caches of the individual users it services. The proxy server itself is distributed in nature, which ensures high availability and load balancing.
The enormous growth of WWW-based services has made network traffic a major concern over the last few years. Long delays in web document retrieval are common, due to slow connections, network congestion, remote server overload, etc. Most WWW clients use memory and disk caches to speed up access to frequently used web documents.
In collaborative caching schemes, read access to cached documents is given to a cooperating group of users. Such integration creates the illusion of a single cache of much bigger size. Since all users of the group share the same integrated cache, the probability that the next accessed document is found in the cache is higher. Collaborative caching schemes therefore provide a better hit ratio than individual caching schemes; hit ratios in the range of 30-50% have been reported in [1]. Caching proxy servers implement collaborative caching.
Proxy servers control all accesses to the web from the subnet they service. Caching proxies can be designed in a centralized or a distributed fashion. In the centralized design, in addition to servicing the requests of clients from the subnet, the proxy also caches documents in its local cache. The cache maintained by the proxy thus contains the documents accessed by all the individual clients, and making it available to everyone ensures a better cache hit rate [4]. The disadvantages of this scheme arise from the centralized nature of the proxy server and its cache: systems that use centralized servers are not considered fault tolerant, and the server can become a hot spot and a bottleneck when the load is high. These issues become more prominent when a single proxy is the only gateway to the Internet for a large group of users. Hence a proxy server serving a large group of people generally needs a lot of resources.
This paper proposes a distributed alternative to collaborative caching (henceforth referred to as DCC), in which the proxy server does not actually cache any documents but maintains an index of all the documents cached by its clients within the subnet. The proxy itself is designed as a distributed server to provide high availability and reliability.
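As a rough illustration, the proxy's index can be thought of as a map from a URL to the client holding a cached copy, with the fields returned on a hit. The following Python sketch is purely illustrative; all class and method names are our own, not taken from the DCC implementation:

```python
import time


class CacheIndex:
    """Illustrative proxy-side index: URL -> (user_id, filename, last_update).

    The real DCC index layout is not specified beyond the fields it
    returns on a hit, so this is a minimal dictionary-backed sketch.
    """

    def __init__(self):
        self._entries = {}

    def register(self, url, user_id, filename, last_update=None):
        # Record that user_id now holds a copy of url in its local cache.
        stamp = time.time() if last_update is None else last_update
        self._entries[url] = (user_id, filename, stamp)

    def lookup(self, url):
        # On a hit, return (user_id, filename, last_update); else None.
        return self._entries.get(url)

    def invalidate(self, url):
        # Drop the entry, e.g. when a client evicts the document.
        self._entries.pop(url, None)


# Example: register one cached document and look it up.
index = CacheIndex()
index.register("http://example.com/a.html", "alice", "cache/a.html", 100.0)
hit = index.lookup("http://example.com/a.html")
```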
New technology trends make the performance of this scheme comparable to other collaborative caching schemes [6] [3] [5]. First, LANs are becoming much faster than WANs. In addition to its lower bandwidth, WAN traffic uses heavyweight protocols, which worsen the latency of short messages. LANs, on the other hand, have higher bandwidths, and lightweight protocols developed for high-speed LANs benefit short messages between machines [9] [10] [11]. Hence, message passing within the distributed proxy server is always fast, and the time required to consult local machines is a small fraction of the time required to fetch a document from a remote WWW server. Second, individual workstations now have fast processors and high-capacity local disks, which allow large disk caches for individual clients.
Since the DCC scheme makes use of data cached at the individual clients, new security issues arise that do not occur in a centralized caching scheme.
This section provides a brief overview of the DCC system. DCC is being implemented on a network of Sun workstations running Solaris 2.5.1, connected by Myrinet [12]. The WWW client used is Netscape 3.01 Gold.
The DCC system consists of a distributed proxy server and WWW clients. The clients referred to above differ from currently existing WWW clients (e.g., Netscape) in that they have the additional functionality of receiving requests for cached documents from the proxy server and servicing them. Since no commercially available clients (including Netscape) support this, there are two ways to provide the facility: modify the client itself to service such requests, or run a separate user stub process that serves documents from the client's cache on its behalf.
We chose the second option, in spite of the difficulties involved, for two reasons: the first option needs access to the Netscape sources, which we do not have, and the second option is more general purpose and can be used with other clients with little modification.
The basic block diagram of the system is shown below.
The WWW client has to be configured for the proxy connection, so any request sent by the client first goes to the proxy. The proxy looks up the cache index to locate the document within the subnet; on success, the lookup returns the user id of the user who has cached the document, the filename, and the date and time of the last update. If the document is located, two requests are sent: one to that user's stub for the cached copy, and a conditional request to the remote server carrying the last-update date.
If the reply from the remote server indicates that the document has not changed since the last-update date supplied, the document received from the user stub is sent to the requesting client. On the other hand, if the remote server sends the document itself, that copy is sent to the requesting client and the proxy's index of locally cached documents is updated.
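The hit path described above can be sketched as follows. Here `fetch_from_stub` and `conditional_fetch_remote` are hypothetical stand-ins for the proxy-to-stub and proxy-to-server protocols; neither name comes from the implementation, and the real proxy issues the two requests concurrently rather than in sequence:

```python
def handle_request(url, index, fetch_from_stub, conditional_fetch_remote):
    """Sketch of the DCC lookup path (illustrative names throughout).

    index: dict mapping url -> (user_id, filename, last_update)
    fetch_from_stub(user_id, filename): returns the cached document
    conditional_fetch_remote(url, since): returns None if the document
        is unchanged since `since`, else the fresh document
    """
    entry = index.get(url)
    if entry is None:
        # Miss: no client in the subnet has the document; fetch it
        # unconditionally (indexing the new copy happens elsewhere).
        return conditional_fetch_remote(url, since=None)

    user_id, filename, last_update = entry
    fresh = conditional_fetch_remote(url, since=last_update)
    if fresh is None:
        # Remote server reports no change: serve the locally cached copy.
        return fetch_from_stub(user_id, filename)
    # Document changed: serve the fresh copy from the remote server.
    return fresh
```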
In addition to serving the requested document, the stub also supplies the proxy with a list of related documents that the client has cached. The proxy uses this data to maintain a TLB.
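The exact structure of this TLB is not given here; one plausible form, assumed purely for illustration, is a small bounded map of recently reported holdings with oldest-first eviction:

```python
from collections import OrderedDict


class RelatedDocTLB:
    """Bounded recently-used map: url -> (user_id, filename).

    Illustrative sketch only; the capacity, eviction policy, and
    names are our assumptions, not the DCC implementation's.
    """

    def __init__(self, capacity=256):
        self.capacity = capacity
        self._table = OrderedDict()

    def record(self, url, user_id, filename):
        # Insert or refresh an entry, evicting the oldest when full.
        if url in self._table:
            self._table.move_to_end(url)
        self._table[url] = (user_id, filename)
        if len(self._table) > self.capacity:
            self._table.popitem(last=False)

    def probe(self, url):
        # Fast check before consulting the full cache index.
        return self._table.get(url)
```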
Since each individual cache is under the ownership of its user, many interesting security and protection problems arise. This design of DCC attempts to provide very secure access to the cached documents. Also, keeping in mind that most clients were not designed for collaborative caching, DCC is kept completely transparent to the client: only the proxy server and the user stubs know about the caching scheme.
The second scheme is better than the first, as it avoids message passing on the critical path, but it has the additional overhead of maintaining the consistency of the tables and dealing with inconsistencies.
Every process of the proxy server maintains its own copy of the usertable and the TLB on its machine.
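As a sketch of this arrangement, each proxy process could keep a private copy of the usertable and apply updates broadcast by its peers; how inconsistencies between copies are detected and resolved is part of the consistency scheme and is outside this fragment. All names here are illustrative:

```python
class ReplicatedTable:
    """One proxy process's private copy of a shared table (sketch).

    Each process holds its own dictionary; updates arrive via
    broadcast from whichever process made the change.
    """

    def __init__(self):
        self._local = {}

    def apply_update(self, key, value):
        # Install an update received from any proxy process.
        self._local[key] = value

    def get(self, key):
        return self._local.get(key)


def broadcast(replicas, key, value):
    # Propagate one update to every process's local copy.
    for replica in replicas:
        replica.apply_update(key, value)
```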
The above measures ensure that, while and after receiving a document, the user is completely unaware of its source: either the remote server or the local cache of another user. The UNIX protection bits also prevent users from looking into the caches of other users. A covert channel may exist: from the time required to access a document, users may conclude that it was served from a local cache, but they will be unable to identify the user who supplied it.
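The reliance on standard UNIX protection bits can be illustrated with a short sketch that restricts a cached file to its owner, so that only the user stub, running as that owner, can serve it (the helper name is ours):

```python
import os
import stat
import tempfile


def protect_cache_file(path):
    """Restrict a cached document to owner read/write (mode 0600).

    Illustrative helper: with these bits set, other users on the
    machine cannot read the cache file directly; only the owner's
    user stub can serve its contents.
    """
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    return stat.S_IMODE(os.stat(path).st_mode)


# Example: create a scratch cache file and lock it down.
fd, path = tempfile.mkstemp()
os.close(fd)
mode = protect_cache_file(path)
os.remove(path)
```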
We run the processes of the proxy on a set of 8 machines under a common namespace (e.g., the machines in room 214 of Pond Lab are collectively called Pond-214). Specifying Pond-214 as the location of the proxy provides the flexibility to run a distributed proxy while still presenting the interface of a centralized proxy.
This system is being developed on a Network of Workstations (NOW) platform. We expect the performance to be comparable to that of centralized caching schemes under low load. Under heavy load, due to its distributed nature, we expect better response times, load balancing, and no hot-spot effect, with additional reliability and availability and no significant dedicated hardware requirements.
Although the current implementation is for the NOW platform, the scheme can be implemented on any network, possibly with a slight degradation of performance.
The web is moving towards a cached and mirrored architecture. Caching clients, servers, and intermediate agents are coming into common use, so a robust design of these components is important. NOW is a fast and reliable platform for providing them. The DCC scheme described above runs on a network of workstations, can provide performance comparable to centralized caching proxy servers, and is still highly available and reliable.
[1] Marc Abrams et al., Caching Proxies: Limitations and Potentials, Computer Science Department, Virginia Tech, Blacksburg, VA 24061-0106, USA.
[2] Michael D. Dahlin, Randolph Y. Wang, Thomas Anderson, David Patterson, Cooperative Caching: Using Remote Client Memory to Improve File System Performance, OSDI 1994.
[3] Chanda Dharap and Mic Bowman, Rudimentary Type Analysis of Wide-Area Accesses, Technical Report TR CSE-96-044, Department of Computer Science and Engineering, The Pennsylvania State University.
[4] Chanda Dharap and Mic Bowman, Preliminary Analysis of Wide-Area Access Traces, Technical Report CSE-95-030, The Pennsylvania State University.
[5] Chir Ben Abdelkader, A Prefetching Scheme for the World Wide Web, MS Thesis, Department of Computer Science and Engineering, The Pennsylvania State University, 1997.
[6] Cache Now! Campaign, http://vancouver-webpages.com/CacheNow/detail.html
[7] Various technical reference pages, Netscape Communications Corporation, http://www.netscape.com
[8] Yennun Huang and Chandra Kintala, Software Fault Tolerance in the Application Layer, Chapter 10 in Software Fault Tolerance, edited by Michael R. Lyu, John Wiley & Sons.
[9] S. Pakin, M. Lauria, and A. Chien, High Performance Messaging on Workstations: Illinois Fast Messages (FM) for Myrinet, Proceedings of Supercomputing '95, December 1995.
[10] T. von Eicken, A. Basu, V. Buch, and W. Vogels, U-Net: A User-Level Network Interface for Parallel and Distributed Computing, Proceedings of the 15th ACM Symposium on Operating System Principles, December 1995.
[11] T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser, Active Messages: A Mechanism for Integrated Communication and Computation, Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 256-266, May 1992.
[12] N. J. Boden et al., Myrinet: A Gigabit-per-Second Local Area Network, IEEE Micro, 15(1):29-36, February 1995.
[13] T. Anderson et al., A Case for Networks of Workstations, IEEE Micro, pages 54-64, February 1995.
Acknowledgements: We thank Dr. Anand Sivasubramaniam and Dr. Thomas Keefe for their encouragement and help.