WCP: a tool for consistent on-line update of documents in a WWW server
Sampath Rangarajan (a), Shalini Yajnik (a) and Pankaj Jalote (b)

(a) Bell Laboratories, Lucent Technologies, 600 Mountain Ave., Murray Hill, NJ, U.S.A.
sampath@lucent.com, shalini@lucent.com

(b) Infosys Technologies, Bangalore, and Indian Institute of Technology, Kanpur, India
jalote@iitk.ernet.in
Abstract
With the growing use of the Web infrastructure as an information provider, it has become necessary to consider the problem of accessing documents from a WWW server in a consistent fashion. When a group of related documents that is updated frequently is accessed, it is desirable to provide an access semantics in which either old copies of all the documents in the group are provided or new copies of all of them are provided. Because Web servers are stateless, they maintain no information about the sequence of documents that are retrieved. Hence, this type of semantics is not provided for a group of documents, and special measures need to be taken to support it. We describe a tool, WCP (Web-Copy), that facilitates the on-line update of documents such that the above access semantics is ensured within a persistent HTTP connection, without making any changes to the WWW server program.
Keywords: On-line update; Document groups; Group consistency; WWW servers
1. Introduction
The World Wide Web (WWW) has revolutionized the way we think of computing environments, and this revolution is here to stay and grow. Though there are many potential ways in which the Web infrastructure can be deployed, its main use currently is to provide a massive information and commerce service. In this application,
a client specifies the document object it wants through a URL; the document is then read via the HTTP protocol and its contents are passed from the server to the client. This type of access by the client is strictly read-only. Multiple clients can ask for the same document from a server. The document objects on the server are maintained and updated by whoever controls the server, say the Webmaster; any update to the information on the server has to be performed by this server controller.
If a document object is updated without any consideration for the HTTP connections accessing the document on behalf of clients, the information provided to some clients that are accessing the document while the update is being performed can be inconsistent (i.e. the information is neither new nor old). The current update approaches, which either (a) exercise no control, i.e., accept that a few clients may get inconsistent information, or (b) make the service unavailable for some time while the update is being done, are acceptable in most situations. The approach of making the information unavailable will not suffice if the server is updated frequently with new information (e.g. any server providing real-time information such as stock quotes or airline arrival and departure information) and the server is heavily in demand. And in cases where it is important for the client to get consistent information, e.g. stock quotes, the approach of taking no precautions while updating is not sufficient. Clearly, for those servers which are heavily in demand and would like to provide consistent information to clients at all times, there is a need for mechanisms that perform on-line updates of information such that correct and consistent information is provided to clients without disrupting the service. Although the need for consistency is currently not perceived as significant, we believe that as the demand on some servers grows and the information on them becomes more dynamic, the demand for consistency will also arise. Information providers will become more interested in providing consistency if it can be achieved at a low cost. In this work, our aim is to bring forth the existence of the problem and to provide a simple solution for it that requires no changes to the HTTP protocol, the HTTP server, or the user agent code.
The problem of consistency of information arises when some logically
single piece of information is provided through a collection of objects.
For example, an article or a book which is a logical piece of information
may be provided as a collection of different chapter documents. To provide
consistency to a group of documents, we define a logical session.
A logical session consists of a number of requests by a single client for
a set of logically related documents. Currently, the HTTP protocol (HTTP/1.0
[1]) is such that in each TCP connection one document
is transferred. The stateless property of the HTTP servers makes
them respond to each client request without relating it to any other request.
However, for placing a request in the context of a session, we need to
have some state information for a group of related requests. Due to the
lack of a state management mechanism [3] in the currently
implemented HTTP protocols (HTTP/1.0 and HTTP/1.1), we will use persistent
HTTP connections as a substitute for logical sessions, i.e. we will
provide consistency for a group of documents accessed over a single
persistent HTTP connection.
Unlike HTTP/1.0, the newer version of the protocol, HTTP/1.1 [2], by default allows any number of document objects to be transferred in one TCP connection, until the connection is closed either by the server or the client. Although these persistent connections were meant to reduce TCP overhead, another motivation for them is the fact that different objects on a server are not always independent and some objects may form a logical group. With this model, the unit of consistency of information may become a group of document objects, and each persistent connection can be treated as a read transaction, reading many objects. As is well known from the database context, when multiple objects are logically connected, reading and updating them becomes more complex. Our goal in this work is to develop schemes for on-line updates of document objects such that consistent access to these documents within a single persistent HTTP connection is ensured. We also describe a tool, WCP, that incorporates the on-line update schemes.
In the next section, we discuss the system and consistency model that
we consider and describe an analytical model that provides a quantitative
handle on the consistency problem. Section 3
gives a brief overview of the features in the HTTP protocol that are used
by WCP. Section 4 describes in detail our on-line
update schemes and Section 5 describes the
WCP tool. Conclusions are presented in Section
6.
2. System model and consistency
The following characterizes our model of the Web being used as an information service. Though the Web can also be used to perform update operations on server data, through forms and CGI programming, we restrict our attention to the case where the Web is used as an information provider with the following characteristics.

- One server and multiple clients. When the Web is used as an information service, there is usually one information provider and several clients accessing that information. That is, there is one writer process and multiple readers which read the documents. A reader issues requests to read one document at a time, even if a group of documents is logically related or documents are embedded within a document. This is a fundamental difference from the general database model.

- Multiple documents on a server may be related, and an update may require changes to many of the documents on the server, e.g. a server may contain files representing many chapters or components of a document.

- The server-side process serves all incoming requests using the HTTP protocol. The HTTP server reads a configuration file at initialization, and this configuration file provides all the parameters to set up the internal state of the server.
Consider the following example, which illustrates how uncontrolled updates can lead to clients getting inconsistent information. Suppose a document f1 contains embedded documents f2, f3, f4 and f5. Documents f1 through f5 form a logical group. If a client issues an HTTP request for document f1, the client-side browser first fetches document f1. Then it fetches the embedded documents f2 through f5, one at a time. Assume that, on the server side, documents f1, f2, ..., f5 are updated between the time file f2 is fetched by the client and the time a request for f3 is sent to the server. Now, when the client-side browser fetches document f3, it will get the new updated version of document f3. This may contain information which is inconsistent with the old versions of f1 and f2. In the above scenario, in order to keep the information given to a client consistent, we would like the client to get either the old copies of all the documents or the new copies of all the documents in this logical group. This is one simple way to define a group, and this simple case could be solved by replacing the references to the embedded documents with the correct versions before serving the document to the client. However, there can also be documents which are not related through embedding but still form a logical group; the information provider may define the dependencies between documents and thereby form a group. Let us now define the notion of consistency in this general context of a group.
Group consistency: When a client is reading a logical group of related documents, the consistency requirement is that the client either gets old copies of all the documents or new copies of all the documents. A logical group of documents is taken to be one where all the documents in the group are read within one persistent HTTP connection. This is like the transaction model, which requires atomicity.

Note that in the above definition, we couple the object model with a persistent HTTP connection and define a group of documents to be those documents that are provided over one single persistent HTTP connection. We found that this is one definition that is implementable without changing the server code. Stronger definitions may be possible, but those would require changes to the server program. The server has enough information about a group and can keep a persistent connection open until all the documents in the group are transferred, but it has no control over when the client may choose to close a connection. Thus, the server only guarantees that if the client does not close the connection before all the documents in the group are transferred, then the documents will be consistent.
Timeliness: Once the information is updated, all new client requests must get only the new information.
Note that the consistency and system model are not the same as in the database context. There is only one writer process, and there are no update-type transactions that might have to be "undone". The database transaction type of model is not suitable for the Web; furthermore, the cost of implementing such a model is known to be high. One of the reasons for the popularity of HTTP is that it was designed to be a very light-weight protocol. Therefore, any solution for ensuring consistency of information should not add substantial overhead to the communication between the client and the server; only if the solution has minimal overhead will it be acceptable to the WWW community. In addition, the solution should either not modify the HTTP server at all or require only limited modifications to it. No modifications should be required on the client side, since client-side changes require modifications in the browsers, and the solution should not require any changes to the HTTP protocol. Our solution in this paper does not require any changes to the server program.
Before we discuss the solution, let us better understand how uncontrolled updates can affect consistency. We developed a simple analytical model for a Web server where highly accessed documents (for example, stock quotes, which may be accessed very frequently) undergo uncontrolled on-line updates. We assume that when a group of documents is being updated, it is updated one document at a time (perhaps by overwriting each old document with its new version). This approach will not satisfy the group consistency requirement. The analytical model computes the probability that at least one client is accessing the group of documents while the uncontrolled update is in progress. This probability is referred to as the interference probability. A group of documents may be a document and its associated embedded documents, and we refer to these documents collectively as a group.

Using actual measurements of the time needed to move a file and of read transfer times for groups of different sizes (our measurements were done on a SPARCstation 20 running Solaris), we computed the interference probabilities for different file sizes and read rates using our model. As expected, we found that the interference probability increases with file size and decreases almost linearly as the read request rate decreases. For example, for a group size of 100 Kbytes with read requests arriving at a rate of 0.1 requests/sec, 2% of the time a read request will interfere with an update and may lead to inconsistent data being read. For large group sizes of 5 and 10 Mbytes, the interference probability exceeds 50% if the read request rates are high. This shows that in systems where constantly changing documents are accessed at a high rate, there is a very high probability that at least one read of a document in a group interferes with an update, which may lead to inconsistent data being retrieved. Therefore, if we need to satisfy the group consistency requirement for a set of documents, we require update mechanisms that provide either all new copies of the documents or all old copies, even if an update is initiated while a client request is being processed. The details of the model and the results of this modeling can be found in [7].
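The exact model and its parameterization appear in [7]. As a rough illustration only, the following sketch computes an interference probability under assumptions we introduce here, not taken from [7]: read requests arrive as a Poisson process with rate lambda, an update takes U seconds, a read takes R seconds, and a read interferes with an update if their time intervals overlap. The probability of at least one interfering read is then 1 - exp(-lambda(U + R)); for instance, with lambda = 0.1 requests/sec and U + R = 0.2 s this gives about 2%, the same order as the figure quoted above.

```python
import math

def interference_probability(read_rate, update_time, read_time):
    """Probability that at least one read overlaps an update window.

    Assumes Poisson read arrivals at `read_rate` (requests/sec). A read
    overlaps an update of duration `update_time` if it starts anywhere in
    a window of length update_time + read_time around the update, so the
    expected number of overlapping reads is rate * (update_time + read_time)
    and P(at least one) = 1 - exp(-expected).
    """
    expected_overlaps = read_rate * (update_time + read_time)
    return 1.0 - math.exp(-expected_overlaps)

# Illustrative durations only; they stand in for the measured file-move
# and transfer times used in the paper's model.
for rate in (0.1, 1.0, 10.0):
    print(rate, interference_probability(rate, update_time=0.1, read_time=0.1))
```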
Before going into the details of the scheme, we provide a short description of HTTP servers and of the redirection facility, which is a part of the HTTP specification.
3. HTTP
HTTP is a light-weight application-level protocol for distributed information systems. It is a request/response protocol. Most servers and browsers currently support the HTTP/1.0 version of the protocol; the newer version, HTTP/1.1, is still under standardization by the IETF. In the HTTP protocol, the client sends requests to a server. A request contains a request method and the URL being requested, along with some other information. The server responds with a status code and a message containing the data in the URL requested by the client. If a client wants to request a set of documents/URLs from the same server, in the HTTP/1.0 version of the protocol it has to make a separate TCP connection for each URL. This is remedied in the HTTP/1.1 version of the protocol: HTTP/1.1 specifies that a connection is persistent by default unless explicitly closed by either the server or the client. While the persistent connection is open, the client can ask for any number of documents on the same TCP connection. We use this to provide a form of transaction semantics for a group of documents.
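As an illustration, the following sketch (the host and document names are hypothetical) fetches several documents of a group over one persistent HTTP/1.1 connection; Python's http.client reuses the same TCP connection across requests as long as the server keeps it open.

```python
import http.client

# Hypothetical server and group; any HTTP/1.1 server will do.
conn = http.client.HTTPConnection("www.example.com")
for path in ("/f1.html", "/f2.html", "/f3.html"):
    conn.request("GET", path)          # all requests share one TCP connection
    response = conn.getresponse()
    body = response.read()             # must drain the body before reusing
    print(path, response.status, len(body))
conn.close()
```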
The HTTP protocol supports a class of status codes called redirection codes, which indicate to the client that the client-side browser needs to take another action to fulfill the request. For example, consider a scenario where a document d has been moved from server s1 to server s2. When a client requests document d from server s1, s1 returns a status code which tells the client that the document has moved permanently to another location, and it includes the URL of the new location on server s2 in the reply. The client-side browser can then make a request to s2 using the returned URL. All HTTP servers implement this redirection functionality.
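On the wire, such an exchange looks roughly as follows (status code 301, "Moved Permanently"; the host names are placeholders):

```
GET /d HTTP/1.1
Host: s1.example.com

HTTP/1.1 301 Moved Permanently
Location: http://s2.example.com/d
```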
Most HTTP servers use a configuration file to configure themselves at initialization or whenever a change in the configuration is needed. The configuration file may consist of different directives which configure different aspects of the HTTP server, e.g. the port number for incoming requests, the log file name, and the timeout for persistent connections. We used an HTTP server developed internally at Bell Laboratories for our experiments. This server uses Rewrite File url-pattern replace-pattern rules in its configuration file to translate the url-pattern in a request to the document or directory given by the replace-pattern on the server. The server compares an incoming URL request against each url-pattern in the order in which the rules appear in the configuration file and replaces the matching part with the replace-pattern; the resulting document is then returned to the client. For example, the directive Rewrite File /* /usr/local/public_html/ means that an incoming URL, say http://www.lucent.com/file1.html, will be evaluated to /usr/local/public_html/file1.html. The server also provides the redirection facility of the HTTP protocol: when a Rewrite Redirect url-pattern replace-pattern directive is added to the configuration file, the replace-pattern is the URL of a document on another server, and the client is sent back the URL in the replace-pattern. If a change in the configuration of the server is desired, the configuration file is modified and a hangup signal is sent to the server, which then rereads the configuration file and resets its internal parameters as part of its signal handler routine.
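Putting these directives together, a configuration file for this server might contain fragments like the following (the paths and host names are illustrative; the directive syntax is the one described above for this experimental server):

```
# Serve the document tree from the local file system.
Rewrite File /* /usr/local/public_html/

# Temporarily serve an updated copy in place of the original.
Rewrite File /news/index.html /usr/local/public_html/tmp/index.html

# Redirect requests for a document to a replica on another server.
Rewrite Redirect /quotes/ibm.html http://www2.lucent.com/quotes/ibm.html
```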
Our solution is geared towards servers which use the process-forking model to service requests. The process-forking mechanism works as follows (a minimal sketch follows the list):

1. Initially, when the Web server is started, it reads the contents of the configuration file into its data memory; this configuration information is then used by the server.

2. When a client HTTP connection request is received at the server, the server forks a child process to serve this request, and then goes back to receiving further client requests.

3. When a child process is forked, it is provided with a copy of the data memory, which includes the configuration data. Since the child makes a copy of the data, even if the configuration data is changed in the main server process due to a reread of the configuration file, the copy maintained by the child process is unchanged.
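The following minimal sketch (our own illustration, not the experimental server's code) shows why this model lets old and new configurations coexist: each forked child keeps the configuration that was in memory at fork time, while a SIGHUP makes only the parent reread the file.

```python
import os, signal, socket

CONFIG_FILE = "htd.conf"   # hypothetical path

def load_config():
    """Read the directive lines of the configuration file into memory."""
    with open(CONFIG_FILE) as f:
        return [line.rstrip("\n") for line in f if line.strip()]

config = load_config()

def on_hangup(signum, frame):
    # Only the parent's copy of `config` changes; children forked
    # earlier keep the configuration they inherited at fork time.
    global config
    config = load_config()

signal.signal(signal.SIGHUP, on_hangup)
signal.signal(signal.SIGCHLD, signal.SIG_IGN)   # reap children automatically

server = socket.socket()
server.bind(("", 8080))
server.listen(5)
while True:
    conn, _ = server.accept()
    if os.fork() == 0:            # child serves this connection
        conn.sendall(b"HTTP/1.1 200 OK\r\n\r\n")  # request handling elided
        conn.close()
        os._exit(0)
    conn.close()                  # parent resumes accepting
```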
4. Redirection-based document update
To provide group consistency, one conceptually simple approach would be to copy the original documents onto backups and change the backups. Once all the backups have been changed, an atomic operation should be used to inform the server of the updated documents. We then let the original copies and the updated copies of the documents in the group co-exist for some time: requests over a persistent HTTP connection that was initiated before the atomic update are satisfied with the original copies, and requests over a persistent HTTP connection that was initiated after the update are satisfied with the updated copies. Group consistency is then clearly satisfied, since a read request for a group of documents within a single persistent HTTP connection will get either the old copies of the documents or the new copies. We also need to make sure that once a request is serviced with the updated copies, all subsequent requests will be serviced with updated copies.
We use the process-fork option and the above-mentioned properties to devise a simple scheme, based on the Rewrite directives of the experimental Web server and the redirection facility of HTTP, that ensures consistency within persistent HTTP connections. The goal is to deliver, within a single persistent HTTP connection, either all old copies of the documents in the group or all updated copies. In this scheme, when a group is to be updated, the original documents in the group are not changed; instead, new copies of all the documents in the group are created in a temporary location and these are then updated, giving the updated documents. The configuration file of the server is changed to redirect all subsequent requests for any document in the group to the updated copies. Then the server is made to reread the configuration file, so that subsequent child processes that are forked to serve client requests use the changed configuration information and hence redirect requests to the updated copies. Note that only a few lines of Rewrite directives are added to the configuration file, and hence the reread/restart at the server will be very quick. In the meantime, child processes that were forked off before the server was made to reread the configuration file continue to serve the old copies of the documents until their persistent HTTP connections are closed. After all the persistent HTTP connections supported by child processes forked before the configuration file was reread are closed, the old copies are replaced by the updated copies, the redirection is removed from the configuration file, and the server is made to reread the configuration file. This is done so that the configuration file does not grow continually as more and more updates are performed. Subsequent client requests are then served the new copies of the documents. Using this approach, several versions of a document can be maintained on a server at the same time.
Even in this simple scheme, the details of the update procedure change depending on the temporary as well as the permanent location of the updated copies. In the following subsections, we describe the steps to be followed for the two scenarios outlined below.

1. Same server: In this scenario, the original copies of the group of documents to be updated as well as the temporary location of the updated copies are on the same server. The temporary location of the updated copies could be in the same directory as the original copies or in a different directory; the same solution works for both.

2. Fully replicated document tree: In this scenario, the whole document tree is replicated on more than one server, which means that temporary copies of the documents to be updated need not be explicitly created; the already available replicated copies of the documents on another server are used to perform the update. However, as full replication can be quite expensive, especially if only a few documents are to be changed, this scheme is likely to be of use only when the information is already replicated for performance reasons. Many servers whose information is highly in demand replicate their documents on a number of servers. In such a case, one of the replicated copies of a document can be updated and the updated document can then be propagated to the other copies one copy at a time.

The details of our solution for these two scenarios are discussed below.
4.1. Same server
The first scenario is where the original copies and the updated copies are both on the same server. Assume that we need to update the group of documents D1/f1, D2/f2, D3/f3, ... on a server, where Di represents the directory path of document fi. Assume that the updated copies of these documents have been created; let the updated copies be D'1/f'1, D'2/f'2, D'3/f'3, ..., respectively. The following steps have to be performed for a consistent update (a sketch automating these steps appears after the list):

1. For each document fi, add the following line to the configuration file of the server: Rewrite File /Di/fi /D'i/f'i.

2. Send a signal to the server to reread the configuration file.

3. For each document fi, when all the persistent HTTP connections opened before the signal was sent to the server (in Step 2) are closed, copy the new updated document /D'i/f'i over the old document /Di/fi.

4. In the configuration file, delete the lines added in Step 1, and send another signal to the server to reread the configuration file.

5. When all the persistent HTTP connections accessing documents D'i/f'i are closed, these documents can be deleted.
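The following sketch (our own illustration; the configuration file path, pid file, and drain mechanism are assumptions) automates these five steps for a list of (web path, updated copy) pairs. A real deployment would need a reliable way to detect that pre-signal connections have drained; here we simply wait longer than the server's keepalive timeout.

```python
import os, shutil, signal, time

CONF = "/usr/local/www/htd.conf"      # assumed configuration file path
SERVER_PID = int(open("/usr/local/www/htd.pid").read())  # assumed pid file
DRAIN_SECONDS = 60    # should exceed the server's Keepalive timeout

def signal_reread():
    os.kill(SERVER_PID, signal.SIGHUP)   # server rereads its configuration

def docroot_path(web_path, docroot="/usr/local/www/htdocs"):
    # Map a web path like /greetings.html to its file-system location.
    return os.path.join(docroot, web_path.lstrip("/"))

def consistent_update(pairs):
    """pairs: list of (web_path, updated_absolute_path) tuples."""
    directives = ["Rewrite File %s %s\n" % (web, new) for web, new in pairs]
    # Steps 1-2: direct new connections to the updated copies.
    with open(CONF, "a") as f:
        f.writelines(directives)
    signal_reread()
    # Step 3: wait for connections opened before the signal to close,
    # then overwrite the originals with the updated copies.
    time.sleep(DRAIN_SECONDS)
    for web, new in pairs:
        shutil.copyfile(new, docroot_path(web))
    # Step 4: remove the added directives and signal again.
    with open(CONF) as f:
        lines = [l for l in f if l not in directives]
    with open(CONF, "w") as f:
        f.writelines(lines)
    signal_reread()
    # Step 5: after the remaining connections drain, drop the temporaries.
    time.sleep(DRAIN_SECONDS)
    for _, new in pairs:
        os.remove(new)
```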
As explained earlier, by making use of the fork option of the Web server, the above steps ensure group consistency. Before the server is sent the signal to reread the configuration file, persistent TCP connections have been established between clients and server child processes, with these processes reading the old documents in the group and writing their contents to the TCP sockets. After the signal is sent and the server has read the new configuration file, any new HTTP requests arriving at the server for documents in the group will be serviced by newly created child processes that use the new configuration information and hence provide the updated copies D'i/f'i. As no new requests will be served the old documents, after some time all the ongoing requests will have been serviced and there will be no child processes reading the old documents. At that point, these documents can be written over, as done in Step 3. By the same argument, in the last step the temporary documents can be deleted. Note that the consistency condition is satisfied by the first two steps alone. However, if we stop there, the configuration file will keep getting longer every time a document is updated, and old unused copies of the documents will be retained. Steps 3 through 5 avoid this problem. The steps described above implement the conceptual semantics of a group update: documents in a group are updated in place with new information.
4.2. Fully replicated document tree
In this section, we consider the second scenario, where a document tree is fully replicated on other servers. In such a case, it may be useful to update one of the replicated copies of the document and then copy the updated document to the other replicas one copy at a time, thereby staggering the updates at the different replicas. This scheme uses the Rewrite Redirect directive of the HTTP server instead of the Rewrite File directive used by the previous scheme. The Rewrite Redirect directive allows us to redirect requests to another server, unlike the Rewrite File directive, which can provide substitution only at the file level on the same server.

Assume that a document f in directory D has k copies, one on each of the servers w1, w2, ..., wk. Since these copies already exist, it is assumed that the relative links in each copy have been created with the document's particular server in mind. We need to update all the copies of the document, and we perform a staged update as follows:
for i = 1 to k do

1. Consider document /D/f on server wi, which is to be updated.

2. Add an entry Rewrite Redirect /D/f http://w((i mod k)+1).lucent.com/D/f to the configuration file at server wi, redirecting requests to the next replica in cyclic order. Send a signal to server wi to reread its configuration file.

3. After all existing persistent HTTP connections at server wi which are accessing the group of documents that /D/f belongs to are closed, update document /D/f.

4. Delete the line added to the configuration file of server wi in Step 2; send a signal to server wi to reread its configuration file.

end do
In Step 2, we redirect all requests for /D/f at server wi to its replica on server w((i mod k)+1), i.e. the next replica in cyclic order. In Step 3, we update the document at server wi, and in Step 4 we remove the redirection so that subsequent requests for /D/f at server wi are served the updated document /D/f at server wi.
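A sketch of this staged loop follows, under the same kind of assumptions as the earlier sketch: the host names are illustrative, a fixed drain wait stands in for connection tracking, and the remote-administration hook (the htd-admin command invoked over ssh) is hypothetical.

```python
import subprocess, time

SERVERS = ["www1.lucent.com", "www2.lucent.com", "www3.lucent.com"]
DRAIN_SECONDS = 60   # should exceed each server's Keepalive timeout

def staged_update(doc_path, new_content_file):
    k = len(SERVERS)
    for i, host in enumerate(SERVERS):
        next_host = SERVERS[(i + 1) % k]     # w((i mod k)+1)
        directive = "Rewrite Redirect %s http://%s%s" % (doc_path, next_host, doc_path)
        remote(host, "add_directive", directive)       # Step 2: append + SIGHUP
        time.sleep(DRAIN_SECONDS)                      # Step 3: let readers drain
        remote_copy(new_content_file, host, doc_path)  # install the update
        remote(host, "remove_directive", directive)    # Step 4: remove + SIGHUP

def remote(host, action, directive):
    # Placeholder for a remote-administration hook that edits the
    # server's configuration file and sends it a hangup signal.
    subprocess.run(["ssh", host, "htd-admin", action, directive], check=True)

def remote_copy(local_file, host, doc_path):
    subprocess.run(["scp", local_file, "%s:%s" % (host, doc_path)], check=True)
```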
5. WCP: a tool for consistent updates
WCP (Web-Copy) is a tool that implements the redirection-based solutions discussed above. The tool can be used for updating a group of files in the two scenarios discussed earlier: (a) where the updated copies are located on the same server, or (b) where the document tree is fully replicated.
5.1. Same server
In this case, the WCP command should be invoked as follows. Assume that documents f1, f2, f3, ... need to be updated, and that the updated documents have been created and placed in f'1, f'2, f'3, ..., respectively. Then the user can invoke wcp using the command:

wcp f'1!f1 f'2!f2 f'3!f3 ...

Each document name is expressed as a Unix file path, with the updated copy and the original separated by a "!". The paths can be relative to the current working directory or absolute. For each pair of documents, the wcp utility first evaluates the absolute paths in the file system. It then evaluates the path of the original document relative to the Web directory path. For example, if the HTTP server's home directory is /usr/local/www/htdocs and the original document is /usr/local/www/htdocs/greetings.html, then the original document's path with respect to the Web directory is greetings.html and the document is accessed by the URL http://serveraddress/greetings.html. If the updated copy of the document has an absolute path of /usr/local/www/htdocs/temp/newgreetings.html, then the WCP utility adds the following line to the configuration file: Rewrite File /greetings.html /usr/local/www/htdocs/temp/newgreetings.html. After the server receives the signal to reread its configuration file, for every request that comes in for the URL http://serveraddress/greetings.html, it returns the document /usr/local/www/htdocs/temp/newgreetings.html.
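This path translation can be sketched as follows (the docroot value is the one from the example above; error handling is omitted):

```python
import os

def rewrite_directive(original, updated, docroot="/usr/local/www/htdocs"):
    """Build the Rewrite File line for one updated!original pair."""
    original = os.path.abspath(original)
    updated = os.path.abspath(updated)
    web_path = "/" + os.path.relpath(original, docroot)  # path below docroot
    return "Rewrite File %s %s" % (web_path, updated)

print(rewrite_directive("/usr/local/www/htdocs/greetings.html",
                        "/usr/local/www/htdocs/temp/newgreetings.html"))
# -> Rewrite File /greetings.html /usr/local/www/htdocs/temp/newgreetings.html
```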
5.2. Fully replicated document tree
In this case, the WCP command should be invoked as shown below. Assume that documents f1, f2, f3 on server www1 need to be updated and that copies of these documents are replicated on server www2. Let the replicated copies of the documents on www2 be located at f'1, f'2, f'3, respectively. Then the user can invoke wcp on www1 using the command:

wcp www2:f'1!f1 www2:f'2!f2 www2:f'3!f3

Again, each document name is expressed as a Unix file path. In this case, the path names of the files on www1 can be specified relative to the current working directory or as absolute paths, but the path names of the files on www2 need to be specified as absolute file paths.
5.3. Experiments with WCP
We have tested wcp for both of the above scenarios using the experimental server developed at Bell Laboratories, which allows the static definition of persistent HTTP connections through keepalive parameters in the server configuration file (htd.conf). The syntax for specifying keepalive parameters is the following:

Keepalive on|off [ timeout [ max.requests ]]

where the on|off flag turns the Keepalive specification on or off, timeout is the maximum pause in seconds that is allowed between the end of one request and the start of the next, beyond which a persistent HTTP connection will be closed at the server, and max.requests is the maximum number of document requests that will be served over a persistent HTTP connection before it is closed by the server. For example, if the line Keepalive on 50 5 is included in the configuration file, then when the server reads the configuration file it will set its internal parameters so that an HTTP connection will be kept open until no new request arrives for more than 50 seconds or until 5 documents have been transferred, whichever happens first.
No changes at all were made to the server program. In the basic experiment that we set up, we sent a continuous stream of requests for a document with a group of embedded documents in it; the main document together with the embedded documents forms a group. The keepalive parameter was specified such that a connection is kept open at the server until the number of documents retrieved over it equals the number of embedded documents plus one (the main document). While the requests were being serviced, one or more of the documents in the group were updated. The documents that were retrieved were then compared against the old and the new group of documents. In all cases we found that a request retrieved either the old group of documents or the new group.
5.4. Extra utility in WCP
In this section, we describe an additional scenario where wcp can be used. It is generally known that at a Web site, most document accesses go to a small number of "hot" documents. This means that performance can be improved by replicating just this small set of documents rather than the whole document tree. This idea has been used in the design of geographically distributed document caching schemes [4,5].

Consider the temporary replication of these hot documents on another server to ease the load on the original server. In this case, replication is done not for an update but to increase performance by having more replicas for a temporary period. As the documents are not fully replicated, relative URLs pose a problem and have to be handled. Relative URLs are those that specify the location of a document relative to the main document in which they are embedded. For example, assume that a document file1.html, which is available on server www1 and can be accessed from it using the URL http://www1.lucent.com/file1.html, is getting a large number of requests. The document is replicated on server www2 and some requests for the document are redirected to www2 to balance the load. In file1.html, there may be a link to a document file2.html whose URL is specified relative to the directory where file1.html resides. If file2.html is not a hot document, it may not be replicated on www2. If a client accesses file1.html from www2 and then selects the link to file2.html, the relative URL will be transformed to the complete URL http://www2.lucent.com/file2.html. However, file2.html is not replicated on www2, and hence this access will fail. In this case, such relative URLs have to be handled by redirecting these requests to the server www1. WCP has the capability to handle relative URLs and hence can be used in situations where documents are temporarily replicated to handle load increases. One such use is in the RobustWeb system that we have developed, described in [6]. In RobustWeb, a front-end redirection server redirects HTTP requests probabilistically to one of the back-end servers which has a copy of the document. Documents are statically replicated and distributed. If, in this system, we allow documents to be dynamically replicated and distributed to handle temporary increases in the load on a server, then the above-mentioned relative URL problem has to be handled. We are currently exploring this possibility.
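For instance, using the directives described in Section 3, www2's configuration could send requests for the missing document back to the origin server (host names as in the example above):

```
Rewrite Redirect /file2.html http://www1.lucent.com/file2.html
```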
6. Conclusions
When a group of related documents on a Web server is accessed by clients, it is possible for clients to get inconsistent information if the files are updated without any control at the server. That is, if file updates are uncontrolled, a client may get some files that are old and some that are new; if the group of files is related, this leads to clients getting inconsistent information. In this work, we discuss the problem of supporting consistency on the Web and provide a simple solution to it. We describe a tool called WCP that can be used to update a group of Web documents such that accesses to these documents over a persistent HTTP connection are consistent; that is, a stream of accesses retrieves either all old copies of the documents or all new copies. The tool can be used on its own and requires changes neither to the HTTP protocol nor to the WWW server program. The solution proposed here is deliberately simple, is geared towards a particular category of Web servers, and provides group consistency over a single persistent HTTP connection. We are currently exploring other solutions which will provide consistency in the context of a logical session rather than a physical session.
References
[1] T. Berners-Lee, R. Fielding and H. Frystyk, Hypertext Transfer Protocol HTTP/1.0, HTTP Working Group Informational document, RFC 1945, May 1996, http://www.ics.uci.edu/pub/ietf/http/rfc1945.ps.gz

[2] R. Fielding, J. Gettys, J.C. Mogul, H. Frystyk and T. Berners-Lee, Hypertext Transfer Protocol HTTP/1.1, HTTP Working Group Proposed Standard, RFC 2068, Jan. 1997, http://www.ics.uci.edu/pub/ietf/http/rfc2068.ps.gz

[3] D.M. Kristol and L. Montulli, HTTP State Management Mechanism, HTTP Working Group Proposed Standard, RFC 2109, Feb. 1997, http://www.ics.uci.edu/pub/ietf/http/rfc2109.txt

[4] J. Gwertzman and M. Seltzer, The case for geographical push-caching, in: HotOS'95, 1995, http://www.eecs.harvard.edu/~vino/web/hotos.ps

[5] A. Bestavros, Speculative data dissemination and service to reduce server load, network traffic and service time in distributed information systems, in: Proc. of the International Conference on Data Engineering, March 1996, http://www.cs.bu.edu/~best/res/papers/icde96.ps

[6] B. Narendran, S. Rangarajan and S. Yajnik, Data distribution algorithms for fault-tolerant load balanced Web access, in: Proc. of the IEEE Symposium on Reliable Distributed Systems (SRDS'97), October 1997.

[7] S. Rangarajan, S. Yajnik and P. Jalote, WCP: a tool for consistent on-line update of documents in a WWW server (extended version of this paper), http://www.bell-labs.com/~shalini/www7conf/journal.html
Vitae
Sampath Rangarajan is a member of technical staff in the Distributed Software Research Department at Lucent Technologies Bell Laboratories in Murray Hill, NJ. Prior to joining Bell Laboratories, he was an Assistant Professor in the Department of Electrical and Computer Engineering at Northeastern University in Boston, MA. He received a Ph.D. in Computer Sciences from the University of Texas at Austin in 1990. His research interests are in the areas of fault-tolerant distributed computing, mobile computing and performance analysis.
Shalini Yajnik received her Ph.D. from Princeton University, Princeton, NJ, in 1994 and has since been working as a Member of Technical Staff in the Distributed Software Research Department at Lucent Technologies Bell Laboratories. Her research interests are fault tolerance in distributed systems, CORBA, and fault tolerance and load balancing in WWW servers.
Pankaj Jalote received his Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign in 1985. From 1985 to 1989 he was an Assistant Professor in the Department of Computer Science at the University of Maryland, College Park. Since 1989 he has been at the Indian Institute of Technology Kanpur, where he is now a Professor in the Department of Computer Science and Engineering. Currently he is on a two-year sabbatical with Infosys Technologies Ltd., a leading software company in Bangalore, India, as Vice President. His main areas of interest are software engineering, fault-tolerant computing, and distributed systems. He is the author of two books, An Integrated Approach to Software Engineering (Springer Verlag, 2nd ed., 1997) and Fault Tolerance in Distributed Systems (Prentice Hall, New Jersey, 1994). He is a senior member of the IEEE.