This paper describes MOWS, a distributed Web and cache server. MOWS is written in Java and built from modules that can be loaded locally or remotely. These modules implement various features of Web and cache servers and enable MOWS to run as a cluster of distributed Web servers. In addition to its distributed nature, MOWS can integrate external services using its own external interface. Java programs conforming to this interface can be loaded locally or remotely and executed at the server. The resulting system will potentially provide effective Web access by both utilizing commonly available computing resources and offering distributed server functionality. The design considerations and system architecture of MOWS are presented, followed by several applications that show its benefits.
The use of the World Wide Web (WWW) [1] has been growing at an enormous rate since its introduction. The WWW is used not just at academic or research institutions but also at business corporations and government institutions. In recent years, the massive explosion of WWW usage has created several performance problems. Servers may often be overloaded with too many requests and the network may be congested with too much traffic. One way to solve these performance problems is to use multi-threaded and distributed techniques. Multi-threaded techniques offer efficient connection handling for servers, and distributed techniques offer effective use of scattered resources. Furthermore, distributed techniques offer more than just a performance benefit. The WWW was originally used for distributing static documents or interfacing with existing external information systems. As WWW technology evolves, there is growing interest in developing new distributed web applications [2, 3, 4, 5] that do more than just deliver documents.
MOWS is a distributed web and cache server written in Java. It was developed as a portable web and cache server that can act as a cooperating server. It employs a set of modules whose basic structure is designed after that of Roxen [6]. These modules implement various server functions, and new features can be easily added by extending one of the existing modules. Modules can be dynamically loaded from the server's local system or from a remote system. Among the features that are supported by these modules are CGI, HTML file filter, image map, memory cache, proxy, disk cache, and redirection. Two modules of particular interest in the context of distributed computing are the HTTP proxy module and the Java MOWS extension directory module (JME). The HTTPProxy module can communicate with other MOWS and Harvest cache servers for cooperative caching using the HTTP [7] and ICP [8] protocols. The JMEDirectory module does for JME programs what the cgi-bin directory does for CGI scripts [9]. There are, however, two significant differences. First, the target directory can be either local or remote. When the directory is local, the requested code is fetched from the local system and executed locally. When it is remote, the code is fetched from the remote system and subsequently executed locally. Second, the code is written in Java and it is executed by a thread rather than a process. The CGI variables are made available to JME programs to make the translation of traditional CGI scripts to JME programs simple. With JME, one can write a stateful gateway to an external program, or create a program that is transferred to another server as code and executed there, with its output sent to the client. In this way, a distributed web and cache system can be formed by several cooperating MOWS servers.
The resulting system will potentially provide more efficient access to web resources by utilizing commonly available computing resources. Trends in high performance computing favor clusters of workstations and PCs equipped with commodity processors [10]. They can utilize distributed resources efficiently and offer the highest performance-to-cost ratio. These heterogeneous computing resources are placed into a uniform platform by Java, a simple, multi-threaded, object-oriented, architecture neutral language [11]. MOWS can run on these system resources as distributed web and cache servers. The performance of the current implementation is limited by its non-optimized code. Furthermore, the interpretive execution of the Java byte code instructions introduces some penalty. This situation is expected to change when Java "just in time" compilers become available. The current goal of MOWS is to prove the basic concept of a portable extensible server in Java and to experiment with distributed web and cache server techniques.
The following list summarizes the characteristics of MOWS:
There are several high performance web and cache servers currently on the market. The Netscape server [12, 13], Roxen [6], and Apache [14, 15] do not fork for each connection and thus provide much higher throughput than the original CERN server. These servers are organized in modules that conform to proprietary Application Programming Interfaces (APIs). Modules are extensible and users may extend the functionality of their server by adding a new module. However, most of these systems are architecture dependent and usually run on Unix-based systems. This architecture dependency not only makes software installation and distribution complex but also makes dynamic code loading from a remote system impractical.
These servers can also be used as proxy cache servers. Several cache servers may be cascaded to form a hierarchical cache structure. However, each server in the hierarchy acts as an individual cache and does not cooperate with other caches. In contrast, the Harvest object cache [16] is a distributed cache system, in which cache servers are organized in a hierarchical structure with siblings and parents. Each server in the hierarchy can cooperate with its siblings and parents. When a cache miss occurs, the server asks its siblings and parents for the requested file. If the file is available at one of these servers, it can be fetched by the requesting server. If the file is not available, the server will fetch the file from a parent or from the original host if there is no parent. In this way, each server in the hierarchy acts as part of a large distributed cache system. Network resources are utilized by distributing cache resources at many locations. The Harvest object cache runs on most Unix systems. However, the architecture dependency introduces the drawbacks mentioned earlier.
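The miss-handling rule described above can be sketched in a few lines of Java. This is a minimal illustration only, not code from Harvest or MOWS: the class `CacheResolver`, the `Peer` type, and the in-memory set that stands in for an ICP query are all hypothetical, and the order in which peers are asked (siblings first, then parents) is an assumption.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of where a cache server fetches an object after a local miss. */
final class CacheResolver {
    /** A cooperating cache; the set of URLs stands in for the answer to an ICP query. */
    static final class Peer {
        final String name;
        private final Set<String> cached;

        Peer(String name, String... cachedUrls) {
            this.name = name;
            this.cached = new HashSet<>(Arrays.asList(cachedUrls));
        }

        boolean has(String url) { return cached.contains(url); }
    }

    /** Returns the name of the server to fetch from, or "origin" for the original host. */
    static String chooseSource(String url, List<Peer> siblings, List<Peer> parents) {
        // Step 1: ask siblings and parents whether any of them holds the object.
        for (Peer p : siblings) {
            if (p.has(url)) return p.name;
        }
        for (Peer p : parents) {
            if (p.has(url)) return p.name;
        }
        // Step 2: nobody has it. Fetch through a parent if one exists,
        // otherwise go to the original host directly.
        return parents.isEmpty() ? "origin" : parents.get(0).name;
    }
}
```

Note that a sibling is only used when it already holds the object, whereas a parent is also used to resolve a miss on behalf of its child.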
Since its introduction, the Java language [11] has received substantial interest as a platform for portable, multi-threaded, distributed applications. Programs written in Java can run on any system that supports the Java Virtual Machine (JVM). Currently, Java programs are interpreted by the Java interpreter and run on average an order of magnitude slower than their compiled C counterparts. However, Sun claims that once Java "just in time" compilers become available, the compiled code will be as fast as its native C version.
Jigsaw [17] is an object-oriented server written in Java. It supports the HTTP/1.0 and 1.1 protocols [7]. It uses modules to implement various server functions including authentication filtering and simple proxying. Several interesting ideas have been tested in the development of Jigsaw. Extensive data caching and thread pooling techniques are used to produce good performance. However, it supports neither distributed caching nor remote loading of modules or code.
The following section of this paper describes the system architecture of MOWS. Section 3 gives a performance analysis and discusses advantages and disadvantages of the current system. Several applications are described in Section 4. Section 5 concludes and gives future plans.
The server functions of MOWS are implemented through its modules. The basic organization of modules follows the design of Roxen [6]. The Roxen server, originally known as the Spinner server, consists of a set of modules, each responsible for performing some server function. Modules may be extended, replaced, or omitted to configure the server's behavior. These modules are grouped into a set of basic types, each of which characterizes a basic feature provided by a typical web server. MOWS also uses modules that are organized in groups. Each of these modules has a basic module type that characterizes its use. When the server tries to perform a task that needs to be handled by a certain type of module, it looks among its loaded modules for an appropriate module of that type. This module then performs the task directly or by using several other types of modules.
More precisely, MOWS handles each HTTP request in three phases. The first phase allows the request to be preprocessed by matching the request to a list of patterns. The module with a pattern that matches the request is used to process the request. In this phase, the Redirection module or its subclasses may be used. One may write a conditional or probabilistic mapping module by extending the Redirection module. When no matching module is found, the request is passed to the second phase. In the second phase, the requested URL is matched against a list of mount points and the module with the matched mount point is used to process the request. Each module used in matching is a subclass of the Location module. Among the modules that belong to this type are the FileSystem, CGIDirectory, JMEDirectory, and HTTPProxy modules. For each HTTP request, exactly one of these modules is chosen and this module may use other modules of other types to process the request. Modules that are used in this way include subclasses of the ContentType, Directory, Extension, MemoryCache, DiskCache, and UserDatabase modules. Finally, after the request is processed, it is passed to the third phase for any post processing. The Logging module may be used in this phase to write access logs to a file.
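The three phases can be pictured with the following minimal sketch. It is not the MOWS source: the class `ThreePhaseDispatcher` and its methods are hypothetical, handlers are reduced to functions from a request path to a response string, and choosing the longest matching mount point in the second phase is an assumption (the text only states that exactly one module is chosen).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Illustrative three-phase dispatcher; all names are hypothetical, not MOWS classes. */
final class ThreePhaseDispatcher {
    /** Phase 2 entry: a mount point plus the handler serving paths below it. */
    private static final class Location {
        final String mount;
        final Function<String, String> handler;

        Location(String mount, Function<String, String> handler) {
            this.mount = mount;
            this.handler = handler;
        }
    }

    private final List<Function<String, String>> redirectors = new ArrayList<>();
    private final List<Location> locations = new ArrayList<>();
    private final List<String> accessLog = new ArrayList<>();

    /** A redirector returns a response if its pattern matches, or null otherwise. */
    void addRedirector(Function<String, String> redirector) { redirectors.add(redirector); }

    void addLocation(String mount, Function<String, String> handler) {
        locations.add(new Location(mount, handler));
    }

    List<String> accessLog() { return accessLog; }

    String dispatch(String path) {
        String response = null;
        // Phase 1: preprocessing. The first redirector whose pattern matches handles the request.
        for (Function<String, String> r : redirectors) {
            response = r.apply(path);
            if (response != null) break;
        }
        // Phase 2: no redirector matched, so pick one location module by mount point
        // (assumption: the longest matching mount point wins).
        if (response == null) {
            Location best = null;
            for (Location l : locations) {
                if (path.startsWith(l.mount)
                        && (best == null || l.mount.length() > best.mount.length())) {
                    best = l;
                }
            }
            response = (best == null) ? "404 Not Found" : best.handler.apply(path);
        }
        // Phase 3: post-processing, here a simple in-memory access log.
        accessLog.add(path + " -> " + response);
        return response;
    }
}
```

With a redirector for `/old` and locations mounted at `/` and `/cgi-bin`, a request for `/cgi-bin/x` reaches the `/cgi-bin` handler, and every request, redirected or not, ends up in the log.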
Specific modules are defined as subclasses of base modules and support features such as CGI, file system with basic authentication, user file system, image mapping, HTML tag parsing, HTTP proxy, and a Java extension interface.
MOWS is configured in a configuration file. When the server is started, the modules specified in the configuration file are loaded dynamically. Subclasses of the base modules may be loaded from a remote system. Remote modules are specified by module name and remote host name. When loading a module or class remotely, MOWS first searches the local system. If there is a local copy, it is loaded instead of the remote copy. When no local copy is found, MOWS loads the remote copy using the HTTP protocol. When recursively loading classes that are referenced in the remote class definition, the same loading rule is used.
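The local-first loading rule maps naturally onto Java's standard class loader delegation, as the following sketch suggests. It is an assumption about how such a loader could be written, not the MOWS implementation: `RemoteModuleLoader` and `classUrl` are hypothetical names, and the URL layout (package name translated into a path below the server root) is assumed.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;

/** Hypothetical loader: local classes first, then a fetch over HTTP. */
final class RemoteModuleLoader extends ClassLoader {
    private final String baseUrl;

    RemoteModuleLoader(String baseUrl, ClassLoader parent) {
        // loadClass() asks the parent (the local class path) before findClass(),
        // so a local copy always takes precedence over the remote one.
        super(parent);
        this.baseUrl = baseUrl.endsWith("/") ? baseUrl : baseUrl + "/";
    }

    /** Assumed layout: "a.b.C" is served as baseUrl + "a/b/C.class". */
    static String classUrl(String baseUrl, String className) {
        return baseUrl + className.replace('.', '/') + ".class";
    }

    /** Called only when no local copy exists; fetches the class bytes over HTTP. */
    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        try (InputStream in = URI.create(classUrl(baseUrl, name)).toURL().openStream()) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            byte[] bytes = buf.toByteArray();
            return defineClass(name, bytes, 0, bytes.length);
        } catch (IOException e) {
            throw new ClassNotFoundException(name, e);
        }
    }
}
```

Because classes referenced by a remotely defined class are resolved through the same loader, the rule applies recursively without extra code. The sketch performs no authentication or verification of the fetched code.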
Remote loading of modules offers the advantage of automatic software updating and distribution. MOWS may thus be installed on a large number of computers without a costly software installation and updating process, just as Java applets spare browser clients that process. In the current implementation, security issues are not seriously considered. One approach to increase security is to use authentication for trusted code.
Table 1 lists the existing modules with brief descriptions.
Table 1. List of existing loadable modules
Module Name | Type | Use |
---|---|---|
CGIDirectory | Location | treats files in the specified directory as CGI programs |
CGIExtension | Extension | treats files with the specified extensions as CGI programs |
ContentType | ContentType | determines the content types of requested resources |
Directory | Directory | enables browsing of directory lists |
DiskCache | DiskCache | caches proxied files on disk |
FileSystem | Location | gets the requested file |
HTMLFilter | Filter | parses additional tags in HTML files |
HTTPProxy | Location | proxies for clients to retrieve remote files |
JMEDirectory | Location | treats files in the specified directory as JME programs |
Logging | Logging | logs the access information for requests |
MapExtension | Extension | treats files with the specified extensions as image map files |
MemoryCache | MemoryCache | caches local files in memory |
Redirection | Redirection | redirects requests in the preprocessing phase |
UserDatabase | UserDatabase | manages user information |
UserFileSystem | Location | maps the matching requests to their user's file system |
It is beyond the scope of this paper to describe all the modules in detail. Instead, we focus on the HTTPProxy and the JMEDirectory modules. Other modules are described in the online document available at the MOWS site.
The HTTPProxy module implements a proxy that can cooperate with other MOWS servers and Harvest object caches using the HTTP [7] and ICP [8] protocols. Each cooperating server can be a parent or sibling. A distributed hierarchical cache structure can be built on any system capable of running Java programs. Many servers, each capable of taking only a few clients, can potentially offer efficient access to files by cooperating with other distributed servers. Each server may have its own local HTTPProxy module or can load the HTTPProxy module from a remote module server. Although remote module loading is currently done directly from the remote server, modules may be loaded from the server's sibling or parent when security issues are of less concern.
The HTTPProxy module may be used to dynamically mirror another web server. Usually the mount parameter of this module is set to http: and the path parameter empty. In this case, a proxy request of the form http://host/path will be handled as a request to host for file /path. By setting the path parameter appropriately, the request can be mapped to a specific host. In particular, to mirror site http://original-host/ at host copy-host, its mount parameter can be set to / and its path parameter to //original-host. A request of the form /path to host copy-host will then be mapped to http://original-host/path. In this way, the copy-host server becomes identical to the original-host server for clients.
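The two mappings described above follow one rule, sketched below under stated assumptions. The class `ProxyMapping` and the method `mapToOrigin` are hypothetical; the rule itself (strip the mount prefix, then prepend http: and the path parameter, inserting a slash where needed) is inferred from the two examples in the text rather than taken from the MOWS source.

```java
/** Hypothetical sketch of how mount and path could map a request to an origin URL. */
final class ProxyMapping {
    /**
     * Returns the origin URL for a request, or null if the request is not below the mount point.
     */
    static String mapToOrigin(String mount, String path, String request) {
        if (!request.startsWith(mount)) {
            return null; // another module is responsible for this request
        }
        // The part of the request that follows the mount point.
        String rest = request.substring(mount.length());
        // Assumption: when a host is given in the path parameter, the remainder
        // must start with a slash to form a valid URL.
        if (!path.isEmpty() && !rest.startsWith("/")) {
            rest = "/" + rest;
        }
        return "http:" + path + rest;
    }
}
```

For the plain proxy case, `mapToOrigin("http:", "", "http://host/path")` yields `http://host/path`; for the mirror case, `mapToOrigin("/", "//original-host", "/path")` yields `http://original-host/path`.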
The JMEDirectory module allows simple gateway programs to be integrated with MOWS. Programs are extended from the JME class that is a piped thread object. It provides the input and output streams to read from and to write to the client connection, a hash table storing variables similar to those used in CGI, and an empty run method. Programmers just need to write their own run method. For example, a JME version of HelloWorld may be written as:
    import mows.util.JME;
    import mows.util.RuntimeEnvironment;

    public class HelloWorld extends JME {
        public void run() {
            out.print("Content-type: text/plain\n\nHello, world!\n");
            out.close();
        }
    }
Users may implement a stateful gateway using a JME program that employs a static hash table to store the context of each session. They may use any Java-based distributed object computing tools such as RMI [18] or HORB [19] to add distributed object technology [20] to JME. If compatibility with CGI is of concern, users may use a Java wrapper program to encapsulate their JME programs. The resulting Java stand-alone programs can be executed as CGI programs on systems that do not support JME.
The two main parameters for the configuration of the JMEDirectory are the mount and path parameters. The mount parameter represents the mount point of the module in the URL space and the path parameter represents the actual location of the directory where JME programs reside. This directory can be local or remote. For a local directory, the path name must be used. For a remote directory, a full name of the form //remote-jme-host/jme-path must be used. For security reasons, when the directory is remote the path parameter must be explicitly specified to avoid loading of JME programs from an arbitrary host at the client's will. When a remote path is not trustworthy, it should not be used. When loading a remote JME program, any classes appearing in the program are first searched in the server's local system. If they are not found, they are loaded from the remote host. The advantages and security concerns mentioned earlier for remote module loading also apply for remote JME code loading.
The configuration of MOWS is specified in a configuration file. An example is shown below:
    (MOWSCenter 1.0
      (Host mows.rz.uni-mannheim.de)
      (MOWS main
        (Port 8080)
        (LoggingModule mainlog (Log log/main.log))
        (FileSystemModule fs1
          (Mount /)
          (Path /users/mows/www/data)
          (HTAccess On)
          (Welcome index.html welcome.html))
        (FileSystemModule fs2
          (Mount /icons)
          (Path /users/mows/www/icons))
        (CGIDirectoryModule cgi1
          (Mount /cgi-bin)
          (Path /users/mows/www/cgi-bin))
        (JMEDirectoryModule jme1
          (Mount /jme-bin)
          (Path /users/mows/www/jme-bin))
        (HTMLFilterModule fil1
          (Extension .html .htm)
          (Size 8000))
        (MapExtensionModule map1 (Extension .map))
        (MemoryCacheModule memcache1
          (Swap 10000000)
          (Size 50000)))
      (MOWS proxy
        (Port 3328)
        (LoggingModule proxylog (Log log/proxy.log))
        (HTTPProxyModule proxy1
          (Port 3330)
          (Mount http:)
          (Parent www-cache.uni-mannheim.de 3128 3130)
          (Sibling trumpf-7.rz.uni-mannheim.de 3328 3330)
          (Sibling trumpf-8.rz.uni-mannheim.de 3328 3330))
        (DiskCacheModule diskcache1
          (Swap 100000000)
          (Size 1000000))))
In this configuration, two MOWS servers are defined: one as a standard web server at port 8080, and the other as a proxy cache server at port 3328. The web server uses two FileSystem modules for the top directory and the icons directory. The cgi-bin directory is mounted by the CGIDirectory module and the jme-bin directory by the JMEDirectory module. The HTMLFilter module enables the translation of extra HTML tags such as <accessed>, <modified>, <date>, etc. into the access count, last modified date, and current date. The MapExtension module enables processing of image map files. The MemoryCache module caches accessed files in memory using LRU replacement.
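LRU replacement of the kind used by the MemoryCache module can be illustrated with the standard library alone. The sketch below is not the MOWS module: the class `LruCache` is hypothetical and bounds the cache by number of entries, whereas the meaning of the Swap and Size parameters in the configuration is not specified here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Minimal LRU cache built on LinkedHashMap's access-order mode. */
final class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        // The third argument (accessOrder = true) moves an entry to the end
        // of the iteration order each time it is read or written.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    /** Invoked by put(); returning true evicts the least recently used entry. */
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```

This class is not thread-safe. A multi-threaded server would have to guard it with a single lock, for example through `Collections.synchronizedMap`, because in access-order mode even `get` modifies the internal list.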
The proxy cache server uses the HTTPProxy module to perform proxying for clients. This module can use the specified hosts as its parent and sibling cache servers. On a local cache miss, the module communicates with these cache servers to determine the location of the requested file. It also listens on port 3330 for cache queries from sibling or child servers. The DiskCache module enables caching of proxied files on disk.
All the modules in the above configuration are loaded from the local file system. In order to load a module from a remote system, one must give the module name with its remote host name. For example, to load the above FileSystem (fs1) module from mows-remote.rz.uni-mannheim.de:8001, one must specify:

    ...
    (//mows-remote.rz.uni-mannheim.de:8001/mows.module.FileSystemModule fs1
      (Mount /)
      (Path /users/mows/www/data)
      (HTAccess On)
      (Welcome index.html welcome.html))
    ...

This local host mows.rz.uni-mannheim.de should not have mows.module.FileSystemModule.class in its file system. Otherwise, this local copy will be loaded instead. At the same time, the remote host mows-remote.rz.uni-mannheim.de must make its mows.module.FileSystemModule.class available for this host. If this file is placed in directory /users/mows/export/mows/module/, the configuration of this remote server must be given as:

    ...
    (MOWS main
      (Port 8001)
      ...
      (FileSystemModule fs1
        (Mount /)
        (Path /users/mows/export))
    ...
As mentioned earlier, the performance of MOWS may be limited by its interpretive execution. Furthermore, the current implementation of MOWS is not optimized for performance. Although it is premature to give a detailed performance analysis, in order to give some ideas of the current implementation and of the expected performance gain for the future version, the result of several retrieval benchmarks using the ptester [21] hitter program is shown in Table 2.
Table 2. Relative performance of several servers
Server | 1a (single) | 1b (single) | 1c (single) | 2a (multiple) | 2b (multiple) | 3a (script) | 3b (script) |
---|---|---|---|---|---|---|---|
Apache 1.1.1 | 66 | 39 | 20 | 67 | 80 | 25 | - |
Roxen 1.0,h | 50 | 36 | 21 | 74 | 92 | 13 | - |
Jigsaw 1.0a | 30 | 27 | 15 | 49 | 33 | - | - |
MOWS 0.9a as web server | 24 | 16 | 15 | 27 | 21 | 9 | 16 |
MOWS 0.9a as cache server | 29 | 19 | 15 | 37 | 33 | - | - |
The number in each entry represents the maximal number of successfully handled requests per second. The first three benchmarks 1a, 1b, and 1c show the results from the repeated retrieval of a single file of size 3k, 8k, and 23k respectively. The next benchmarks 2a and 2b show the results from the repeated retrieval of these three files with two and ten simultaneous hits per file, respectively. The last benchmarks 3a and 3b show the results from repeated execution of a CGI shell script that prints "Hello, world!" and a JME program that does the same. The results were collected on the same hardware configuration for a duration of 60 seconds, with each server running on a dual Pentium PC 133MHz with 96MB RAM under Linux 2.0.10, and the hitter program running on a Sparcstation 20 with 64MB RAM under Solaris 2.4. The two machines are connected by a local Ethernet.
As seen from the table, the performance of the current implementation of MOWS is reasonable. The interpretive execution in Java may not be a bottleneck for ordinary use, as was also shown in the analysis of Jigsaw [22]. The current implementation demonstrates a prototype of MOWS in a very straightforward way. Its performance should increase when frequently used variables are aggressively cached in a hash table. In addition, the current memory module implements an LRU algorithm using a lock on the single LRU list. This may become a bottleneck at high load because synchronization is very expensive. Furthermore, the characteristics of web access differ from those of ordinary file systems. An asynchronous replacement policy using an access counter with a decay factor to approximate the working set may be more effective for web document caching in memory.
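One possible reading of the proposed counter-based policy is sketched below. It is an illustration of the idea only, not an implemented part of MOWS: the class `DecayingCounterPolicy`, its method names, and the choice to evict the entry with the lowest score are all assumptions.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical replacement policy: per-document access counters that decay over time. */
final class DecayingCounterPolicy {
    private final Map<String, Double> scores = new ConcurrentHashMap<>();
    private final double decay;

    /** decay is a factor between 0 and 1 applied to every counter at each decay step. */
    DecayingCounterPolicy(double decay) {
        this.decay = decay;
    }

    /** Called on every cache hit; updates one counter without touching a shared list. */
    void recordAccess(String key) {
        scores.merge(key, 1.0, Double::sum);
    }

    /** Intended to run periodically in a background thread, off the request path. */
    void decayAll() {
        scores.replaceAll((key, score) -> score * decay);
    }

    double score(String key) {
        return scores.getOrDefault(key, 0.0);
    }

    /** Removes and returns the key with the lowest score, or null if nothing is tracked. */
    String evictColdest() {
        String coldest = null;
        double min = Double.MAX_VALUE;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (e.getValue() < min) {
                min = e.getValue();
                coldest = e.getKey();
            }
        }
        if (coldest != null) {
            scores.remove(coldest);
        }
        return coldest;
    }
}
```

Compared with a single locked LRU list, a hit here touches only one counter, and the comparatively expensive decay and eviction scans can run asynchronously. Documents that were popular long ago lose their score gradually, which is how the counters approximate a working set.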
The inefficiency of MOWS at very high load comes from the fact that several shared resources are accessed inefficiently by multiple threads and from its simple connection handling. The higher performance of MOWS in its cache server mode than in its web server mode is due to its smaller overhead in retrieving files from the DiskCache module than from the FileSystem module.
A comparison between the performance of JME and that of CGI on MOWS (16 versus 9 requests per second in benchmarks 3b and 3a) shows that JME provides a simple and efficient alternative to CGI.
A number of interesting applications are possible with MOWS. In the following section, several applications are described that utilize MOWS functionality described earlier.
The simplest use of MOWS is for a plain web or cache server, as shown in Figure 1. Modules may be loaded from the local system or from a remote server to simplify the process of distributing and updating module software.
Figure 1. Standard web and cache server with local and remote module loading.
MOWS can be used to cooperatively mirror a single server at several locations, or to simulate a distributed server with some fault tolerance, as shown in Figure 2.
The mirroring servers may communicate with each other using the ICP and HTTP protocols to cooperatively retrieve files from the original server. When a modified DNS or another IP mapping mechanism is used to map a single host name randomly to one of these live servers, the load on the original server decreases. Some fault tolerance can also be achieved, because proxy servers may be set to ignore the HTTP header Pragma: no-proxy when the original server is down. If the contacted proxy server does not have a copy of the requested file, it can ask other proxy servers for a copy.
Because proxy servers need high throughput, a MOWS proxy cache may be able to serve only a few clients. However, this drawback may be traded for the possibility of running a MOWS proxy on every Java-enabled system, as shown in Figure 3. Each proxy will take only a small number of clients whose access patterns may be characterized more easily than those of a larger unknown population. Although the proxy cache cooperation structure is currently fixed in its configuration file, it may be made adaptive to both the access patterns and the hit statistics of its parent and sibling servers. If there are a number of small proxy servers with stable access patterns, they may be clustered together adaptively to form a more efficient sibling structure.
A JME program can be used to implement a stateful gateway in a simple way, as shown in Figure 4. Each new client will be assigned an identification key for its session. A JME program can store the state information of each client in a static table and use the client's key to retrieve its state or any associated session resources.
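The static-table technique can be sketched as follows. To stay self-contained, the example does not extend the JME class; `SessionGateway` and its methods are hypothetical, and in a real JME program this logic would sit inside the run method. The per-session state is reduced to a visit counter, and the sequential keys are for illustration only, since real session keys should be hard to guess.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stateful gateway: session state lives in a static table. */
final class SessionGateway {
    // Static, so the table outlives the thread that handles any single request.
    private static final Map<String, Integer> VISITS = new ConcurrentHashMap<>();
    private static final AtomicLong NEXT_ID = new AtomicLong(1);

    /** Assigns an identification key to a new client and creates its state. */
    static String newSession() {
        String key = "s" + NEXT_ID.getAndIncrement();
        VISITS.put(key, 0);
        return key;
    }

    /** Looks up the client's state by key, updates it, and returns the new visit count. */
    static int visit(String key) {
        // merge() is atomic per key, so concurrent requests of one session are counted correctly.
        return VISITS.merge(key, 1, Integer::sum);
    }
}
```

Because the server executes each request in a thread of the same Java virtual machine, the static table is shared by all requests, which a CGI process started per request cannot offer without external storage.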
Another use of JME is to let several servers remotely load JME programs for the optimal use of computing resources. A library of JME programs can be kept at one central host and clients may choose an appropriate server for the execution of particular code. This feature also allows proxy servers to participate in computing using their own code or designated remote code rather than simply forwarding requests and data.
This paper presented a distributed web and cache server called MOWS. The system architecture, basic performance, and applications of MOWS were described. The current implementation is available as an alpha release from the MOWS site "http://mows.rz.uni-mannheim.de/mows/". This site also provides additional information on MOWS.
Work is currently under way in the following areas:
The author thanks Heinz Kredel, Anas Nashif, and Chris Connelly for their comments and suggestions.