Wide-area file systems provide substantial benefits over other data transfer protocols such as HTTP or FTP. Among these are decreased server and network load, improved latency, and more complete security. These benefits result from wide-area file system support for a global name space, location transparency, client caching, data replication, and access control mechanisms. File systems like DFS [8], AFS [7] and NFS [5] demonstrate, to differing degrees, the viability of the wide-area file system paradigm.
AFS serves as an example of the facilities typically available in a wide-area file system. Using a set of trusted servers, AFS presents to clients a location-transparent, hierarchical name space. Files and directories are cached on the local disks of clients using a consistency mechanism based on callbacks [9].
A volume consists of a set of files and directories located on one server and forms a partial subtree of the shared name space [11]. The distribution of volumes across servers is an administrative decision. Volumes that are frequently read but rarely modified (such as system binaries) may have read-only replicas at multiple servers to enhance availability, to distribute server load, and to obviate the need for callbacks.
AFS uses access control lists for protection. For performance reasons the granularity of protection is an entire directory rather than individual files. Users can be members of protection groups. Access control lists may specify rights for both users and groups. User authentication relies on Kerberos [13].
AFS supports multiple administrative cells, each with its own servers, clients, system administrators and users. Each cell is a completely autonomous environment, but a federation of cells can cooperate in presenting users with a uniform, seamless file name space. At the time of writing this paper, more than 80 organizations around the world are part of the publicly accessible AFS wide-area distributed file system [12] and many others participate in corporate federations.
Wide-area file systems provide a location-transparent file name space. That is, a file or volume can reside on any file server within a cell. This approach has two advantages. The first is transparent file migration. An administrator can migrate a volume from a busy server to an idle server to balance load. The name of a file within the volume does not change when the volume migrates. In contrast, moving a file served by HTTP or FTP to another machine changes the name of the file (and all corresponding URLs) unless the entire name space is migrated to the new machine.
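The migration argument can be illustrated with a minimal sketch. The `VolumeLocationTable` class, the volume and server names, and the path below are hypothetical and do not reproduce the AFS implementation; the sketch only assumes that a client resolves a path to a volume first and to a server second, so that moving a volume changes one table entry and no file name.

```python
class VolumeLocationTable:
    """Hypothetical two-step lookup: path prefix -> volume -> server."""

    def __init__(self):
        self.mounts = {}     # path prefix -> volume name
        self.locations = {}  # volume name -> server currently holding it

    def add_volume(self, prefix, volume, server):
        self.mounts[prefix] = volume
        self.locations[volume] = server

    def migrate(self, volume, new_server):
        # Only the volume's location changes; path names are untouched.
        self.locations[volume] = new_server

    def resolve(self, path):
        # The longest mount prefix that contains the path wins.
        matches = [
            p for p in self.mounts
            if path == p or path.startswith(p.rstrip("/") + "/")
        ]
        if not matches:
            raise KeyError(path)
        volume = self.mounts[max(matches, key=len)]
        return volume, self.locations[volume]


table = VolumeLocationTable()
table.add_volume("/afs/example.org/web", "web.docs", "server-a")

path = "/afs/example.org/web/index.html"
before = table.resolve(path)           # served by the busy server
table.migrate("web.docs", "server-b")  # administrator balances load
after = table.resolve(path)            # same path, different server
```

The same path resolves before and after the migration, which is the property that a URL naming a specific host cannot offer.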
Transparent replication is another advantage. An administrator can replicate a popular volume on several file servers to distribute access. The load on each server is decreased significantly without affecting the name of any file exported to the WWW. Thus, server load and document availability can be improved without changing any document pointers or URLs.
Wide-area file systems use an aggressive file caching policy to reduce the network load and access latency. When a user accesses a file, the wide-area file system first checks the local cache for a copy of the file. With a typical file access, a user has a "working set" of files that remains consistent for a period of time. A cache replacement policy such as "least recently used" assumes that files in the working set will be accessed again in the near future.
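As an illustration of the replacement policy mentioned above, the following is a minimal sketch of a least-recently-used file cache. The class name `LRUFileCache`, the `fetch` callback, and the choice to measure capacity in files rather than bytes are simplifying assumptions made for this example, not details of any particular wide-area file system.

```python
from collections import OrderedDict


class LRUFileCache:
    """Hypothetical client cache that evicts the least recently used file."""

    def __init__(self, capacity, fetch):
        self.capacity = capacity      # maximum number of cached files
        self.fetch = fetch            # called on a miss to contact the server
        self.entries = OrderedDict()  # ordered from least to most recently used
        self.hits = 0
        self.misses = 0

    def read(self, name):
        if name in self.entries:
            # Hit: mark the file as most recently used.
            self.entries.move_to_end(name)
            self.hits += 1
            return self.entries[name]
        # Miss: fetch from the server and cache the result.
        self.misses += 1
        data = self.fetch(name)
        self.entries[name] = data
        if len(self.entries) > self.capacity:
            # Evict the least recently used file.
            self.entries.popitem(last=False)
        return data


cache = LRUFileCache(capacity=2, fetch=lambda name: f"contents of {name}")
for name in ["a", "b", "a", "c", "b"]:
    cache.read(name)
```

With a capacity of two files, the second read of `a` is served locally, while the read of `c` evicts `b` and forces the final read of `b` back to the server. A working set larger than the cache therefore loses most of the benefit.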
The access patterns associated with information browsing exhibit less temporal locality than traditional file access patterns. Information access exhibits many read-once patterns; that is, documents are accessed once and discarded. Therefore, the benefits of caching documents during information access are smaller than with typical file access.
The results of several studies of server and client traces [4, 6] confirm this observation. The percentage of duplicate requests coming from the same user or from a group of users belonging to the same organization falls in the 20-60% range. In contrast, our analysis of WWW server traces collected at the School of Computer Science, Carnegie Mellon University reveals the existence of a relatively small "hot set" of documents that absorbs most of the references. Therefore, a good caching strategy for information access should exploit prefetching based on usage statistics. We further explore this idea in Section 4.
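One way to turn usage statistics into a prefetch list is sketched below. The function `hot_set`, the coverage threshold, and the request trace are hypothetical and chosen only for illustration; this is not the strategy developed in Section 4, merely a sketch of selecting the most-referenced documents until they account for a target share of all references.

```python
from collections import Counter


def hot_set(trace, coverage_percent):
    """Return the most-referenced documents that together cover at least
    `coverage_percent` of the references in `trace`."""
    counts = Counter(trace)
    total = len(trace)
    chosen, covered = [], 0
    for doc, n in counts.most_common():
        # Stop once the chosen documents cover the requested share.
        if covered * 100 >= coverage_percent * total:
            break
        chosen.append(doc)
        covered += n
    return chosen


# Made-up trace: ten requests, dominated by two documents.
trace = ["/index.html"] * 6 + ["/people.html"] * 2 + ["/a.html", "/b.html"]
prefetch_list = hot_set(trace, coverage_percent=80)
```

In this made-up trace, two of the four documents account for 80% of the references, so a client that prefetches only those two would satisfy most requests locally.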
Undoubtedly, the current design of the WWW can be extended to provide secure information sharing. However, existing wide-area file systems already provide adequate security and detailed specification of access control rights.
Security in a wide-area file system is founded on an authentication mechanism and secure RPC between servers and clients. While all participating sites have to agree on the common protection and authorization model, each site has full control in implementing individual security policies.
In a wide-area file system, every file or directory has an access control list that specifies the access rights. An access control list is a set of pairs; the first item in each pair is the name of a user or a group, and the second is the set of rights granted to that user or group. Users are allowed to create new groups and also to specify negative rights. This authorization model allows fine-grained specification of access control rights for every user and every part of the wide-area file system.
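The pair-based model described above can be sketched as follows. The function `effective_rights`, the right names `read` and `write`, and the user and group names are hypothetical. The sketch also assumes, as a simplification, that negative rights are subtracted from the union of all rights granted to the user directly or through group membership.

```python
def effective_rights(user, groups, positive_acl, negative_acl):
    """Compute a user's rights from (name, rights) pairs.

    groups:       group name -> set of member user names
    positive_acl: list of (user or group name, set of granted rights)
    negative_acl: list of (user or group name, set of denied rights)
    """
    # The user matches entries for their own name and for each of their groups.
    principals = {user} | {g for g, members in groups.items() if user in members}

    granted, denied = set(), set()
    for name, rights in positive_acl:
        if name in principals:
            granted |= set(rights)
    for name, rights in negative_acl:
        if name in principals:
            denied |= set(rights)

    # Assumption: a negative right overrides any matching positive right.
    return granted - denied


groups = {"staff": {"alice", "bob"}}
positive_acl = [("staff", {"read", "write"}), ("carol", {"read"})]
negative_acl = [("bob", {"write"})]

alice = effective_rights("alice", groups, positive_acl, negative_acl)
bob = effective_rights("bob", groups, positive_acl, negative_acl)
dave = effective_rights("dave", groups, positive_acl, negative_acl)
```

Here the negative entry removes write access from one member of `staff` without restructuring the group, which is the kind of exception that positive rights alone express awkwardly.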
Wide-area file system transfer rates depend on the speed of the underlying data transfer protocol and the overhead in maintaining connections between servers and clients. Table 1 presents the results of measuring file transfer rates for AFS and HTTP. Our measurements were performed for both cold and warm cache cases and on different file sizes.
The table presents the transfer rates in bytes per second for AFS and HTTP. Files reside on an IBM RS/6000 server running AFS v3.2 and NCSA HTTPD v1.3. Traffic to the client, an IBM RS/6000 running AFS v3.3, must pass through three 10Mbps Ethernet segments. We believe that this configuration is representative of other HTTP sites.
Clearly, AFS performs better than HTTP for both cold and warm caches. The AFS transfer protocol, Rx, requires less state to establish a connection and less message buffering. In addition, it uses network bandwidth more aggressively than HTTP. As a result, Rx efficiently transfers files with lengths between 1K and 100K bytes. According to information collected from the Carnegie Mellon University Web server, 77% of all transfers fall into this range.
                 Cold Cache        Warm Cache
File Size        AFS     HTTP      AFS     HTTP
100                5        1       13        1
1K                14        5      148        9
10K              102       34    1,364       60
100K             205      107    5,268      248
1M               232      163    6,630      370

Table 1: File Transfer Performance in Bytes per Second
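The relative advantage of AFS in Table 1 can be made explicit by dividing each AFS rate by the corresponding HTTP rate. The short sketch below does only that arithmetic; the figures are copied from the table, while the names `TABLE_1` and `afs_speedup` are introduced here for illustration.

```python
# Transfer rates from Table 1:
# file size -> (cold AFS, cold HTTP, warm AFS, warm HTTP)
TABLE_1 = {
    "100": (5, 1, 13, 1),
    "1K": (14, 5, 148, 9),
    "10K": (102, 34, 1364, 60),
    "100K": (205, 107, 5268, 248),
    "1M": (232, 163, 6630, 370),
}


def afs_speedup(row):
    """Return (cold, warm) ratios of the AFS rate to the HTTP rate."""
    cold_afs, cold_http, warm_afs, warm_http = row
    return cold_afs / cold_http, warm_afs / warm_http


speedups = {size: afs_speedup(row) for size, row in TABLE_1.items()}
for size, (cold, warm) in speedups.items():
    print(f"{size:>5}  cold {cold:5.1f}x  warm {warm:5.1f}x")
```

Every ratio exceeds one, and the warm-cache ratios are much larger than the cold-cache ratios because a warm AFS cache serves the file from the client's local disk.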