Since February 1993 the National Center for Supercomputing Applications (NCSA, http://www.ncsa.edu) has operated a digital library server accessed by several hundred thousand users worldwide via the NCSA Mosaic software package. NCSA Mosaic can interface with several other information resource discovery service (IRDS) protocols in addition to the WorldWideWeb (www, or http) protocol: it includes support for gopher, wais, ftp, nntp (news), and even telnet and finger. During the past 18 months, since the widespread introduction of NCSA Mosaic on the Internet, usage of the server has doubled roughly every 6-8 weeks, reaching 2.54 million transactions per week in early September 1994.
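To give a sense of what such a doubling rate implies, the following back-of-the-envelope sketch (in Python) projects weekly transaction volume under steady exponential growth. The 2.54 million baseline comes from the measurements above; the 7-week doubling period and the projection horizons are illustrative assumptions, not further measurements.

    def project_weekly_transactions(current=2.54e6, doubling_weeks=7.0, horizon_weeks=52):
        """Project weekly transaction volume assuming steady exponential growth."""
        return current * 2 ** (horizon_weeks / doubling_weeks)

    if __name__ == "__main__":
        # At a 7-week doubling period, one year of unchecked growth implies
        # roughly a 170-fold increase over the current weekly volume.
        for weeks in (13, 26, 52):
            projected = project_weekly_transactions(horizon_weeks=weeks)
            print(f"after {weeks:2d} weeks: {projected:,.0f} transactions/week")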
In the face of this overwhelming growth, NCSA has already implemented a round-robin scheme to distribute the www workload across a cluster of servers. Other major information servers, such as CERN's server and the UIUC Weather Server, are also beginning to suffer under the rapidly increasing load. At least a dozen www server sites are experiencing growth similar to NCSA's, increasing the urgency of a more comprehensive and systematic approach to analyzing both the requirements of server architectures and the impact of this rapid growth on the Internet, today and in the future.
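As an illustration only, the sketch below shows the general round-robin technique: requests are assigned to servers in strict rotation so that load is spread evenly across the cluster. The hostnames are placeholders, and this standalone rotator approximates the general scheme rather than NCSA's production implementation.

    import itertools

    # Placeholder pool; in practice these would be the cluster's hostnames.
    SERVERS = ["server1.example.edu", "server2.example.edu", "server3.example.edu"]
    _rotation = itertools.cycle(SERVERS)

    def next_server():
        """Return the next server in strict rotation, evening out request counts."""
        return next(_rotation)

    if __name__ == "__main__":
        for request_id in range(6):
            print(f"request {request_id} -> {next_server()}")

Round-robin's appeal is its simplicity: it requires no load feedback from the servers, though for the same reason it cannot account for requests of unequal cost.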
Figure 1 shows the growth of traffic from several IRDS applications on the NSFNET backbone, comparing it to the overall growth of NSFNET backbone traffic. Although the figure illustrates the exceptional growth of www traffic, simple metrics such as packet or byte counts reveal neither the burden that www traffic imposes on individual servers nor how long before demand reaches a threshold that the server and/or network cannot support.
A rapid growth in information provisioning has occurred without an overall architecture to address models of information resources and their deployment strategies, e.g., optimized interaction between applications and network protocols, or overall network efficiency considerations such as file caching [1]. As a result, measurements that would indicate performance thresholds are not widely available.
Indeed, tracking www statistics on an Internet-wide scale poses difficulties. A more rational architecture for collecting www statistics at an Internet systems level would help, all the more so given the shift from hypertext to images in information server content, which will accelerate traffic growth and likely change the characteristics of the resulting load. As such services become widespread on the Internet, we will need models of IRDS traffic load and of servers' ability to cope.
Prerequisite to such models is an understanding of the nature of the traffic from these new applications. Characterizing the service impact, including that of different types of server content, will require more complex metrics than simple packet, byte, and transaction counts.
In this paper we offer an example of how web workload characterization can suggest refinements to the web architecture. Specifically, we provide a geographic characterization of queries to a well-known, widely and heavily used server. Several people have written log file analysis programs for web servers [2][3][4][5], but to our knowledge none has broken down the requests by geographic source within the United States.
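To make concrete what such a breakdown involves, here is a minimal sketch of per-source tallying from an httpd access log in Common Log Format. Grouping by the requesting host's top-level domain is only a crude proxy for location; the per-state breakdown within the United States pursued in this paper requires additional registry data not shown here, and the regular expression and sample line are illustrative assumptions.

    import re
    from collections import Counter

    # Common Log Format: host ident authuser [date] "request" status bytes
    LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+')

    def tally_by_tld(log_lines):
        """Count requests by the top-level domain of the requesting host."""
        counts = Counter()
        for line in log_lines:
            match = LOG_LINE.match(line)
            if not match:
                continue  # skip malformed lines
            host = match.group(1)
            # Unresolved clients appear as dotted-quad addresses; bucket them.
            tld = host.rsplit(".", 1)[-1].lower()
            counts["numeric" if tld.isdigit() else tld] += 1
        return counts

    if __name__ == "__main__":
        sample = ['cs.example.edu - - [01/Sep/1994:12:00:00 -0500] '
                  '"GET / HTTP/1.0" 200 3025']
        for tld, n in tally_by_tld(sample).most_common():
            print(f"{tld}: {n}")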