Since February 1993 the National Center for Supercomputing Applications (NCSA, http://www.ncsa.edu) has operated a digital library server accessed by several hundred thousand users worldwide via the NCSA Mosaic software package. NCSA Mosaic can interface with several other information resource discovery service (IRDS) protocols in addition to the WorldWideWeb (www, or http) protocol: it includes support for gopher, wais, ftp, nntp (news), and even telnet and finger. During the past 18 months, since the widespread introduction of NCSA Mosaic on the Internet, usage of the server has doubled roughly every 6-8 weeks, reaching 2.54 million transactions per week in early September 1994.
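To give a sense of what such a doubling rate implies, the following back-of-the-envelope sketch (in Python) projects weekly transaction volume under steady exponential growth. The 2.54 million baseline comes from the measurements above; the 7-week doubling period and the projection horizons are illustrative assumptions, not further measurements.

    def project_weekly_transactions(current=2.54e6, doubling_weeks=7.0, horizon_weeks=52):
        """Project weekly transaction volume assuming steady exponential growth."""
        return current * 2 ** (horizon_weeks / doubling_weeks)

    if __name__ == "__main__":
        # At a 7-week doubling period, one year of unchecked growth implies
        # roughly a 170-fold increase over the current weekly volume.
        for weeks in (13, 26, 52):
            projected = project_weekly_transactions(horizon_weeks=weeks)
            print(f"after {weeks:2d} weeks: {projected:,.0f} transactions/week")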
In the face of this overwhelming growth, NCSA has already implemented a round-robin scheme to distribute the www workload across a cluster of servers. Other major information servers, such as CERN's server and the UIUC Weather Server, are also beginning to suffer under the rapidly increasing load. At least a dozen www server sites are experiencing growth similar to NCSA's, increasing the urgency of a more comprehensive and systematic approach to analyzing both the requirements of server architectures and the impact of this rapid growth on the Internet, today and in the future.
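As an illustration only, the sketch below shows the general round-robin technique: requests are assigned to servers in strict rotation so that load is spread evenly across the cluster. The hostnames are placeholders, and this standalone rotator approximates the general scheme rather than NCSA's production implementation.

    import itertools

    # Placeholder pool; in practice these would be the cluster's hostnames.
    SERVERS = ["server1.example.edu", "server2.example.edu", "server3.example.edu"]
    _rotation = itertools.cycle(SERVERS)

    def next_server():
        """Return the next server in strict rotation, evening out request counts."""
        return next(_rotation)

    if __name__ == "__main__":
        for request_id in range(6):
            print(f"request {request_id} -> {next_server()}")

Round-robin's appeal is its simplicity: it requires no load feedback from the servers, though for the same reason it cannot account for requests of unequal cost.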
Figure 1 shows the growth of traffic from several IRDS applications on the NSFNET backbone, comparing it to the overall growth of NSFNET backbone traffic. Although the figure illustrates the exceptional growth of www traffic, simple metrics such as packet or byte counts reveal neither the burden that www traffic imposes on individual servers nor how long before demand reaches a threshold that the server and/or network cannot support.
A rapid growth in information provisioning has occurred without an overall architecture to address models of information resources and their deployment strategies, e.g., optimized interaction between applications and network protocols, or overall network efficiency considerations such as file caching [1]. As a result, measurements that would indicate performance thresholds are not widely available.
Indeed, tracking www statistics on an Internet-wide scale poses difficulties. A more rational architecture for collecting www statistics at an Internet systems level would help, all the more so given the shift from hypertext to images in information server content, which will accelerate traffic growth and likely change the characteristics of the resulting load. As such services become widespread on the Internet, we will need models of IRDS traffic load and of servers' ability to cope.
Prerequisite to such models is an understanding of the nature of the traffic from these new applications. Characterizing the service impact, including that of different types of server content, will require more complex metrics than simple packet, byte, and transaction counts.
In this paper we offer an example of how web workload characterization can suggest refinements to the web architecture. Specifically, we provide a geographic characterization of queries to a well-known, widely and heavily used server. Several people have written log file analysis programs for web servers [2][3][4][5], but to our knowledge none has broken down the requests by geographic source within the United States.
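To make concrete what such a breakdown involves, here is a minimal sketch of per-source tallying from an httpd access log in Common Log Format. Grouping by the requesting host's top-level domain is only a crude proxy for location; the per-state breakdown within the United States pursued in this paper requires additional registry data not shown here, and the regular expression and sample line are illustrative assumptions.

    import re
    from collections import Counter

    # Common Log Format: host ident authuser [date] "request" status bytes
    LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+')

    def tally_by_tld(log_lines):
        """Count requests by the top-level domain of the requesting host."""
        counts = Counter()
        for line in log_lines:
            match = LOG_LINE.match(line)
            if not match:
                continue  # skip malformed lines
            host = match.group(1)
            # Unresolved clients appear as dotted-quad addresses; bucket them.
            tld = host.rsplit(".", 1)[-1].lower()
            counts["numeric" if tld.isdigit() else tld] += 1
        return counts

    if __name__ == "__main__":
        sample = ['cs.example.edu - - [01/Sep/1994:12:00:00 -0500] '
                  '"GET / HTTP/1.0" 200 3025']
        for tld, n in tally_by_tld(sample).most_common():
            print(f"{tld}: {n}")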