Ludmila Cherkasova, Hewlett Packard Labs, 1501 Page Mill Road, Palo Alto, CA 94303, USA, cherkasova@hpl.hp.com
Mohan DeSouza, University of California, Dept. of Computer Science, Riverside, CA 92521, USA, mdesouza@cs.ucr.edu
Jobin James, University of California, Dept. of Computer Science, Riverside, CA 92521, USA, jobin@cs.ucr.edu

The shared Web hosting market targets small and medium-sized businesses.
The most common purpose of a shared hosting web site is marketing, which
means that most of its documents are static. In this setting,
many different sites are hosted on the same hardware: a shared Web hosting
service creates a set of virtual servers on the same physical server.
Each virtual server is set up to write its own access log.
Such a setup, however, splits the ``whole picture''
of web server usage into multiple independent pieces, making it
difficult for the service provider to understand and analyze the
``aggregate'' traffic characteristics.
The situation gets even more complex when a Web hosting infrastructure
is based on a web server farm or cluster, used to create a scalable and
highly available solution.
Several web log analysis tools are available
(Analog, Webalizer, and WebTrends, to name just a few).
They provide detailed
data analysis that helps business sites understand their customers
and their customers' interests. However, these tools lack the
information that is of interest to system administrators: information that provides insight into the system's
resource requirements and traffic access patterns.
Shared Web Hosting Analysis Tool (WHAT) aims to provide a Web hosting service
profile and characterize the system's usage specifics and trends:
- service characterization - a service profile, a
comparative analysis of system resource usage by hosted web sites;
- traffic characterization - a comprehensive analysis of overall
workload with extraction of a few main parameters to characterize it;
- system requirements characterization - a related system resource
usage analysis, especially memory requirements.
These characteristics provide insight into the system's resource
requirements and traffic access patterns, which is the information
of special interest to system administrators and service
providers.
Service Characterization
While traffic growth is exponential for most of the sites, different
sites take different amounts of time to double their traffic.
Some sites experience decreasing traffic rates and
thus demonstrate negative growth. User access patterns differ
significantly too. For example, some sites have a few very popular
documents or products. The accesses to such sites are heavily skewed:
2% of the documents account for 95% of the site's traffic. To
design an efficient, high-quality Web hosting solution, the
specifics of access rates and users' access patterns should be taken
into account. Traffic growth or decline and changes in users' access
patterns should be monitored in order to provision for those
changes in good time and in the most efficient way.
WHAT identifies all the different hosted web sites (from the given
collection of web server access logs). For each hosted web site i,
the tool builds a site profile by evaluating the following
characteristics:
- AR_i - the access rate to a customer's content (in bytes
transferred during the observed period);
- WS_i - the so-called ``working set'': the combined size of all the
files accessed during the observed period (in bytes).
We normalize both AR_i and WS_i with respect to
AR and WS, the corresponding totals
combined over all the sites, in order to identify the percentage
contribution of each particular site.
The access rate AR_i approximates the load
imposed on a server
by the traffic to site i. The working set WS_i
characterizes the memory (RAM) requirements of site i.
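
The site profile described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual WHAT implementation: the function name site_profiles, the (site, path, size) record layout, and the sample host names are all hypothetical, and the access logs are assumed to have been parsed already.

```python
from collections import defaultdict

def site_profiles(records):
    """Build per-site profiles from (site, path, size_bytes) records.

    Assumption: each record is one successful request taken from the
    access log of a hosted site; the record layout is hypothetical.
    """
    bytes_sent = defaultdict(int)    # AR_i: bytes transferred per site
    files_seen = defaultdict(dict)   # site -> {path: size_bytes}
    for site, path, size in records:
        bytes_sent[site] += size
        files_seen[site][path] = size   # a file counts once towards WS_i
    working_set = {s: sum(f.values()) for s, f in files_seen.items()}
    total_ar = sum(bytes_sent.values())
    total_ws = sum(working_set.values())
    return {
        s: {
            "AR": bytes_sent[s],
            "WS": working_set[s],
            # Percentage contribution of the site to the combined totals.
            "AR_pct": 100.0 * bytes_sent[s] / total_ar if total_ar else 0.0,
            "WS_pct": 100.0 * working_set[s] / total_ws if total_ws else 0.0,
        }
        for s in bytes_sent
    }

records = [
    ("a.example", "/index.html", 1000),
    ("a.example", "/index.html", 1000),
    ("a.example", "/logo.gif", 500),
    ("b.example", "/index.html", 2500),
]
profiles = site_profiles(records)
```

In this toy input both sites transfer the same number of bytes, but a.example does so from a smaller working set, so it contributes half of the load while needing less memory.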
These parameters provide a high-level characterization of customers
(hosted web sites) and their system resource requirements. Site
profiles accumulated on a daily or weekly basis make it possible to
derive growth trends for those sites. The ``combined trend'' helps to
evaluate and, more importantly, to predict the overall ``aggregate''
service growth, and to do capacity planning and scaling of the underlying
infrastructure accordingly.
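
One possible way to turn accumulated profiles into a growth trend is to estimate a doubling time from a series of access rates. The sketch below assumes exponential growth and uses an ordinary least-squares fit on the logarithm of the values; this fitting method, the function name doubling_time, and the sample numbers are assumptions made for illustration and are not taken from WHAT.

```python
import math

def doubling_time(values):
    """Estimate the number of periods needed for traffic to double.

    Fits log(value) against the period index by least squares.
    A negative result means traffic is shrinking (its magnitude is the
    halving time); None means no measurable trend.
    Assumes at least two strictly positive values.
    """
    n = len(values)
    xs = range(n)
    ys = [math.log(v) for v in values]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx            # growth rate per period, in log units
    if abs(slope) < 1e-12:
        return None
    return math.log(2) / slope

weekly_ar = [100.0, 200.0, 400.0, 800.0]     # hypothetical weekly AR_i values
growing = doubling_time(weekly_ar)
shrinking = doubling_time(list(reversed(weekly_ar)))
```

Applied to the combined AR over all sites, the same estimate gives a rough indication of how soon the underlying infrastructure will need to scale.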
Traffic Characterization
WHAT reports the number of successful requests (code 200),
conditional_get requests (code 304) and errors (the rest of the
codes). The percentage of conditional_get requests often indicates
the ``reuse'' factor for the documents on a server: these are
documents cached somewhere in the Internet by proxy caches.
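
The three-way split of response codes can be illustrated as follows. This is a minimal sketch, assuming the status codes have already been extracted from the access logs; the function name status_breakdown and the category labels are hypothetical.

```python
from collections import Counter

def status_breakdown(status_codes):
    """Return the percentage of successful, conditional_get and error responses."""
    counts = Counter()
    for code in status_codes:
        if code == 200:
            counts["success"] += 1
        elif code == 304:
            counts["conditional_get"] += 1
        else:
            counts["error"] += 1     # every other code is treated as an error
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        key: 100.0 * counts[key] / total
        for key in ("success", "conditional_get", "error")
    }

breakdown = status_breakdown([200, 200, 304, 404])
```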
WHAT provides statistics for the average response file size
(averaged across all successful requests with code 200). We also build
a characterization of the file size distribution (average response size for 30/60/90% of all code 200 requests).
WHAT reports the percentage of files requested only a few times: the files
requested fewer than 2/6/10 times. This is another important characterization
of traffic, which is
closely connected to document reuse and indicates the memory
(RAM) efficiency for the analyzed workload. Requests for ``onetimers''
(files requested only once) are most likely served from disk. This data helps in
understanding whether performance improvements can be achieved by
optimizing the caching or replacement strategy.
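
The share of rarely requested files can be computed from per-file request counts. The sketch below is an illustration only: the function name rarely_requested_share and the sample paths are hypothetical, and the input is assumed to be the list of requested paths for successful requests.

```python
from collections import Counter

def rarely_requested_share(paths, thresholds=(2, 6, 10)):
    """Percentage of distinct files requested fewer than t times, for each t."""
    hits = Counter(paths)            # path -> number of requests
    n_files = len(hits)
    if n_files == 0:
        return {}
    return {
        t: 100.0 * sum(1 for c in hits.values() if c < t) / n_files
        for t in thresholds
    }

# Hypothetical workload: four files requested 1, 3, 7 and 12 times.
paths = ["/a"] * 1 + ["/b"] * 3 + ["/c"] * 7 + ["/d"] * 12
share = rarely_requested_share(paths)
```

The entry for the threshold 2 is the percentage of ``onetimers'', since a file requested fewer than 2 times was requested exactly once.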
System Requirements Characterization
System requirements are characterized by the combined access rate and
working set of all the hosted sites (during the observed period of
time).
WHAT provides the combined size of ``onetimers''. A high
percentage of ``onetimers'' combined with a small memory size can cause poor site
performance.
In order to characterize the ``locality'' of overall traffic to the
site, we build a table of all accessed files with their sizes and
access frequency information, sorted in decreasing order of
frequency. WHAT provides the working sets for 97/98/99% of all
(200 code) requests. Smaller working sets for 97/98/99% of the
requests indicate higher traffic locality: this means that the most
frequent requests target a smaller set of documents.
This characterization is very important for capacity planning and
provisioning. In practice, the working set for 97/98/99% of the requests makes it possible
to plan the actual RAM size of the underlying
web server hardware. It characterizes the ``actively used'' file subset
independently of the overall ``passive'' file set size.
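
The working set for a given share of requests can be derived from the frequency-ordered file table. The following is a minimal sketch under stated assumptions rather than the tool's actual code: the function name working_set_for_coverage, the (path, size, count) tuple layout, and the sample files are hypothetical, and coverage is expressed in whole percent so that the comparison stays in integer arithmetic.

```python
def working_set_for_coverage(files, coverages=(97, 98, 99)):
    """Combined size of the most popular files covering a share of requests.

    files: list of (path, size_bytes, request_count) tuples.
    coverages: target shares of all requests, in whole percent.
    """
    # Most frequently requested files first, as in the table described above.
    ranked = sorted(files, key=lambda f: f[2], reverse=True)
    total_requests = sum(f[2] for f in ranked)
    result = {}
    for cov in coverages:
        covered = 0    # requests served by the files taken so far
        size = 0       # combined size of those files, in bytes
        for _path, fsize, count in ranked:
            if covered * 100 >= cov * total_requests:
                break
            covered += count
            size += fsize
        result[cov] = size
    return result

files = [
    ("/hot", 1000, 970),
    ("/warm", 5000, 15),
    ("/cool", 20000, 10),
    ("/cold", 100000, 5),
]
ws = working_set_for_coverage(files)
```

In this toy example the full file set occupies 126000 bytes, yet 99% of the requests are covered by 26000 bytes, which is the kind of gap between the ``actively used'' and ``passive'' file sets that the characterization exposes.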
More details on WHAT and its usage can be found in [CDJ99]:
L. Cherkasova, M. DeSouza, J. James:
Web Hosting Analysis Tool for Service Providers.
HP Laboratories Report No. HPL-1999-150, 1999.