|
|
|
Web Engineering Track of WWW2002 |
|
7-11 May, 2002. Honolulu, Hawaii |
|
|
|
Daniel A. Menascé * |
|
Bruno D. Abrahão + |
|
Daniel Barbará & |
|
Virgílio A. F. Almeida + |
|
Flávia P. Ribeiro + |
|
* Dept. of Computer Science - George Mason
University |
|
+ Dept. of Computer Science - Federal University
of Minas Gerais, Brazil |
|
& Dept. of Information and Software
Engineering - George Mason University |
|
|
|
|
Actual workloads are complex and have a large
number of elements. |
|
The log information must therefore be reduced and
summarized. |
|
Capture the most relevant characteristics of
real workloads. |
|
|
|
|
|
|
|
Workload Characterization |
|
Capacity planning of Web services. |
|
Performance understanding. |
|
Improvement of the quality of a customer's
experience at a site. |
|
|
|
|
|
Use the correlation fractal dimension to
improve the understanding of the workloads: |
|
Reduce complexity. |
|
Find hidden relationships. |
|
Improve the efficiency of finding the most
important features in the workload. |
|
|
|
|
|
Clustering techniques group elements that are
similar with respect to some metric. |
|
|
|
Traditional Clustering Algorithms |
|
Use distance as the similarity metric. |
|
Clusters are restricted to regular geometric
shapes. |
|
This can lead to an artificial grouping of elements. |
|
The result can be a meaningless interpretation of the data. |
|
|
|
|
The CVM represents the log as a collection of session
vectors. |
|
Each dimension corresponds to one e-business
function (e.g., search, add, pay). |
|
Entry i of the j-th session vector indicates the number of times
function i was invoked during the j-th session. |
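
A minimal sketch of how such session vectors could be built, assuming the log has already been reduced to (session id, function) pairs. The function names in `FUNCTIONS`, the helper `build_session_vectors`, and the sample requests are hypothetical and only illustrate the idea.

```python
from collections import Counter, defaultdict

# Hypothetical subset of e-business functions (illustration only).
FUNCTIONS = ["browse", "search", "add", "pay"]

def build_session_vectors(requests, functions=FUNCTIONS):
    """Turn (session_id, function) pairs into per-session count vectors."""
    counts = defaultdict(Counter)
    for session_id, function in requests:
        counts[session_id][function] += 1
    # One vector per session; entry i counts invocations of functions[i].
    return {sid: [c[f] for f in functions] for sid, c in counts.items()}

# Made-up requests for two sessions.
requests = [("s1", "browse"), ("s1", "search"), ("s1", "search"),
            ("s2", "browse"), ("s2", "add"), ("s2", "pay")]
vectors = build_session_vectors(requests)
```

Each session becomes one point in a space with one dimension per function, which is the input to the clustering steps.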
|
|
|
|
|
CVMs extracted from the HTTP log of an online
bookstore |
|
Period: 15 days |
|
955,818 HTTP requests |
|
130,314 sessions |
|
Requests were mapped to 12 e-business functions,
i.e., the CVM has 12 dimensions |
|
|
|
|
|
|
|
|
k-means |
|
Input: k – the number of clusters we intend to
find. |
|
What is the best k? |
|
Maximize inter-cluster distance. |
|
Minimize intra-cluster distance. |
|
Output: |
|
k centroids of the k clusters. |
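
A minimal k-means sketch on toy session vectors, with one possible way to compare values of k. The seeding with the first k points, the `intra_inter_ratio` score, and the sample data are assumptions made for illustration, not the procedure used in the study.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign(points, centroids):
    """Group points by the index of their nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
        clusters[nearest].append(p)
    return clusters

def kmeans(points, k, max_iter=100):
    # Assumption: seed with the first k points so the run is reproducible.
    centroids = [list(p) for p in points[:k]]
    for _ in range(max_iter):
        clusters = assign(points, centroids)
        # New centroid = per-dimension mean; an empty cluster keeps its centroid.
        updated = [[sum(col) / len(c) for col in zip(*c)] if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)

def intra_inter_ratio(centroids, clusters):
    """Mean point-to-centroid distance over mean centroid-to-centroid distance.
    Lower is better; requires at least two centroids."""
    intra = [dist(p, c) for c, members in zip(centroids, clusters) for p in members]
    inter = [dist(a, b) for i, a in enumerate(centroids) for b in centroids[i + 1:]]
    return (sum(intra) / len(intra)) / (sum(inter) / len(inter))

# Made-up two-dimensional session vectors forming two obvious groups.
sessions = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centroids, clusters = kmeans(sessions, k=2)
ratio = intra_inter_ratio(centroids, clusters)
```

Running the same code for several values of k and comparing `ratio` is one simple way to approach the "best k" question.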
|
|
|
|
Analyzing the results for 7 clusters: |
|
9.2% are customers who changed their mind. |
|
0.3% bought something. |
|
85.8% are non-serious customers (hit&run). |
|
4.7% are robots. |
|
|
|
|
The centroids of the buyer and crawler-robot clusters are the same
regardless of the number of clusters. |
|
Crawlers and shopbots seem to be very well
separated. |
|
With k = 5, k-means does not distinguish between the changed-their-mind (chm)
and hit&run clusters, which differ in add activity. |
|
|
|
|
Fractal Clustering (FC) uses the correlation fractal dimension as the
similarity metric. |
|
Capable of recognizing clusters of arbitrary
shapes. |
|
Explores the self-similar (or fractal)
properties of the data. |
|
|
|
|
The correlation fractal dimension D can be computed by the pair-count
method or by the box-counting method. |
|
The resulting plots follow a power law. |
|
The slope of the log-log plot is the correlation
fractal dimension of the dataset. |
|
D is sensitive to the spatial distribution of
points in the dataset. |
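
A minimal sketch of the box-counting method, assuming the data are scaled to the unit cube: for each box side r, count the points per grid cell, sum the squared counts, and fit the slope of log(sum) against log(r). The helper names, the choice of box sizes, and the test points are assumptions for illustration.

```python
import math
from collections import Counter

def box_count_sums(points, radii):
    """For each box side r, sum the squared occupancy of every grid cell."""
    sums = []
    for r in radii:
        cells = Counter(tuple(math.floor(x / r) for x in p) for p in points)
        sums.append(sum(c * c for c in cells.values()))
    return sums

def slope(xs, ys):
    """Least-squares slope of ys against xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

def correlation_fractal_dimension(points, radii):
    """Estimate D as the slope of the log-log plot."""
    sums = box_count_sums(points, radii)
    return slope([math.log(r) for r in radii], [math.log(s) for s in sums])

# Assumed box sizes for data scaled to [0, 1).
radii = [1 / 4, 1 / 8, 1 / 16, 1 / 32]

# Points on a diagonal line embedded in two dimensions.
line = [(i / 1024, i / 1024) for i in range(1024)]
d_line = correlation_fractal_dimension(line, radii)
```

The line lives in a two-dimensional space but its estimated dimension is close to 1, which illustrates how D reflects the spatial distribution of the points rather than the number of attributes.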
|
|
|
|
1. Compute D of dataset S. |
|
while there are attributes in S { |
|
2. For each attribute i, compute the partial fractal dimension (D of S without attribute i). |
|
3. Select the attribute k that leads to the minimum difference between D and the partial fractal dimension. |
|
4. Remove k from S. |
|
5. Compute D of S. |
|
} |
|
This is an effective way of reducing the
dimensionality of the problem, and it also yields, as a byproduct,
information about the features. |
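
The loop above can be sketched as follows. This is an illustration under stated assumptions: D is estimated by box counting on data scaled to the unit cube, the loop stops when one attribute is left (the partial dimension of an empty attribute set is undefined), and the helper names and toy dataset are hypothetical.

```python
import math
from collections import Counter

RADII = [1 / 4, 1 / 8, 1 / 16, 1 / 32]  # assumed box sizes for data in [0, 1)

def fractal_dimension(points, attrs, radii=RADII):
    """Box-counting estimate of D using only the columns listed in attrs."""
    logs_r, logs_s = [], []
    for r in radii:
        cells = Counter(tuple(math.floor(p[a] / r) for a in attrs) for p in points)
        logs_r.append(math.log(r))
        logs_s.append(math.log(sum(c * c for c in cells.values())))
    n = len(radii)
    mr, ms = sum(logs_r) / n, sum(logs_s) / n
    return (sum((x - mr) * (y - ms) for x, y in zip(logs_r, logs_s))
            / sum((x - mr) ** 2 for x in logs_r))

def rank_attributes(points, n_attrs):
    """Repeatedly drop the attribute whose removal changes D the least."""
    remaining = list(range(n_attrs))
    d = fractal_dimension(points, remaining)            # step 1
    removed = []
    while len(remaining) > 1:                           # assumption: keep one
        partial = {a: fractal_dimension(points, [b for b in remaining if b != a])
                   for a in remaining}                  # step 2
        k = min(remaining, key=lambda a: abs(d - partial[a]))  # step 3
        removed.append((k, abs(d - partial[k])))
        remaining.remove(k)                             # step 4
        d = partial[k]                                  # step 5: D of reduced S
    return removed, remaining

# Toy data: attributes 0 and 1 are independent, attribute 2 duplicates 0.
data = [(i / 32, j / 32, i / 32) for i in range(32) for j in range(32)]
removed, remaining = rank_attributes(data, 3)
```

In this toy dataset one of the two duplicated attributes is removed first with essentially no change in D, while removing the independent attribute changes D by about 1.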
|
|
|
|
|
|
|
|
Robots were considered outliers. |
|
FC results include only human sessions. |
|
|
|
|
|
Using the chi-square test, FC revealed a correlation between
(browse + info + search) and add in every cluster. |
|
Clusters are well formed. |
|
FC exposes features that are undetectable in the k-means results. |
|
E.g.: |
|
K-means does not provide a sharp differentiation
on add activity, while FC captured those differences in the clusters. |
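
A minimal sketch of a chi-square test of independence on a 2x2 table, for example sessions with high or low (browse + info + search) activity against sessions with or without an add. The split into high and low activity and the counts are made up for illustration and are not results from the study.

```python
import math

def chi_square_2x2(table):
    """Chi-square statistic and p-value (1 degree of freedom) for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = [a + b, c + d]
    cols = [a + c, b + d]
    stat = 0.0
    for i, row_total in enumerate(rows):
        for j, col_total in enumerate(cols):
            expected = row_total * col_total / n
            stat += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom the survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Made-up counts: rows = high/low navigation, columns = add / no add.
table = [[30, 10], [10, 30]]
stat, p_value = chi_square_2x2(table)
```

A small p-value suggests the two properties are not independent within the cluster being tested.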
|
|
|
|
|
|
|
|
|
|
|
|
|
Distance-based clustering cannot uncover some
characteristics of the dataset. |
|
Fractal methods can simplify and improve the
quality of characterization. |
|
Reduce the number of attributes and tuples to be
considered. |
|
Reveal relevant correlations among attributes. |
|
Improve the quality of clusters, leading to
meaningful interpretation of the dataset. |
|
|
|
|
|
|
|