Fractal Characterization of Web Workloads
Web Engineering Track of WWW2002
7-11 May, 2002. Honolulu, Hawaii
Daniel A. Menascé *
Bruno D. Abrahão +
Daniel Barbará &
Virgílio A. F. Almeida +
Flávia P. Ribeiro +
* Dept. of Computer Science - George Mason University
+ Dept. of Computer Science - Federal University of Minas Gerais, Brazil
& Dept. of Information and Software Engineering - George Mason University
Problem
Actual workloads are complex and have a large number of elements.
Log information must therefore be reduced and summarized.
Capture the most relevant characteristics of real workloads.
Motivation
Workload Characterization
Capacity planning of Web services.
Performance understanding.
Improvement of the quality of a customer's experience at a site.
Goal
Use the correlation fractal dimension to improve the understanding of workloads:
Reduce complexity.
Find hidden relationships.
Improve the efficiency of finding the most important features in the workload.
Data Reduction: Clustering
Clustering techniques group elements that are similar with respect to some metric.
Traditional Clustering Algorithms
Use distance as the similarity metric.
Clusters are restricted to regular geometric shapes.
Artificial grouping of elements.
Meaningless interpretation of the data.
Workload Representation
Represents the log as a collection of session vectors.
Each dimension corresponds to one e-business function (e.g., search, add, pay).
Each vector entry indicates the number of times function i was invoked during the j-th session.
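The session-vector representation can be illustrated with a minimal Python sketch. The function list, the `(session_id, function)` log format, and the name `build_session_vectors` are assumptions made for illustration only; the case study itself uses 12 e-business functions.

```python
from collections import defaultdict

# Hypothetical subset of e-business functions (the case study uses 12).
FUNCTIONS = ["search", "add", "pay"]

def build_session_vectors(log_entries, functions=FUNCTIONS):
    """Turn (session_id, function) pairs into per-session count vectors."""
    index = {name: i for i, name in enumerate(functions)}
    vectors = defaultdict(lambda: [0] * len(functions))
    for session_id, function in log_entries:
        if function in index:  # skip requests that map to no known function
            vectors[session_id][index[function]] += 1
    return dict(vectors)

# Toy log, invented for this example.
log = [("s1", "search"), ("s1", "search"), ("s1", "add"),
       ("s2", "search"), ("s1", "pay")]
vectors = build_session_vectors(log)
```

Each value in `vectors` is one session vector, with one dimension per function.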
Case Study
CVMs extracted from the HTTP log of an online bookstore
Period: 15 days
955,818 HTTP requests
130,314 sessions
Requests were mapped to 12 e-business functions, i.e., the CVM has 12 dimensions
E-business Functions
Distance Based Clustering
k-means
Input: k – the number of clusters we intend to find.
What is the best k?
Maximize inter-cluster distance.
Minimize intra-cluster distance.
Output:
k centroids of the k clusters.
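As a rough illustration of distance-based clustering, the following is a plain-Python sketch of the standard k-means iteration (Lloyd's algorithm), not the implementation used in the study. The helper `cluster_quality` is a hypothetical way to quantify the two criteria above; it assumes k is at least 2.

```python
import random
from itertools import combinations

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, iterations=100, seed=0):
    """Return (centroids, assignment) after iterating until assignments stabilize."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # random initial centroids
    assignment = None
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment

def cluster_quality(points, centroids, assignment):
    """Mean intra-cluster squared distance and smallest inter-centroid squared distance."""
    intra = sum(squared_distance(p, centroids[a])
                for p, a in zip(points, assignment)) / len(points)
    inter = min(squared_distance(a, b) for a, b in combinations(centroids, 2))
    return intra, inter

# Toy data, invented for this example: two well-separated groups.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centroids, assignment = k_means(points, k=2)
intra, inter = cluster_quality(points, centroids, assignment)
```

Running `k_means` for several values of k and comparing `intra` against `inter` is one simple way to approach the "best k" question.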
K-means Results
Analyzing the results for 7 clusters:
9.2% of customers changed their mind.
0.3% bought something.
85.8% were non-serious customers (hit&run).
4.7% are robots.
K-means Results
The centroids of the buyer and crawler-robot clusters stay the same regardless of the number of clusters.
Crawlers and Shopbots seem to be very well separated.
With k = 5, k-means does not distinguish between changed-their-mind (chm) and hit&run sessions, which differ in add activity.
Fractal Clustering
Uses Correlation Fractal Dimension as the similarity metric.
Capable of recognizing clusters of arbitrary shapes.
Explores the self-similar (or fractal) properties of the data.
Correlation Fractal Dimension
Can be computed by the pair-count method or by the box counting method.
Resulting plots follow a Power Law.
The slope of the log-log plot is the Correlation Fractal Dimension of the dataset.
The Correlation Fractal Dimension is sensitive to the spatial distribution of points in the dataset.
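A box-counting estimate can be sketched as follows. The sketch assumes one common formulation: for each grid cell size r, sum the squared cell occupancies, then fit the slope of log(sum) against log(r) by least squares. The function names and the chosen cell sizes are illustrative assumptions.

```python
import math
from collections import Counter

def sum_of_squared_occupancies(points, cell_size):
    """Overlay a grid of the given cell size and sum the squared point counts per cell."""
    cells = Counter(tuple(math.floor(x / cell_size) for x in p) for p in points)
    return sum(c * c for c in cells.values())

def correlation_fractal_dimension(points, cell_sizes):
    """Least-squares slope of log S(r) versus log r over the given cell sizes."""
    xs = [math.log(r) for r in cell_sizes]
    ys = [math.log(sum_of_squared_occupancies(points, r)) for r in cell_sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator

# Synthetic data: a diagonal line and a filled square, both embedded in 2 dimensions.
line = [(i / 4096, i / 4096) for i in range(4096)]
square = [(i / 64, j / 64) for i in range(64) for j in range(64)]
d_line = correlation_fractal_dimension(line, [1 / 4, 1 / 8, 1 / 16, 1 / 32, 1 / 64])
d_square = correlation_fractal_dimension(square, [1 / 2, 1 / 4, 1 / 8, 1 / 16])
```

Both synthetic datasets have two attributes, yet the line yields a dimension close to 1 and the square close to 2, which shows the sensitivity to how points are distributed in space.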
Attribute Selection Algorithm
1. Compute D of dataset S.
while there are attributes in S {
    2. For each attribute i, compute the partial fractal dimension.
    3. Select the attribute k that leads to the minimum
       difference between D and the partial fractal dimension.
    4. Remove k from S.
    5. Compute D of S.
}
This is an effective way to reduce the dimensionality of the problem, and it also yields information about the features as a byproduct.
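The selection loop can be sketched in Python under two stated assumptions: the partial fractal dimension for attribute i is taken to be the dimension of the dataset with attribute i left out, and the loop stops when a single attribute remains. The box-counting estimator, the cell sizes, and the function names are illustrative choices, not the authors' code.

```python
import math
from collections import Counter

def fractal_dimension(points, attributes, cell_sizes=(0.5, 0.25, 0.125, 0.0625)):
    """Box-counting estimate of the correlation fractal dimension on the given attribute indices."""
    xs, ys = [], []
    for r in cell_sizes:
        cells = Counter(tuple(math.floor(p[a] / r) for a in attributes) for p in points)
        xs.append(math.log(r))
        ys.append(math.log(sum(c * c for c in cells.values())))
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

def rank_attributes(points, n_attributes):
    """Return (removal_order, remaining); attributes removed first contribute least to D."""
    remaining = list(range(n_attributes))
    removal_order = []
    d = fractal_dimension(points, remaining)                      # step 1
    while len(remaining) > 1:                                     # assumption: keep the last attribute
        partial = {i: fractal_dimension(points, [a for a in remaining if a != i])
                   for i in remaining}                            # step 2
        k = min(remaining, key=lambda i: abs(d - partial[i]))     # step 3
        remaining.remove(k)                                       # step 4
        removal_order.append((k, d - partial[k]))
        d = fractal_dimension(points, remaining)                  # step 5
    return removal_order, remaining

# Synthetic data: attribute 2 is an exact copy of attribute 0, so one of them is redundant.
points = [(i / 32, j / 32, i / 32) for i in range(32) for j in range(32)]
removal_order, remaining = rank_attributes(points, 3)
```

On this synthetic dataset the first attribute removed is one of the two redundant copies, because dropping it leaves the fractal dimension essentially unchanged.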
Attribute Selection
Fractal Clustering Method (FC)
Fractal Clustering Results
Robots were considered outliers.
FC results include only human sessions.
FC Results
Using the Chi-Square test, FC revealed a correlation between (browse + info + search) and add in every cluster.
Clusters are well formed.
Reveals characteristics that are undetectable in the k-means results.
E.g.:
K-means does not sharply differentiate add activity, whereas FC captured those differences across clusters.
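A Chi-Square test of independence between two kinds of activity can be sketched as below. The contingency tables are invented toy counts, not data from the study, and the row and column interpretation (sessions with high or low browsing activity versus sessions with or without add) is an assumption made for illustration.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic

# Toy counts (invented): rows = high / low browsing, columns = add / no add.
associated = [[30, 10], [10, 30]]
independent = [[20, 20], [20, 20]]

CRITICAL_VALUE = 3.841  # chi-square critical value for 1 degree of freedom at the 5% level
stat_associated = chi_square_statistic(associated)
stat_independent = chi_square_statistic(independent)
is_correlated = stat_associated > CRITICAL_VALUE
```

A statistic above the critical value rejects independence, which is the sense in which two activities are reported as correlated within a cluster.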
Comparing Results
K-means Behavior
K-means Behavior
FC Behavior
Conclusion
Distance-based clustering cannot uncover some characteristics of the dataset.
Fractal methods can simplify and improve the quality of characterization.
Reduce the number of attributes and tuples to be considered.
Reveal relevant correlations among attributes.
Improve the quality of clusters, leading to meaningful interpretation of the dataset.