In this work, two distinct metrics are proposed that quantify the importance of a web page based on the visits it receives from users and on its location within the website. Subsequently, guidelines are presented that can be used to reorganize the website so as to optimize these metrics. Finally, we evaluate the proposed algorithms on real-world website data and verify that they exhibit more elaborate behavior than a related, simpler technique.
Keywords: Web Metrics, Web Organization, Log File Processing.
Analysis of user visits is the first step in any kind of website evaluation procedure; to assist in this, many commercial systems provide statistics about the most visited files and pages. However, it is shown in [4] that the number of hits per page, calculated from log file processing, is an unreliable indicator of page popularity. Thus, a refined metric is proposed there that takes structural information into account; using it, certain pages are reorganized, leading to an overall improvement in site access. Other researchers have attempted to identify user behavior patterns (Chen et al. [2]) and to analyze the paths that users follow within a site (Berkhin et al. [1]). A very influential recent work is that of Srikant and Yang, who furthermore suggest structural changes to the website after having identified visit patterns that deviate from the site's initial organization. On the other hand, very little progress has been made towards a software tool, or even a framework, that would assist in automatically applying changes to the website. Some early steps in this direction can be found in [3].
This paper introduces two new popularity metrics. The first differentiates between users coming from within the website and users arriving from other websites, while the second uses a probability model to reassess popularity. A key feature of the new metrics is the higher fidelity of their popularity estimates. We evaluate and examine these metrics by comparing them with the metric introduced in [4].
The Absolute Accesses ($AA_i$) to a specific page $i$ of a site are not a reliable metric for estimating page popularity. Thus, Garofalakis et al. [4] defined the refined metric Relative Access ($RA_i$) as:

$$RA_i = a_i \cdot AA_i \qquad (1)$$

The $RA$ of page $i$ results from multiplying $AA_i$ by a coefficient $a_i$. The purpose of $a_i$ is to skew $AA_i$ in a way that better indicates the page's actual importance, incorporating topological information, namely the page depth within the site $d_i$, the number of pages at the same depth $n_i$, and the number of pages within the site pointing to it, $r_i$. Thus $a_i = d_i + n_i / r_i$. This metric will henceforth be referred to as GKM.
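For concreteness, the GKM computation can be sketched in a few lines of Python; this is our own illustrative rendering of Equation 1, not code from [4]:

```python
def gkm_ra(aa, d, n, r):
    """GKM Relative Access (Equation 1): RA_i = a_i * AA_i, a_i = d_i + n_i/r_i.

    aa -- absolute accesses AA_i of page i
    d  -- depth d_i of page i within the site
    n  -- number of pages n_i at the same depth as i
    r  -- number of pages r_i within the site pointing to i (assumed >= 1)
    """
    return (d + n / r) * aa

# Example: a page at depth 3 with 12 same-depth pages, 2 in-links, 500 hits:
print(gkm_ra(aa=500, d=3, n=12, r=2))  # a_i = 3 + 6 = 9, so RA_i = 4500
```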
It is crucial to observe that a web page is accessed in four different ways: first, from within the site; second, directly via bookmarks; third, through incoming links from the outside world; and finally, by typing its URL directly. Based on this observation, we can decompose $a_i$ into two factors, $a_{i,\mathrm{in}}$ and $a_{i,\mathrm{out}}$. The first, $a_{i,\mathrm{in}}$, reflects the ease of discovering page $i$ under the specific site organization. This factor is similar to that of [4], but significantly redefined. The new factor that we introduce, $a_{i,\mathrm{out}}$, designates the importance of a specific page for the outside world, i.e., pages pointing to $i$ from other domains and bookmarks to $i$. Subsequently, we discuss two approaches to defining $a_{i,\mathrm{in}}$ and $a_{i,\mathrm{out}}$.
We define the following quantities: $d_i$ is the tree depth of page $i$ and $n_i$ denotes the outdegree of page $i$. Thus $a_{i,\mathrm{in}}$ is defined as:

$$a_{i,\mathrm{in}} = \frac{d_i}{d_{\max}} + \frac{n_j}{n_{\max}} + \frac{1}{r_i} \qquad (2)$$

where $j$ is the parent of $i$ inside the site's tree and $r_i$ is, as before, the number of in-site pages pointing to $i$.
Hence, $a_{i,\mathrm{in}}$ depends on the depth of page $i$ (the deeper $i$ lies, the more valuable its discovery), the number of $i$'s siblings (many siblings raise the importance of choosing $i$), and the number of other pages pointing to it (the fewer ways there are to access $i$, the more weight its accesses should receive). Normalizing each influence factor yields $a_{i,\mathrm{in}} \le 3$.
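A minimal Python sketch of this factor follows, assuming the normalized three-term form of Equation 2; the normalizers $d_{\max}$ and $n_{\max}$ (site-wide maxima) and the $1/r_i$ term are our reading of the text:

```python
def a_in(d_i, d_max, n_parent, n_max, r_i):
    """Ease-of-discovery factor a_{i,in} <= 3 (Equation 2, as reconstructed).

    d_i / d_max      -- page depth, normalized: deeper pages score higher
    n_parent / n_max -- sibling count (outdegree of i's parent j), normalized
    1 / r_i          -- inverse of the in-site links to i (r_i >= 1): the fewer
                        ways to reach i, the more weight each access receives
    """
    return d_i / d_max + n_parent / n_max + 1.0 / r_i
```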
In order to define $a_{i,\mathrm{out}}$ we denote by $B_i$ the number of bookmarks for page $i$, and by $L_i$ the number of links from other web pages to page $i$. So:

$$a_{i,\mathrm{out}} = \frac{d_i}{d_{\max}} \cdot \frac{B_i}{B_{\max}} + \frac{L_i}{L_{\max}} + 1 \qquad (3)$$
Equation 3 implies that $a_{i,\mathrm{out}}$ depends on the number of both bookmarks and links from the outside to page $i$. Many bookmarks indicate high popularity among the visitors of our site, which is exactly what we want to capture. Moreover, we include the factor $d_i/d_{\max}$, weighing the number of bookmarks with respect to the page's depth. The constant additive factor at the end ensures that $a_{i,\mathrm{out}} \le 3$, just as with $a_{i,\mathrm{in}}$, so that both have the same maximum potential influence on $RA_i$ (see Equation 4).
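Analogously, a hedged sketch of the external factor, again under our reconstruction of Equation 3:

```python
def a_out(b_i, b_max, l_i, l_max, d_i, d_max):
    """External-importance factor a_{i,out} <= 3 (Equation 3, as reconstructed).

    (d_i/d_max) * (b_i/b_max) -- bookmarks to i, weighted by page depth
    l_i / l_max               -- links from other domains to i, normalized
    + 1                       -- the constant additive factor capping a_out at 3
    """
    return (d_i / d_max) * (b_i / b_max) + l_i / l_max + 1.0
```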
In order to apply our newly derived factors, we also separate $AA_i$ into two parts, $AA_{i,\mathrm{in}}$ and $AA_{i,\mathrm{out}}$, which count the accesses originating from inside the site, exclusively for page $i$ (and not for its children), and the accesses having an antecedent from another domain, respectively. So the relative access for page $i$ is:

$$RA_i = a_{i,\mathrm{in}} \cdot AA_{i,\mathrm{in}} + a_{i,\mathrm{out}} \cdot AA_{i,\mathrm{out}} \qquad (4)$$
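Putting the pieces together, the TOP score for a single page looks as follows; every number here is hypothetical and serves only to make the arithmetic concrete:

```python
# Illustrative TOP computation for one page (all values hypothetical).
d_i, d_max = 3, 4         # page depth / maximum site depth
n_j, n_max = 5, 20        # parent outdegree (sibling count) / max outdegree
r_i = 2                   # in-site pages pointing to i
b_i, b_max = 10, 50       # bookmarks to i / site maximum
l_i, l_max = 4, 30        # external links to i / site maximum
aa_in, aa_out = 400, 100  # accesses to i split by origin (inside / outside)

a_in = d_i / d_max + n_j / n_max + 1.0 / r_i               # Eq. 2 (reconstructed)
a_out = (d_i / d_max) * (b_i / b_max) + l_i / l_max + 1.0  # Eq. 3 (reconstructed)
ra_i = a_in * aa_in + a_out * aa_out                       # Eq. 4
print(f"a_in={a_in:.2f}  a_out={a_out:.2f}  RA_i={ra_i:.1f}")
# a_in=1.50  a_out=1.28  RA_i=728.3
```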
It is tempting to model traffic inside a site using a random walk approach (e.g., as in [5]), but $a_{i,\mathrm{in}}$ models the ease (or difficulty) of access that a certain site infrastructure imposes on the user. Thus, a page's relative weight should be increased in inverse proportion to its access probability. We consider the site's structure as a directed acyclic graph (DAG), where $v_i$ denotes page $i$. A user starts at the root page $v_r$, looking for an arbitrary site page $v_t$, and at each node $v_i$ he makes one of two kinds of decisions: either he stops browsing, or he follows one of the $out(v_i)$ links to pages on the same site. If we consider each choice equally probable, the probability $p_i$ of each decision is $p_i = (out(v_i)+1)^{-1}$.
Considering a path $W_j = \{v_r, \ldots, v_t\}$ and computing the routing probabilities at each step, the probability of ending up at $v_t$ via $W_j$ is:

$$P(W_j) = \prod_{v_i \in W_j \setminus \{v_t\}} p_i \qquad (5)$$
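Under this model, the per-node decision probability and the probability of one concrete path can be sketched as follows; the adjacency-dict representation of the site graph is our own choice:

```python
def p(graph, v):
    """Probability of each of the out(v)+1 equally likely decisions at node v:
    stop browsing, or follow one of the out(v) in-site links."""
    return 1.0 / (len(graph[v]) + 1)

def path_probability(graph, path):
    """Probability of reaching path[-1] along `path` (Equation 5): the product
    of p over every node on the path except the target itself."""
    prob = 1.0
    for v in path[:-1]:
        prob *= p(graph, v)
    return prob
```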
There may be more than one path leading to $v_t$, namely $W_1, W_2, \ldots, W_k$. Thus $D_t$, the overall probability of discovering $v_t$, is:

$$D_t = \sum_{j=1}^{k} P(W_j) \qquad (6)$$
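Because the graph is a DAG, $D_t$ need not be computed by enumerating paths; probability mass can be propagated once in topological order. A minimal sketch, assuming the same adjacency-dict representation (function and variable names are ours):

```python
from graphlib import TopologicalSorter

def discovery_probabilities(graph, root):
    """D_v for every node v (Equation 6): the total probability, over all
    root-to-v paths, of the product of p_u = 1/(out(u)+1) along each path."""
    preds = {v: set() for v in graph}          # graphlib wants predecessor sets
    for u, outs in graph.items():
        for v in outs:
            preds[v].add(u)
    d = {v: 0.0 for v in graph}
    d[root] = 1.0                              # the trivial path reaches the root
    for u in TopologicalSorter(preds).static_order():
        p_u = 1.0 / (len(graph[u]) + 1)
        for v in graph[u]:                     # extend every path through u
            d[v] += d[u] * p_u
    return d

# Tiny example: root links to a and b, both of which link to c.
site = {"root": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}
print(discovery_probabilities(site, "root"))   # D_a = D_b = 1/3, D_c = 1/3
```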
Considering page $i$ as the target, the higher $D_i$ is, the lower $a_{i,\mathrm{in}}$ shall be, so we choose $a_{i,\mathrm{in}} = 1 - D_i$. We also let $a_{i,\mathrm{out}} = 1$, considering each external access as a path of length one. Thus we define $RA_i$ as:

$$RA_i = (1 - D_i) \cdot AA_{i,\mathrm{in}} + AA_{i,\mathrm{out}} \qquad (7)$$
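Given the $D$ values from Equation 6, the PROB score per page then reduces to one line; `aa_in` and `aa_out` are the same origin-split access counts used by TOP:

```python
def ra_prob(d_i, aa_in, aa_out):
    """PROB relative access (Equation 7): RA_i = (1 - D_i)*AA_in + AA_out."""
    return (1.0 - d_i) * aa_in + aa_out
```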
Our metrics will be referred to as TOP and PROB, for short.
In order to evaluate the results of the described algorithms, we used the web server log from http://www.ceid.upatras.gr (Computer Engineering & Informatics Dept., University of Patras). We obtained a web log covering 44 days, comprising 1,320,819 records (hits) and 3,596 unique visitors (Feb 2002 - March 2002). After analyzing the site structure to identify the pages of interest, we recognized a structure of 4 levels and 98 pages. We also implemented pre-processing, parsing, distilling, and extraction procedures in order to filter unwanted raw data out of the log files and focus only on entries corresponding to pure HTML pages. We implemented our proposed algorithms and metrics, the corresponding pre-processing procedures, and the GKM algorithm and metric in the MathWorks MATLAB v6.5 language.
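For illustration, the kind of filtering this pre-processing performs can be sketched as follows; the original pipeline was written in MATLAB, and the Common Log Format assumption, the status-code check, and the regular expression here are ours:

```python
import re

# Successful GET requests for HTML pages or directory indexes, assuming
# Common Log Format lines such as: ... "GET /staff/index.html HTTP/1.0" 200 ...
HTML_HIT = re.compile(r'"GET (\S*(?:\.html?|/)) HTTP/[\d.]+" 200 ')

def html_hits(log_lines):
    """Yield the request path of every hit on a pure HTML page, discarding
    images, stylesheets, scripts, errors and other raw log entries."""
    for line in log_lines:
        m = HTML_HIT.search(line)
        if m:
            yield m.group(1)
```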
The GKM metric provides a rough estimate of the multiplicative factor $a_i$. As the latter is affected by the number of pages at the same level, the RAs computed for sites of large width are considerably larger than those stemming from our metrics. Another issue is that pages lying deep down the site hierarchy benefit disproportionately, because of the tree-like evolution of websites. Both our approaches alleviate such phenomena, displaying a more balanced approach to "node scoring". In the TOP metric, we normalize the influence of both depth and width on the RA score, and we do not take into account every page at the same depth but only the siblings of a given node. Our experimental results show that the figures computed by our metrics are about an order of magnitude smaller than the respective GKM figures, an indication that our techniques are more elaborate. Another advantage of our approaches is the incorporation of bookmarks, which are an important indicator of page quality.
This work aims to provide refined metrics, useful techniques, and a fundamental basis for high-fidelity website reorganization methods and applications. Future steps include the description of a framework that would evaluate combinations of reorganization metrics with different sets of redesign proposals. We also consider as an open issue the definition of an overall website grading method that would quantify the quality and visits of a given site before and after reorganization.