Auditing Web Sites Using Their Access Patterns

Émilie Danna, Arnaud Laroche
École Nationale de la Statistique et de l'Administration Économique, France
Gesmad, France
Emilie.Danna@ensae.fr, laroche@gesmad.fr

Introduction

Web sites are no longer merely showcases for the companies they represent. They are useful both to their users and to their owners. A web site should therefore meet two objectives: first, let users find the information they are looking for, or simply enjoy browsing the site; second, achieve its owner's aims, such as selling products, broadening the user base and winning user loyalty. How can we tell whether a web site meets these two objectives?

How can a web site be evaluated and improved?

Current methods, in which a web professional rates a series of the site's characteristics or users fill in an on-line questionnaire, have three drawbacks:

The population sample reviewing the web site is not representative of the whole population visiting the web site, and hence the results are not significant.
The reliability and accuracy of the information collected cannot be guaranteed, because users' answers reflect only what they perceive, what they remember and what they are willing to disclose about the features they liked or disliked, not what they actually experienced while visiting the web site.
A questionnaire is often a directed knowledge discovery tool: its questions reflect prior assumptions about the web site.

On the other hand, various data mining techniques have been applied to web server logs to create adaptive web sites: web sites that automatically improve their organization and presentation by learning from user access patterns. Although this aim is very broad, only specific tasks have been considered so far, such as index page synthesis, linking similar pages or highlighting links to popular pages.

For these reasons we present another method for auditing a web site and proposing a complete redesign of it. By mining user access logs with a fuzzy clustering algorithm, we identify user profiles whose interpretation yields recommendations for changing the web site. This method addresses the three drawbacks mentioned above. By drawing the sample with a simple random algorithm, we ensure that it is representative of the whole population visiting the web site. By collecting data directly from the web server log, we avoid any psychological bias on the part of users. Because an undirected knowledge discovery algorithm is used, the results do not depend on any a priori view of the web site's design that we might impose.
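The sampling step can be illustrated with a minimal sketch. It assumes that the log has already been reduced to (visitor identifier, URL) records and that each visitor corresponds to one session; how visitors are identified and how sessions are delimited is not described here, so the record format and the function names below are hypothetical.

```python
import random
from collections import defaultdict


def group_into_sessions(records):
    """Group (visitor_id, url) log records into one page list per visitor.

    Assumption: one visitor equals one session; a real log would need
    an explicit rule for splitting visits.
    """
    sessions = defaultdict(list)
    for visitor_id, url in records:
        sessions[visitor_id].append(url)
    return dict(sessions)


def sample_sessions(sessions, sample_size, seed=None):
    """Draw a simple random sample of sessions, without replacement."""
    rng = random.Random(seed)
    # Sorting the keys makes the draw reproducible for a given seed.
    chosen = rng.sample(sorted(sessions), sample_size)
    return {visitor_id: sessions[visitor_id] for visitor_id in chosen}
```

Every session has the same probability of being drawn, which is what makes the sample representative of the visiting population rather than of the users who volunteer to answer a questionnaire.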

Creating user profiles

In order to use a clustering algorithm to create user profiles, we need a dissimilarity measure to determine if two user sessions are similar. We have built a dissimilarity measure based on:

the number of pages seen during each session, since this number is a measure of the user's interest in the web site.
the ratio of the number of content pages to the number of auxiliary pages browsed during each session, since this ratio indicates whether or not the user was able to find quickly what he/she was looking for.
the content of each page browsed, approximated by its URL.
the ``compactness'' of each session. This indicator approximates the number of different subjects of interest the user browsed, that is, the dispersion of the contents of all the pages seen by the user.

Strong emphasis is placed on pages that the user views for longer than the average user does. This choice is less arbitrary than characterizing a session by its content pages alone or by the pages browsed at the end of the session.
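The exact formula of the dissimilarity measure is not given here, so the following is only an illustrative sketch of how the four indicators and the viewing-time emphasis could be combined. Everything specific in it is an assumption: the subject of a page is taken to be the directory part of its URL, the rule separating content pages from auxiliary pages is supplied by the caller (is_content), mean_view_time maps each URL to the average viewing time over all users, and the four gaps are simply averaged with equal weights.

```python
from collections import Counter
from urllib.parse import urlparse


def topic_of(url):
    """Approximate a page's subject by the directory part of its URL path."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # The last segment is treated as the file name (an assumption).
    return "/".join(parts[:-1]) or "(root)"


def session_profile(pages, is_content, mean_view_time):
    """Summarise a non-empty session given as (url, seconds_viewed) pairs."""
    n_content = sum(1 for url, _ in pages if is_content(url))
    n_auxiliary = len(pages) - n_content
    # Pages viewed longer than the average user count more towards their subject.
    weights = Counter()
    for url, seconds in pages:
        weights[topic_of(url)] += seconds / mean_view_time[url]
    total = sum(weights.values()) or 1.0
    topic_share = {t: w / total for t, w in weights.items()}
    return {
        "n_pages": len(pages),
        "content_ratio": n_content / max(n_auxiliary, 1),
        "topic_share": topic_share,
        # Sum of squared shares: 1.0 when all the weight is on one subject.
        "compactness": sum(s * s for s in topic_share.values()),
    }


def dissimilarity(a, b):
    """Equal-weight average of four gaps, each lying between 0 and 1."""
    topics = set(a["topic_share"]) | set(b["topic_share"])
    topic_gap = 0.5 * sum(
        abs(a["topic_share"].get(t, 0.0) - b["topic_share"].get(t, 0.0))
        for t in topics
    )
    size_gap = abs(a["n_pages"] - b["n_pages"]) / max(a["n_pages"], b["n_pages"])
    ratio_gap = abs(a["content_ratio"] - b["content_ratio"]) / max(
        a["content_ratio"], b["content_ratio"], 1.0
    )
    compact_gap = abs(a["compactness"] - b["compactness"])
    return (topic_gap + size_gap + ratio_gap + compact_gap) / 4.0
```

Under these assumptions a session spent entirely within one directory has a compactness of 1.0, a session split evenly between two directories has 0.5, and two identical sessions have a dissimilarity of 0. A clustering algorithm that works on pairwise dissimilarities only needs the matrix of dissimilarity(a, b) values over the sampled sessions.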

A fairly standard fuzzy clustering algorithm was used. The aim of fuzzy clustering is not to make a clear-cut decision about which cluster each session belongs to, but rather to determine the membership coefficients of each session in each cluster. This allows for some ambiguity in the data, by saying ``this session belongs for the most part to this cluster'' or ``that session is divided equally between those two clusters''. We chose a fuzzy clustering algorithm because we needed to know whether the dissimilarity measure we had built was discriminating. Fuzzy clustering tells us how fuzzy the classification is, that is, whether the chosen measure really emphasizes the differences between several user profiles or whether it highlights no characteristics and sees no difference between users.
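The notions of membership coefficient and of fuzziness can be made concrete with a short sketch. The algorithm actually used is not specified beyond ``fairly standard'', so the code below assumes the membership rule of fuzzy c-means (with a hypothetical fuzzifier parameter set to 2) and measures fuzziness with the normalized form of Dunn's partition coefficient, a commonly used index; neither is claimed to be the exact procedure used in this study.

```python
def memberships(distances, fuzzifier=2.0):
    """Turn one session's distances to each cluster prototype into
    membership coefficients that sum to 1 (fuzzy c-means rule)."""
    # A session lying exactly on a prototype belongs fully to it
    # (shared equally if it coincides with several prototypes).
    zeros = [d == 0 for d in distances]
    if any(zeros):
        return [1.0 / sum(zeros) if z else 0.0 for z in zeros]
    power = 2.0 / (fuzzifier - 1.0)
    inverse = [d ** -power for d in distances]
    total = sum(inverse)
    return [v / total for v in inverse]


def normalized_partition_coefficient(membership_rows):
    """Dunn's partition coefficient rescaled to [0, 1].

    0 means every session is shared equally among all clusters
    (the measure does not discriminate); 1 means a crisp partition.
    Requires at least two clusters.
    """
    n = len(membership_rows)
    k = len(membership_rows[0])
    coefficient = sum(u * u for row in membership_rows for u in row) / n
    return (k * coefficient - 1.0) / (k - 1.0)
```

A session at equal distance from two prototypes receives memberships of 0.5 and 0.5, the ``divided equally'' case. A coefficient close to 0 over the whole sample would suggest that the dissimilarity measure sees no difference between users, while a value close to 1 would suggest well-separated profiles.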

Experimental results

We studied the web site of a French IT consulting firm. It is organized in quite a hierarchical way. One part of the web site presents the consulting firm, its references and its special skills. The greater part of the web site (approximately 6000 pages) consists of a number of presentations, of several pages each, on a given subject. 400 sessions were selected randomly from those of the approximately 1600 users who consulted the server over a week. This experiment was repeated on several samples and over several time periods.

The clusterings obtained are of good quality. It appears that 41% of the population only look at pages of the same presentation. These users are clearly searching for something: users reaching the web site through a search engine are over-represented in these clusters, and they find what they are looking for quickly and then leave the web site immediately. However, the population that browses pages of several presentations is quite large (15% of the population) and views more pages than the average user (10.6 pages per session, as opposed to 5 pages per session on average).

It would therefore be worthwhile to link related presentations, especially on the inside pages of each presentation, since such links already exist on the root pages of presentations and are clearly not followed. Furthermore, these links could help users who have reached a page that does not correspond to their request.

Other web server logs have been studied, but as yet without great success.

Future work

Future work will focus on refining the dissimilarity measure. In addition, since no single dissimilarity measure can work for all web sites, we will study ways of adapting a generic model to each specific web site, so as to audit and transform it.

References

T. Yan, M. Jacobsen, H. Garcia-Molina and U. Dayal, ``From User Access Patterns to Dynamic Hypertext Linking'', Proceedings of the Fifth International World Wide Web Conference, 1996
O. Nasraoui, R. Krishnapuram and A. Joshi, ``Mining Web Access Logs Using a Fuzzy Relational Clustering Algorithm Based on a Robust Estimator'', Proceedings of the Eighth International World Wide Web Conference, 1999
M. Perkowitz and O. Etzioni, ``Adaptive Web Sites: Conceptual Cluster Mining'', International Joint Conferences on Artificial Intelligence, 1999
L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley Series in Probability and Mathematical Statistics, 1990