Web sites are no longer merely showcases for the companies they represent. They are actually useful, both to users and to their owners. As such, a web site should have two objectives: first, to allow users to find the information they are looking for -- or to enjoy browsing through the web site; second, to attain its owners' aims -- selling products, broadening the user base and winning the loyalty of users. How can it be known whether or not a web site attains these two objectives?
The methods currently in use -- that is, a web professional evaluating a series of the web site's characteristics, or users filling in an online questionnaire -- have three drawbacks:
On the other hand, various data mining techniques have been applied to web server logs to create adaptive web sites: web sites that automatically improve their organization and presentation by learning from user access patterns. While this aim is very broad, only specific tasks, such as index page synthesis, linking similar pages or highlighting links to popular pages, have been considered so far.
For these reasons, we present another method for auditing a web site and proposing a complete redesign of it. By mining user access logs with a fuzzy clustering algorithm, we can identify user profiles whose interpretation gives us recommendations for changing the web site. This method compensates for the three drawbacks mentioned above. By drawing a sample with a simple random algorithm, we ensure that the sample is representative of the whole population visiting the web site. By collecting data directly from the web server log, we are not subject to any psychological user bias. Because an undirected knowledge discovery algorithm is used, the results do not depend on any a priori web design that we impose.
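The sampling step can be illustrated with a minimal sketch. It assumes that sessions have already been extracted from the server log and are identified by hypothetical string identifiers; the function name `draw_session_sample` and the population and sample sizes are illustrative choices, not part of the method's specification.

```python
import random


def draw_session_sample(session_ids, sample_size, seed=None):
    """Draw a simple random sample of sessions, without replacement.

    Every session has the same probability of being selected, which is
    what makes the sample representative of the logged population.
    """
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(list(session_ids), sample_size)


# Hypothetical session identifiers standing in for sessions found in a log.
all_sessions = [f"session-{i}" for i in range(1600)]
sample = draw_session_sample(all_sessions, 400, seed=0)
```

Repeating the draw with different seeds gives several independent samples on which the stability of the resulting clusters can be compared.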
In order to use a clustering algorithm to create user profiles, we need a dissimilarity measure that determines how similar two user sessions are. We have built a dissimilarity measure based on:
A strong emphasis is placed on pages that the user looks at for longer than the average user does. This choice is less arbitrary than characterizing a session by its content pages or by the pages browsed at the end of the session.
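The emphasis described above can be sketched as follows. This is only an illustration of the idea, not the dissimilarity measure itself: it assumes that each session is available as a mapping from page to viewing time in seconds, and it interprets ``the average user'' as the mean viewing time of a page over all sessions that contain it. The function names and the sample data are hypothetical.

```python
from collections import defaultdict


def average_time_per_page(sessions):
    """Mean viewing time of each page over the sessions that contain it."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for views in sessions.values():
        for page, seconds in views.items():
            totals[page] += seconds
            counts[page] += 1
    return {page: totals[page] / counts[page] for page in totals}


def emphasized_pages(views, averages):
    """Pages that this session viewed strictly longer than the average."""
    return {page for page, seconds in views.items() if seconds > averages[page]}


# Hypothetical sessions: page -> viewing time in seconds.
sessions = {
    "s1": {"/a": 30.0, "/b": 5.0},
    "s2": {"/a": 10.0, "/b": 15.0},
    "s3": {"/a": 20.0},
}
averages = average_time_per_page(sessions)
emphasis = {sid: emphasized_pages(views, averages) for sid, views in sessions.items()}
```

A dissimilarity measure could then give more weight to the pages in each session's emphasized set when two sessions are compared; how that weighting is done is left open here.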
A fairly standard fuzzy clustering algorithm was used. The aim of fuzzy clustering is not to make a clear-cut decision for each session and determine which cluster it belongs to. It is rather to determine the membership coefficients of each session with respect to each cluster. This allows for some ambiguity in the data, by saying ``this session belongs for the most part to this cluster'' or ``that session is divided equally between those two clusters''. We chose a fuzzy clustering algorithm because we needed to know whether the dissimilarity measure we had built was discriminating. Fuzzy clustering tells us how fuzzy the classification is, that is, whether the chosen measure really emphasizes the differences between several user profiles or whether it fails to highlight any characteristics and sees no difference between users.
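One common way to quantify how fuzzy a classification is, given the membership coefficients, is Dunn's partition coefficient. The sketch below uses it as an assumption: the text does not state which index was actually used. Memberships are given as one row per session and one column per cluster, each row summing to 1; the three small matrices are made-up examples.

```python
def partition_coefficient(memberships):
    """Dunn's partition coefficient: mean of the squared memberships.

    It equals 1 for a crisp partition and 1/k when every session is
    spread equally over the k clusters.
    """
    n = len(memberships)
    return sum(u * u for row in memberships for u in row) / n


def normalized_partition_coefficient(memberships):
    """Rescale the coefficient to [0, 1]: 0 is fully fuzzy, 1 is crisp."""
    k = len(memberships[0])
    f = partition_coefficient(memberships)
    return (f - 1.0 / k) / (1.0 - 1.0 / k)


# Made-up membership matrices for two clusters.
crisp = [[1.0, 0.0], [0.0, 1.0]]   # every session in exactly one cluster
fuzzy = [[0.5, 0.5], [0.5, 0.5]]   # every session divided equally
mixed = [[0.9, 0.1], [0.5, 0.5]]   # one clear session, one ambiguous
```

A normalized value close to 0 would suggest that the dissimilarity measure sees no difference between users, while a value close to 1 would suggest well-separated profiles.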
We studied the web site of a French IT consulting firm. It is organized in quite a hierarchical way. One part of the web site presents the consulting firm, its references and its special skills. The greater part of the web site (approximately 6000 pages) consists of a number of presentations, of several pages each, on a given subject. 400 sessions were selected randomly out of approximately 1600 users who consulted the server over a week. This experiment was repeated on several samples and over several time periods. The clusterings obtained are of good quality. It appears that 41% of the population look only at pages of a single presentation: they are clearly searching for something; users reaching the web site through a search engine are over-represented in these clusters; they find what they are looking for quickly and then leave the web site immediately. However, the population that browses among pages of several presentations is quite large (15% of the population) and views more pages than the average user (10.6 pages per session, as opposed to 5 pages per session on average). It would therefore be worthwhile to link related presentations, especially on the inside pages of each presentation, since such links already exist on the root pages of presentations and are clearly not followed. Furthermore, these links could be of some help to users who have reached a page that does not correspond to their request.
Other web server logs have been studied, but as yet without great success.
Future work will be directed towards modifying the dissimilarity measure. In addition, since no single dissimilarity measure can work for all web sites, we will study ways of adapting a generic model to each specific web site, so as to audit and transform it.