The aim of web clustering is to group web documents into very identifiable classes in such a way that common web activities can be improved. However, due to the huge quantity of web documents, techniques focused on web clustering must be efficient enough to manipulate even millions of elements. Because of that, linear models are generally preferred. One such linear model is the k-means algorithm [4], which can iteratively cluster a document collection into classes.
The traditional k-means algorithm uses the vector model [2] for representing documents. In this model, given a web document collection with documents, every element is represented by a vector in a -dimensional space, typically . Each dimension stands for the frequency of term in the document , where . A dictionary with the terms can be build with the more frequent terms in the collection [6], for example.
The k-means algorithm works by finding centers of gravity , each of which will attract some elements and form a cluster. The center represents cluster . Initially, these centers are randomly selected. Then, iteratively the elements of will be assigned to some cluster. In each phase every element is associated with the nearest center according to some measure . After that, centers are recalculated to minimize the criteria:
There are many distance measures in the literature [7], but Jaccard Extended Distance has showed good clustering performance [6]. This function was used in this work.
This poster presents a variant for the representation of web documents using symbolic objects. The details of some experiments are first showed. Conclusions and future work are left for the final part.
Symbolic objects are vectors where each entry can have any type: scalar, set, interval, histogram, graph, you name it. In the particular case of this poster, the histogram representation was explored.
Each document is represented with a symbolic object with four histogram entries. These four variables correspond to frequency of terms in four sections of the HTML code: text, bold, links and title. Each symbolic object is built after the web collection is analyzed and the most frequent terms are obtained. Previously, are eliminated and Porter's stemming algorithm is applied to every word in any of those four sections.
More formally, each document in the collection is represented by the symbolic object in histogram dimensions . Each variable is a normalized histogram with or .
The distance measure used is based on the affinity index [1] and uses weights for every dimension:
|
|
We used a subset of the web collection from [3] for testing our approach. This series contains 684 documents from 4 classes. The evaluation of the resulting clusters was made using the rand index and the mutual information measures [7]. The former computes how similar is the clustering obtained to the manual clustering (also present in the web collection). The latter calculates the quality of the clusters according to how compact the cluster is. Experiments were repeated 50 times each over database.
The results for web collection clustering are presented in Table 1. The vectorial representation of this series is formed by a 684 x 200 matrix. On the other hand, each symbolic model is accompanied by the number of categories in the histogram, i.e. . Every symbolic representation appears to be slightly better than the vector representation in both indexes. Using 30 categories is a good balance between accuracy and efficiency. Adding more categories to the histograms doesn't improve much more the clustering.
It can be seen in table 1 the average time taken by the different approaches to cluster this series. The symbolic objects show how efficient such representation can be. Using 10 categories in histograms, symbolic objects are near 9 times faster than vector representation.
Table 2 shows the results for this series using an approach called global k-means [5]. This clustering method is completely deterministic and is based on the computation of good initial clusters. This algorithm iteratively builds the best clustering for web collection, using 1 cluster the first time, 2 clusters the second time, until clusters are considered. However, each time it passes throughout all individuals, making global k-means computationally costly. In table 2 it can be appreciated that the best results were obtained by the 30 categories histogram representation, which is again three times faster than the vectorial model.
Figure 1 shows two PCA or principal component analysis [1] over this series. The PCA is a dimension reduction technique. The left graphic was made using the classic representation, while the right graphic was made with symbolic representation. The inertia percentage (how much information is conserved after the transformation) of the classic PCA was 28.89%, but using symbolic PCA the inertia percentage is 62.90%. The graphic on the right doesn't show points, as in the left, but rectangles that represent the variability of concepts.
Besides these results, we also used the quality measures from [8]. The quality of the partition can be decomposed for each variable, to measure the importance of each variable to form the clustering. In one run of this series, using a symbolic object of histograms with 30 categories, the quality indexes for the different variables were: text=0.0348, link=0.0211, bold=0.0215 and title=0.0573. Meaning this that the title in web documents helps a lot in building the clustering.
We are currently working on different distance measures between histograms and a variant of k-means algorithm to take into account what is called strong forms to better clustering a document collection.
In the future, we would like to extend the representation of the symbolic object used in this poster to include more information about the document.