In this paper, we propose a novel link-based similarity measure, called PageSim. Contrast to SimRank, our method can measure similarity between any two web pages, whereas SimRank cannot in some cases.
SimRank is a fixed point of the recursive definition: two
pages are similar if they are linked to by similar pages.
Numerically, for any web page and
, this is specified by defining
and
(1) |
Unfortunately, the result of SimRank is not convincing in some cases. In one case, if one of two web pages has no inlink, then the SimRank score of them is zero by definition, which means they are not similar. However, this is not always true. For example, in Figure 1 of section 4, has no inlink, but it is clear that both and have some similarity with it for they are linked to by . In another case (also in Figure 1), SimRank concludes that and are not similar. In fact, obviously and indeed have some similarity, for they link to each other.
PageSim can be considered as an extension of cocitation algorithm, in which the similarity score between two web pages is defined by the number of inlink neighbors that they have in common. Actually, on the Web, not all links are equally important. For example, if the only common neighbor of page and is the Yahoo home page [1], whereas page and have several common neighbors from obscure places, then which page is more similar to page , page or page ? As we know, hyperlink from web page u to v can be considered as a recommendation of page v by page u [2], and the more important a web page is, the more important its recommendation is. Evidently, the reasonable answer should be page , since the Yahoo home page is much more ``important". In another perspective, the action of recommendation can be considered that page u propagates some kind of similarity to page v, and the more pages it links to, the less similarity it should propagate to each of these pages. Therefore, it is also reasonable to think that the Yahoo home page has some kind of similarity with both page and page .
Since PageRank [5] is one of the most prominent ranking algorithm which assigns global ranking scores to all pages on the Web, we take the PageRank score of a web page as the importance (weight or similarity score) of it in the PageSim method. The intuitions to PageSim model is described as follows, and the mathematical definitions will be given later.
At the beginning, each web page only contains its own similarity score, and then each web page propagates its own similarity score to its outlink neighbors, receiving and propagating the similarity scores of others at the same time. After the propagation, each page contains its own similarity score as well as the similarity scores of others. These scores are stored in a vector called the similarity vector of this page. Then we can calculate the PageSim score of each pair of pages by summing their common similarity scores up.
We model the Web as a directed graph with vertices representing web pages and directed edges representing hyperlinks between web pages. Let and denote the set of inlink pages and outlink pages of respectively, for any . Let denotes a sequence of vertices such that and are distinct, it is called a path from to . Let denotes the length of path , define . Let denotes the set of all possible paths from page to .
A good evaluation of PageSim is difficult without performing extensive user studies or having a reliable external measure of similarity to compare against. In this section, we give a simple example in which PageSim is compared with SimRank to illustrate the performance of PageSim.
For a given graph , where (see Figure 1). Let . We have
The PageSim score matrix is
Let denotes the top
similar pages to page
(excluding ). Let , we
have
The SimRank score matrix of graph is
Thus, we have
We can see that the results of PageSim and SimRank are different. First, SimRank shows that there's no page similar to . While PageSim shows that is most similar to , which is more reasonable. Because the fact that links to implies ``considers" has some level of similarity with it. Secondly, SimRank shows is not similar to , while PageSim shows that is not true. Obviously, and are similar, for they link to each other. Moreover, PageSim considers that is most similar to . SimRank shows is most similar to , for they have a common inlink page . We believe PageSim is the winner in this situation because the ``link to each other" relationship really implies stronger similarity than that of the ``common inlink" relationship.
This paper introduces PageSim, a novel link-based similarity measure. Based on the strategy of PageRank score propagation, PageSim is capable of measuring similarity between any two web pages.
There are numbers of avenues for future work. Foremost, we must address the efficiency issue. For example, the computing time of PageSim is expected to be greatly reduced by limiting the radius of propagation, i.e., the path length of propagation. Especially, when the radius is reduced to 1, PageSim becomes a ``weighted cocitation algorithm". Finally, extensive evaluations of PageSim are needed.
[1] http:/www.yahoo.com.
[2] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, and S. Raghavan. Searching the Web. ACM Transactions on Internet Technology, 1(1):2-43, 2001.
[3] J. Dean and M. R. Henzinger. Finding related pages in the World Wide Web. Computer Networks (Amsterdam, Netherlands: 1999), 31(11-16):1467-1479, 1999.
[4] G. Jeh and J. Widom. SimRank: a measure of structural-context similarity. In KDD '02: Proceedings of the eighth ACM SIGKDD, pages 538-543, New York, NY, USA, 2002. ACM Press.
[5] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: bringing order to the Web. Technical report, Stanford Digital Library Technologies Project, 1998.