In this paper, we propose a novel link-based similarity measure, called PageSim. Contrast to SimRank, our method can measure similarity between any two web pages, whereas SimRank cannot in some cases.
SimRank is a fixed point of the recursive definition: two
pages are similar if they are linked to by similar pages.
Numerically, for any web page and
, this is specified by defining
and
![]() |
(1) |
Unfortunately, the result of SimRank is not convincing in some
cases. In one case, if one of two web pages has no inlink, then
the SimRank score of them is zero by definition, which means they
are not similar. However, this is not always true. For example,
in Figure 1
of section 4, has no inlink, but it is
clear that both
and
have some similarity with it
for they are linked to by
. In
another case (also in Figure 1), SimRank
concludes that
and
are not similar. In fact, obviously
and
indeed have some similarity, for they link to each
other.
PageSim can be considered as an extension of cocitation
algorithm, in which the similarity score between two web pages is
defined by the number of inlink neighbors that they have in
common. Actually, on the Web, not all links are equally
important. For example, if the only common neighbor of page
and
is the Yahoo home page [1],
whereas page
and
have several common neighbors from obscure places,
then which page is more similar to page
, page
or page
? As we know, hyperlink from web page u to
v can be considered as a recommendation of page v
by page u [2], and the
more important a web page is, the more important its
recommendation is. Evidently, the reasonable answer should be
page
, since the Yahoo home page is
much more ``important". In another perspective, the action of
recommendation can be considered that page u
propagates some kind of similarity to page v, and
the more pages it links to, the less similarity it should
propagate to each of these pages. Therefore, it is also
reasonable to think that the Yahoo home page has some kind of
similarity with both page
and page
.
Since PageRank [5] is one of the most prominent ranking algorithm which assigns global ranking scores to all pages on the Web, we take the PageRank score of a web page as the importance (weight or similarity score) of it in the PageSim method. The intuitions to PageSim model is described as follows, and the mathematical definitions will be given later.
At the beginning, each web page only contains its own similarity score, and then each web page propagates its own similarity score to its outlink neighbors, receiving and propagating the similarity scores of others at the same time. After the propagation, each page contains its own similarity score as well as the similarity scores of others. These scores are stored in a vector called the similarity vector of this page. Then we can calculate the PageSim score of each pair of pages by summing their common similarity scores up.
We model the Web as a directed graph with vertices
representing web pages
and directed
edges
representing hyperlinks between
web pages. Let
and
denote the set of inlink pages and
outlink pages of
respectively, for any
. Let
denotes a sequence
of vertices
such that
and
are distinct, it is called a
path from
to
. Let
denotes the length of path
, define
. Let
denotes the set of all
possible paths from page
to
.
A good evaluation of PageSim is difficult without performing extensive user studies or having a reliable external measure of similarity to compare against. In this section, we give a simple example in which PageSim is compared with SimRank to illustrate the performance of PageSim.
For a given graph , where
(see
Figure 1).
Let
. We
have
The PageSim score matrix is
Let denotes the top
similar pages to page
(excluding
). Let
, we
have
The SimRank score matrix of graph is
Thus, we have
We can see that the results of PageSim and SimRank are
different. First, SimRank shows that there's no page similar to
. While PageSim shows that
is most similar to
, which is more reasonable.
Because the fact that
links to
implies
``considers"
has
some level of similarity with it. Secondly, SimRank shows
is not similar to
, while PageSim shows that is
not true. Obviously,
and
are similar, for they link
to each other. Moreover, PageSim considers that
is most similar to
. SimRank shows
is most similar to
, for they have a common
inlink page
. We believe PageSim is
the winner in this situation because the ``link to each other"
relationship really implies stronger similarity than that of the
``common inlink" relationship.
This paper introduces PageSim, a novel link-based similarity measure. Based on the strategy of PageRank score propagation, PageSim is capable of measuring similarity between any two web pages.
There are numbers of avenues for future work. Foremost, we must address the efficiency issue. For example, the computing time of PageSim is expected to be greatly reduced by limiting the radius of propagation, i.e., the path length of propagation. Especially, when the radius is reduced to 1, PageSim becomes a ``weighted cocitation algorithm". Finally, extensive evaluations of PageSim are needed.
[1] http://www.yahoo.com.
[2] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, and S. Raghavan. Searching the Web. ACM Transactions on Internet Technology, 1(1):2-43, 2001.
[3] J. Dean and M. R. Henzinger. Finding related pages in the World Wide Web. Computer Networks (Amsterdam, Netherlands: 1999), 31(11-16):1467-1479, 1999.
[4] G. Jeh and J. Widom. SimRank: a measure of structural-context similarity. In KDD '02: Proceedings of the eighth ACM SIGKDD, pages 538-543, New York, NY, USA, 2002. ACM Press.
[5] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: bringing order to the Web. Technical report, Stanford Digital Library Technologies Project, 1998.