Link-based ranking methods like PageRank [7] play a crucial role in today's search engines. In this context, such methods indicate the importance of individual Web pages based on the current state of the Web graph. This current state contains all pages and links that were added but not yet removed and is thus the result of the Web's entire evolution. However, methods like PageRank do not properly reflect the evolutionary trajectory of the Web (i.e., links and pages recently removed or added), which is substantial as reported in [2,5,6]. As a consequence, PageRank and the like are not appropriate to serve information needs on timelines and trends as the following example demonstrates.
We use a bibliographic network derived from the Digital Bibliography & Library Project (http://dblp.uni-trier.de) as a showcase here, since obtaining an adequate Web dataset would involve frequent crawling of a significant fraction of the Web. Let us, on the one hand, consider an information need for seminal publications in database research. In our bibliographic network, PageRank identifies E. F. Codd's A Relational Model of Data for Large Shared Data Banks as the most important publication, which is reasonable given this information need. On the other hand, the information need could be for publications in databases that are not yet very important but currently gain a lot of importance; a scenario for which PageRank fails as the example demonstrates. Figure 1 plots PageRank scores of the aforementioned publication and Agrawal et al.'s Mining Association Rules between Sets of Items in Large Databases for the years 1990 through 1999. Although, for any of the depicted times Codd's paper is ahead in terms of importance by an order of magnitude at least, its importance score is close to stagnation. In contrast, the other paper improves its importance score in the considered period by a factor of more than ten.
If the second information need arises at any point between 1993 and 1999, Agrawal et al.'s paper could be identified as a better result by means of the trend contained in its time series of PageRank scores.The BuzzRank method proposed in this work builds on this idea. It analyzes time series of importance scores and quantifies the contained trends based on a growth model of importance scores. Thus, for instance, in a bibliographic network, the method identifies those publications that have significantly increased their importance in a specific time interval, which -more colloquially- are the publications that caused significant buzz in that period. Therefore, BuzzRank's objectives differ from earlier related work [1,3,8] that sought to improve link-based importance ranking by means of temporal features. However, BuzzRank is complementary rather than a replacement to PageRank, and thus seeks to serve information needs as the one above.
BuzzRank exploits the fact that importance scores co-evolve with the Web graph and considers the following time series of importance scores for individual pages.
Let denote the
graph snapshot at time consisting
of the set of nodes and the
set of edges . The vector of
PageRank scores computed on the graph is referred to as . Since PageRank scores are not comparable across
graphs from different points in time (with different graph
sizes), a new kind of normalization problem arises that we solve
as follows. The vector is
normalized dividing by
For an individual node
we consider the time series of importance scores
The growth of PageRank scores over time has been modeled by
Cho et al. [4]
using the logistic growth model (aka. Verhulst growth
model) - a specific case of the following generic growth
model:
We assume for the growth rate that it is time-invariant as within
. Later
this assumption is empirically substantiated. Using the
time-invariant growth rate we obtain the following exponential
growth model for times in the considered time
interval
Since the series of observation times
does not necessarily include
, the value
may be
unknown. Therefore, an additional parameter
is
introduced to the model, so that we obtain the final
model
Using the method of least squares we fit the model to
the observed time series values, i.e., we minimize
This growth rate quantifies the trend in the time series of the node's importance scores and thus, as we argued in the introduction, is a good indicator for the buzz caused by the node in the considered time-interval. BuzzRank provides its final ranking assigning every node its estimated growth rate as a score.
Since no adequate Web dataset is available (i.e., time series of periodically repeated Web crawls), we use the free DBLP bibliographic dataset for our preliminary experiments. We only consider the period from 1989 through 1999 as this period has the most densely recorded citations in DBLP. In the graph that we derive from DBLP nodes represent publications and edges represent citations.
As input to BuzzRank we computed PageRank vectors for the graphs at times corresponding to the begins of the years 1989 through 1999 yielding a total of eleven observations per time series. The damping factor for the PageRank computations was set to .
In the first of our experiments, we empirically analyze our assumption that can be considered as time-invariant. Note that if is nearly constant over a period of successive observations, there must be a strong linear relationship observable between and the log-transformed time series values . Therefore, we computed correlation coefficients for varying across all nodes. Since we are only interested in the strength of the linear relationship but not in its direction, we computed average absolute correlation coefficients for different values of . For three successive observations (i.e., ) this yields a value of . For four up to seven successive observations, slightly lower but consistent values about are observed. Thus, there is a strong linear relationship and therefore assuming time-invariance for is reasonable.
As a second experiment we computed rankings using BuzzRank for
two year intervals (i.e., three successive observations). For
most publications in DBLP only the year of publication is known,
consequently granularities more fine-grained than the year-level
are not meaningful. In Table 1 the
publications top-ranked by BuzzRank for the two year intervals
are given.
The results indicate that BuzzRank indeed brings publications related to hot topics in the respective period to the top. For the intervals and , for instance, publications related to the Web and XML are ranked at the top respectively. In marked contrast, the use of PageRank resulted in the publication by E. F. Codd mentioned in the Motivation to be the top-ranked item in each time interval.
[1] E. Amitay, D. Carmel, et al.
Trend Detection Through Temporal Link Analysis.
JASIST, 55(14):1270-1281, 2004.
[2] Z. Bar-Yossef, A. Z. Broder, et al.
Sic Transit Gloria Telae: Towards an Understanding of the Web's
Decay.
WWW '04.
[3] K. Berberich, M. Vazirgiannis, et al.
Time-aware Authority Ranking.
Internet Mathematics, 2006.
[4] J. Cho, S. Roy, et al.
Page Quality: in Search of an Unbiased Web Ranking.
SIGMOD '05.
[5] D. Fetterly, M. Manasse, et al.
A large-scale study of the evolution of web pages.
Software: Practice and Experience, 34(2):213-237,
2004.
[6] A. Ntoulas, J. Cho, et al.
What's New on the Web?: The Evolution of the Web from a Search
Engine Perspective.
WWW '04.
[7] L. Page, S. Brin, et al.
The PageRank Citation Ranking: Bringing Order to the Web.
Stanford Tech. rep., 1998.
[8] P. S. Yu, X. Li, et al.
On the Temporal Dimension of Search.
WWW '04.