Traditional link analysis approaches assume equal weights assigned to different links and pages. In original PageRank formulation, the user model assumes that the user has equal probability to follow each link from a given page, thus the score of a page equally affects all of the pages it points to. It also assumes that the probability for a user to go to a URL directly without following a link is the same for all URLs. In this paper, we investigate different weighting schemes that take into account the probability to go to a page directly (by typing or using bookmarks), as well as the relative probability to follow a link from a given page. Both of these probabilities can be approximated from usage logs if they are available. We introduce a natural extension to the original PageRank formulation that we will call Usage aware PageRank (UPR). The new formulation combines static link structure graph with the usage graph that will be obtained via web logs or other means. It is also quite general; how much emphasis will be given to the graphs is controlled by a parameter. If the parameter is set to zero, the algorithm becomes equivalent to the original PageRank, if it is set to one, the emphasis shifts to the usage graph, and for values in between, both of the graphs will be used with weights specified by the parameter. UPR is also quite inexpensive. After a onetime precalculation step, an iteration of UPR takes about the same time as a PageRank iteration.
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval — search process, information retrieval, information filtering
Algorithms, performance, experimentation
Pagerank extension, usage statistics, link analysis, UPR, usage aware pagerank
Original PageRank formulation focuses on the static structure of the site, completely ignoring usage. However, users visiting web pages either directly or by following a link can be considered as an implicit indication of the importance of these pages. Usage aware PageRank formulation is an extension to the original PageRank which makes use of usage statistics as well as the static structure. The PageRank formula given in the original paper is as follows:
|
(1) |
where T1 ¼Tn are pages pointing to page A, the parameter d is the damping factor, and C(A) is the number of links going out of page A. A user will access a page either following a link, or by typing the URL (or using bookmarks). We can think of the damping factor as the probability that a user will follow a link, instead of typing the URL. Note that, PageRanks of all pages will add up to the total number of pages (n) using the above formula. By dividing (1-d) term by n, we can obtain a true probability distribution for PageRanks. The formula is based on a random walk model, if the user chooses to follow a link, all the links on the page the user is looking at have equal chances of being clicked on. Similarly, if the user chooses not to follow a link and to start over, all the pages have equal chance of being accessed. In some sense, PageRank of a page shows the probability that a random walk user will visit a particular page. In this formulation, a page's PageRank contributes equally to the PageRanks of the pages it points to, normalized by total number of outgoing links from that page.
The random walk user in the original PageRank formulation does not differentiate between links on a given page while deciding which one to follow, nor does it differentiate between pages when it decides to start from a new page. Consider a scenario where we have a web page, P1, which is bookmarked by various users, therefore visited more often than other pages. Consider a link on that page, L12, that points to page P2, which is followed by majority of its visitors, but other links on page P1 are hardly used. Intuitively, the probability of users visiting page P1 is higher than the probability of users visiting other pages. Similarly, PageRank of P1 should contribute more to the PageRank of P2, since L12 is followed more often than other links on P1. Obviously, the random walk model doesn't capture these differences. We extend the PageRank formulation by taking usage statistics into account. The Usage aware PageRank, UPR, of a page is defined as follows:
|
(2) |
where TSi are the pages Ti that contain hyperlinks (structural links) to page A, C(TSi) is the number of hyperlinks on Ti, TUi are the pages that users followed a link from to go to page A, W(TUi® A) is the total weight of link traversals from page Ti to page A, W(TUi) is the total weight of all link traversals from page Ti, Wno-follow(A) is the total weight of accesses to page A without following a link divided by total weight of all such accesses, d is the damping factor, n is the total number of pages, and a is the emphasis given to the usage statistics. The simplest way of assigning weights to page / link accesses is to use counts obtained from the logs, in which case, UPR with emphasis only on the usage graph, will have similar interpretation as PageRank, but with accurate probability estimates. When a is equal to zero, UPR becomes equivalent to PR. In order to avoid sink effects, i.e., guarantee that UPR of all pages add up to 1, for pages that don't contain any hyperlinks, it is possible to create artificial links to all other pages.
In some sense, UPR is based on a "biased random walk model". The biased random walk user distinguishes between pages as well as links. He/she prefers some links over others (perhaps the chosen link is more visible, placed towards the top, or has a better anchor text), and prefers some pages over others (perhaps the page is a good "hub" to start from, it is in his bookmarks, or has a short and easy to type URL).
Note that, once a value of a is fixed, the formula can further be simplified by precomputing the matrix that will be used in the iterations, in which case number of computations required for each subsequent iteration will be about the same as a regular PageRank iteration. The formula can be rearranged as follows:
|
(3) |
Note that only the UPR(Ti) term in the summation changes during iterations, once a is fixed, the remaining part can be incorporated in a precalculated matrix.
In our formulation, we used an emphasis factor, a, that represents the relative importance of the usage graph to the structure graph. A slight variation of the formulation can be obtained by using two separate emphasis factors for the two components of the formula corresponding to the probability that a user will type a URL, and the probability that a user will follow a link on a given page. We can adjust the weights of the structure graph and the usage graph separately for these two components. Deemphasizing the usage statistics about going to a page directly may be desirable in some cases, for instance, when we want to reduce the effects of browser home pages (which may get unintentional hits every time a user opens up the browser), and when we do not have full usage statistics for a subset of pages. Note that, weights of links going out of a given page is normalized for each page in our formulation. If we do not have usage statistics about the links of a given page, all of them can be treated equally. However, if we assign initial weights reflecting the number of hits on a page without following a link, portions of the dataset that do not contain usage statistics will be penalized. Note that a similar situation occurs if the usage sampling rate for different domains or sites are different; these pages will be assigned low initial scores. By splitting the parameter a into two, it may be possible to reduce such undesirable effects in presence of partial or uneven information, without affecting the second portion of the formula. The modified formula is very similar to the original formula, and is as follows:
|
(4) |
where a1 and a2 are the new emphasis parameters, separated for the two portions of the formula. Similarly, iterations using this formula too can be optimized as discussed before (for a given value of a2).
There are certain types of user behavior that may reduce the accuracy of the probability estimates obtained from the logs. Every time a user opens up a browser, the browser's home page gets a hit, even if the user's intent is to visit another page. Also, equal number of hits to a page from several users should be weighted higher than a user hitting the same page several times, so that our estimates better reflect the behavior of all users. Similarly, several users following a link should be weighted higher than a user following the same link several times. In the UPR formulation, instead of using counts of links followed, and counts of hits to pages with empty referrer fields, we can deemphasize successive accesses to pages / links from the same user in a time window by using modified counts which is a log transform of the counts:
|
(5) |
Note that if there were no accesses to a particular page, the modified count would be lg(1+0)=0, if there was a single access, the modified count would be lg(1+1)=1, and the modified count would lower the weight of subsequent accesses from the same user. After the transformation, adding up all the modified counts from all users for all time windows will give us estimates that better reflect overall user behavior.
We applied regular PR and UPR with simple counts as well as modified counts to our department's web site for which we have full usage statistics. Although characteristics for intranet vs internet settings can be significantly different, preliminary results show that UPR is promising on the site level and helps alleviate some of the problems associated in applying PageRank on a single site. These results will be presented in the poster session. Some of the results and a longer paper is also available as a the technical report at http://www.cs.umn.edu, report number 03-010.