Many systems have been proposed to find the Web document
association, such as the companion and
co-citation algorithms proposed by Dean and
Henzinger[1] and the
Netscape algorithm[2]
used to implement the What's Related? functionalities.
In contrast, we are interested in inducing
the reasons why they are associated.
We focus on the problem "for a given set of pages, find
a list of Web pages which reflect the association of such
two pages." We now use an example to illustrate the problem
statement and overview of our approach.
Figure 1:
Link structure connecting and associating the Web pages of
W.Li and D.Agrawal
In Figure 1, we show a subset of links between two personal Web pages W.Li and D.Agrawal (both are highlighted in the figure). The purpose of each link is indicated as its label. For example, W.Li graduated from Northwestern University; whereas D.Agrawal and P.Scheuermann both graduated from SUNY. The associations between W.Li and D.Agrawal are implicitly expressed in the link structure connecting W.Li and D.Agrawal by many Web authors associated with W.Li and D.Agrawal. Below, we enumerate some associations that can be derived from this graph, and some possible interpretations for such associations:
Obviously, the links connecting these pages are not intended to express such associations and the authors were not coordinated to make the links and anchors semantics consistent across the Web. However, we can assume and use the following two intuitions:
Note that a page with a higher connectivity (i.e. more incoming links and outgoing links) is more likely to be included in more paths; consequently, such a page is more likely to be ranked higher according to the above criteria. This is consistent with the principle of topic distillation[3, 4].
Links have been used in many fields to associate documents. In this paper, we consider three type of relationships:
The scoring functions are used to measure how well a page in inducing the association of the given pages. The first scoring function is used to measure the significance of a node in both the connectivity factor and the distance factor.
To derive metrics for nodes selection, one straight forward
candidate metric, that adjusts connectivity scores by distance
would be
We are also interested in nodes appearing in many short
paths but not many long paths.
Based on the previous scoring function, this type of nodes
may not receive high scores. However, such exceptions
are also interesting since such nodes are located in a more
isolated sub-graph. Thus we design another scoring function
for exceptions as follows:
In contrast to other work focusing on finding related pages for a given page, we deal with the problem "for a given pair of pages, find a list of Web pages which reflect the association of such two pages." We design two scoring functions for measuring different aspects of page associations. We have developed and implemented the exploration algorithms and experimentally evaluate the algorithms on real Web data. Future work includes conducting experiments on real Web data to evaluate the effectiveness of content focused algorithm. We are also interested in applying this work to workflow management to identify the bottle neck and critical points in the workflow.