Exploring Link Topology for Associating Web Pages

Quoc Vu and Wen-Syan Li

C&C Research Laboratories, NEC USA, Inc.
110 Rio Robles M/S SJ100, San Jose, CA 95134, USA
Email: {qvu,wen}@ccrl.sj.nec.com

Introduction

Many systems have been proposed to find the Web document association, such as the companion and co-citation algorithms proposed by Dean and Henzinger[1] and the Netscape algorithm[2] used to implement the What's Related? functionalities. In contrast, we are interested in inducing the reasons why they are associated. We focus on the problem "for a given set of pages, find a list of Web pages which reflect the association of such two pages." We now use an example to illustrate the problem statement and overview of our approach.

Figure 1: Link structure connecting and associating the Web pages of `W.Li` and `D.Agrawal`

In Figure 1, we show a subset of links between two personal Web pages W.Li and D.Agrawal (both are highlighted in the figure). The purpose of each link is indicated as its label. For example, W.Li graduated from Northwestern University; whereas D.Agrawal and P.Scheuermann both graduated from SUNY. The associations between W.Li and D.Agrawal are implicitly expressed in the link structure connecting W.Li and D.Agrawal by many Web authors associated with W.Li and D.Agrawal. Below, we enumerate some associations that can be derived from this graph, and some possible interpretations for such associations:

Association 1: Web8 paper page appears in a path of distance of 2 connecting the pages W.Li and D.Agrawal. Therefore, W.Li and D.Agrawal may be associated due to a co-authored paper.
Association 2: Y.Wu page is on two paths related to NEC Research Laboratories, each of distance 4. W.Li and D.Agrawal may be associated due to the fact they both supervised Y.Wu at different occasions or they participate in the same project at NEC.

Framework

Obviously, the links connecting these pages are not intended to express such associations and the authors were not coordinated to make the links and anchors semantics consistent across the Web. However, we can assume and use the following two intuitions:

Path length factor: Pages on a shorter path between W.Li and D.Agrawal are stronger indicators to reflect why the W.Li and D.Agrawal pages are associated than those on a longer path.
Connectivity factor: Pages which appear on more paths should be stronger indicators than others to reflect why the W.Li and D.Agrawal pages are associated.

Note that a page with a higher connectivity (i.e. more incoming links and outgoing links) is more likely to be included in more paths; consequently, such a page is more likely to be ranked higher according to the above criteria. This is consistent with the principle of topic distillation[3, 4].

Links have been used in many fields to associate documents. In this paper, we consider three type of relationships:

co-citation: two documents are relevant to each other if they are linked by a common document.
social filtering: two documents are relevant to each other if they link to a common document.
transitivity: two documents are relevant to each other if there is a path from one to another.

Scoring Functions

The scoring functions are used to measure how well a page in inducing the association of the given pages. The first scoring function is used to measure the significance of a node in both the connectivity factor and the distance factor.

To derive metrics for nodes selection, one straight forward candidate metric, that adjusts connectivity scores by distance would be

where paths(A,B,v) is the set of paths between the given pages, A and B, that pass through a candidate v, and length(p) is the length of the path p.

We are also interested in nodes appearing in many short paths but not many long paths. Based on the previous scoring function, this type of nodes may not receive high scores. However, such exceptions are also interesting since such nodes are located in a more isolated sub-graph. Thus we design another scoring function for exceptions as follows:

Concluding remarks

In contrast to other work focusing on finding related pages for a given page, we deal with the problem "for a given pair of pages, find a list of Web pages which reflect the association of such two pages." We design two scoring functions for measuring different aspects of page associations. We have developed and implemented the exploration algorithms and experimentally evaluate the algorithms on real Web data. Future work includes conducting experiments on real Web data to evaluate the effectiveness of content focused algorithm. We are also interested in applying this work to workflow management to identify the bottle neck and critical points in the workflow.

References

Jeffrey Dean and Monika Henzinger, Finding Related Pages in the World Wide Web, In Proceedings of the 8th World-Wide Web Conference, Toronto, Canada, May 1999.
Netscape Communications Corporation, What's Related web page, Information available at http://home.netscape.com/netscapes/related/faq.html.
Jon M. Kleinberg, Authoritative sources in a hyperlinked environment, In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pages 668-677, January 1998.
Wen-Syan Li and Selcuk Candan, Integrating Content Search with Structure Analysis for Hypermedia Retrieval and Management, To appear in ACM Computing Survey, 2000.