a IBM Almaden Research Center K53, 650 Harry Road, San Jose, CA 95120, U.S.A.
b Computer Science Division, Soda Hall, University of California, Berkeley, CA 94720, U.S.A.
c Department of Computer Science, Upson Hall, Cornell University, Ithaca, NY 14853, U.S.A.
The subject of this paper is the design and evaluation of an automatic resource compiler: a system that, given a topic that is broad and well-represented on the Web, seeks out and returns a list of Web resources that it considers the most authoritative for that topic. Our system is built on an algorithm that performs a local analysis of both text and links to arrive at a "global consensus" of the best resources for the topic. We describe a user study comparing our resource compiler with commercial, human-compiled or human-assisted services. To our knowledge, this is one of the first systematic user studies comparing the quality of multiple Web resource lists compiled by different methods. Our study suggests that, although our resource lists are compiled wholly automatically (and despite being presented to users without any embellishments in "look and feel" or presentation context), they fare relatively well compared to the commercial human-compiled lists.
When Web users seek definitive information on a broad topic, they frequently go to a hierarchical, manually-compiled taxonomy such as Yahoo!, or a human-assisted compilation such as Infoseek. The role of such a service is to provide, for any broad topic, a list of high-quality resources on that topic. In this paper we describe ARC (for Automatic Resource Compiler), a part of the CLEVER project on information retrieval at the IBM Almaden Research Center. The goal of ARC is to automatically compile a resource list on any topic that is broad and well-represented on the Web. By using an automated system to compile resource lists, we achieve faster coverage of the available resources and of the topic space than a human can (or, alternatively, can update and maintain more resource lists more frequently). As our studies with human users show, the loss in quality is not significant compared to manually or semi-manually compiled lists.
The use of links for ranking documents is similar to work on citation analysis in the field of bibliometrics (see, e.g., [White and McCain]). In the context of the Web, links have been used for enhancing relevance judgments by [Rivlin, Botafogo, and Shneiderman] and [Weiss et al.]. They have been incorporated into query-based frameworks for searching by [Arocena, Mendelzon, and Mihaila] and by [Spertus].
Our work is oriented in a different direction: namely, to use links as a means of harnessing the latent human annotation in hyperlinks, so as to broaden a user's search and focus it on a type of "high-quality" page. Similar motivation arises in the work of [Pirolli, Pitkow, and Rao]; [Carriere and Kazman]; and Brin and Page [BrinPage97]. Pirolli et al. discuss a method based on link and text-based information for grouping and categorizing WWW pages. Carriere and Kazman use the number of neighbours of a page in the link structure (without regard to the directions of links) as a method of ranking pages; and Brin and Page model Web browsing as a random walk in order to assign a topic-independent "rank" to each page on the WWW, which can then be used to re-order the output of a search engine. (For a more detailed review of search engines and their ranking functions, including some based on the number of links pointing to a Web page, see Search Engine Watch [SEW].) Finally, the link-based algorithm of Kleinberg [Kleinberg97] serves as one of the building blocks of our method; this connection is described in more detail in Section 2 below, where we explain how we enhance it with textual analysis.
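For concreteness, the following Python sketch shows the core hub/authority iteration of Kleinberg's algorithm, the link-based building block referred to above. It is only an illustration: the graph representation, function name, and iteration count are our own assumptions, not part of ARC itself.

```python
def hits(pages, links, iterations=50):
    """Minimal sketch of Kleinberg's hub/authority iteration.

    pages: iterable of page identifiers (assumed representation).
    links: set of (source, destination) pairs representing hyperlinks.
    """
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority score sums the hub scores of pages that link to it.
        auth = {p: sum(hub[s] for (s, d) in links if d == p) for p in pages}
        # A page's hub score sums the authority scores of pages it links to.
        hub = {p: sum(auth[d] for (s, d) in links if s == p) for p in pages}
        # Normalize both score vectors so they converge rather than grow unboundedly.
        for scores in (auth, hub):
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for p in scores:
                scores[p] /= norm
    return auth, hub
```

In this basic formulation every link contributes equally to the scores; Section 2 describes how our method departs from it by bringing textual analysis into the computation.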