nextupprevious
Next:Current HITS-based Algorithms andUp:Improvement of HITS-based AlgorithmsPrevious:Improvement of HITS-based Algorithms

Introduction

Kleinberg's hypertext-induced topic selection (HITS) algorithm [16] is a very popular and effective algorithm to rank documents based on the link information among a set of documents. The algorithm presumes that a good hub is a document that points to many others, and a good authority is a document that many documents point to. Hubs and authorities exhibit a mutually reinforcing relationship: a better hub points to many good authorities, and a better authority is pointed to by many good hubs. To run the algorithm, we need to collect a base set, including a root set and its neighborhood, the in- and out-links of a document in the root set.

Because the HITS algorithm ranks documents only depending on the in-degree and out-degree of links, it will cause problems in some cases. For example, Bharat [1] identified two problems: mutually reinforcing relationships between hosts and topic drift. Both problems can be solved or alleviated by adding weights to documents. The first problem can be solved by giving the documents from the same host much less weight [1], and the second problem can be alleviated by adding weights to edges based on text in the documents or their anchors [1,2]. Bharat [1] showed that the simple modification of HITS algorithm for the first problem achieved a remarkable better precision, while further precision can be obtained by adding content analysis. Furthermore, by manipulating the weights of documents, Chang [3] created customized authority by adding more weights to the documents that users are interested in. Farahat [10] improved the precision through Web log records which store the number of users' visits to the documents.

Changing the weight of a document has been shown to be an effective way to improve the precision of HITS algorithm, but these methods need either content analysis [1,2], user feedback [3], or Web log records [10]. As we know, content analysis usually takes a long time, and it is almost impossible to get users' feedback or visiting times for most Web documents. In the paper, we propose a new way to improve HITS-based algorithms' precision by only considering the link information in the base set.

Disregarding the time it may take, combining connectivity and content analysis has been proven to be useful in improving precision. But the similarity measure currently used is vector space model [1] or just a simple occurrence frequency of the query words in the text around the anchors [2], which may not be the best method to evaluate the relevance of Web documents because most queries submitted to search engines are short, consisting of three terms or less [15]. Although we can expand the short queries by adding more related words, expanding itself can cause topic drift.

In this paper, we statistically compare the performance of four relevance scoring methods when they are combined with HITS-based algorithms. Three of them are variations of methods widely used in the traditional information retrieval field. They are cover density ranking (CDR) [7], Okapi similarity measurement (Okapi) [13], and vector space model (VSM) [21]. In addition, the fourth one is called three-level scoring method (TLS) [19], which mimics commonly used manual similarity measuring approaches. In comparing the performance of the four relevance scoring methods, we apply two statistical metrics: expected loss [4] and probably approximately correct [5].

This paper is organized as follows. In Section 2, we discuss the limitation of current HITS-based algorithms. In Section 3, we propose a new method to improve the HITS algorithm by assigning appropriate weights to in-links of documents in the root set. In Section 4, we present the ways of combining four different relevance scoring methods with HITS-based algorithms. In Section 5, we show the experimental results. Finally, in Section 6, we summarize the paper.


nextupprevious
Next:Current HITS-based Algorithms andUp:Improvement of HITS-based AlgorithmsPrevious:Improvement of HITS-based Algorithms
2002-02-18