Three-Level Scoring Method (TLS)

Next:ExperimentsUp:Combining the HITS-based AlgorithmsPrevious:Cover Density Ranking (CDR)

Three-Level Scoring Method (TLS)

The TLS method is developed to take human expectation of search results into account and is modeled after existing manual approaches. Many existing manual methods use the following criteria to assign relevance scores [8,9,17]:

Relevant links are links that are related to information needs of a query or have many links to other Web pages which may be useful to the query. They get score 2.
Slightly relevant links are the pages that are partially related to a query. They contain too short a definition to be useful, or contain the introduction of company products that have some technical solutions to a problem relevant to the query, or contain some description to a related textbook. They are given score 1.
Duplicate links are the pages that appear in the returned links with the same URL more than one time, and the links that differ in that one has index.htm or index.html as the suffix and the other does not. They are given score 0. Mirror sites are not counted as duplicates.
Inactive links are the links that give error messages like file not found (404), forbidden, or server not responding (603) errors. These links are given score 0.
Irrelevant links are the links containing irrelevant information to the query. They get score 0.

TLS method computes the relevance of a Web page to a query in the following two steps:

Given a query phrase with terms and a Web page , a raw score is calculated as:

$\begin{displaymath}A (q, x) = \frac{t_n \cdot k^{n-1} + t_{n-1} \cdot k^{n-2} + \cdots + t_1}{k^{n-1}}\end{displaymath}$

(10)

$t_i, 1 \leq i \leq n,$

Convert the raw score to a three-level relevance score $sim_{tls}(q,x)$ through thresholding, with value 2 for relevant, 1 for partially relevant, and 0 for irrelevant:

$\begin{displaymath}sim_{tls} (q, x) = \left \{ \begin{array}{ll}2 & \mbox{if......0 &\mbox{if} \; A(q,x) < \alpha\Theta\end{array} \right .\end{displaymath}$

(11)

$\Theta$

$\alpha$

The TLS method is modeled after the way of how manual evaluation of relevance is commonly done. Its benefit is that it not only gives higher scores to the occurrence of substrings with more distinct terms, but also considers the order of query terms because the changing of the order of query terms may change the meaning of the phrase.

Let's use a simple example to illustrate how the method works. For query ``distributed computing systems'', there are 1 phrase with 3 terms (distributed computing systems), 3 sub-phrases with 2 terms (distributed computing, computing systems and distributed systems), and 3 sub-phrases with 1 term (distributed, computing and systems). Assume that in a given Web page, the number of occurrence of the exact phrase is , the total number of occurrence of the sub-phrases with two terms is , and the total number of occurrence of the sub-phrases with one term is . If the three parameters in the algorithm are set as $\Theta=1$ , $\alpha=0.1$ , and , then we get , and $sim_{tls}=2$ . In the previous work, it is observed that performance evaluation results were not very sensitive to the parameter values of TLS and the default setting, $\Theta=1$ , $\alpha=0.1$ , and , generally worked well [19]. Thus, we used the default setting in the experiments.

Next:ExperimentsUp:Combining the HITS-based AlgorithmsPrevious:Cover Density Ranking (CDR)

2002-02-18