nextupprevious
Next:ExperimentsUp:Combining the HITS-based AlgorithmsPrevious:Cover Density Ranking (CDR)

Three-Level Scoring Method (TLS)

The TLS method is developed to take human expectation of search results into account and is modeled after existing manual approaches. Many existing manual methods use the following criteria to assign relevance scores [8,9,17]: TLS method computes the relevance of a Web page to a query in the following two steps:
  1. Given a query phrase $q$ with $n$ terms and a Web page $x$, a raw score is calculated as:
  2.  
    \begin{displaymath}A (q, x) = \frac{t_n \cdot k^{n-1} + t_{n-1} \cdot k^{n-2} + \cdots + t_1}{k^{n-1}}\end{displaymath} (10)
    where $k$ is a constant, corresponding to the weight for longer sub-phrases;$t_i, 1 \leq i \leq n,$ is the number of occurrence of the sub-phrases of length $i$, i.e., containing $i$ terms. The order of the terms in the sub-phrases should be exactly the same as that in the original query phrase $q$.
  3. Convert the raw score $A (q, x)$ to a three-level relevance score $sim_{tls}(q,x)$ through thresholding, with value 2 for relevant, 1 for partially relevant, and 0 for irrelevant:
  4.  
    \begin{displaymath}sim_{tls} (q, x) = \left \{ \begin{array}{ll}2 & \mbox{if......0 &\mbox{if} \; A(q,x) < \alpha\Theta\end{array} \right .\end{displaymath} (11)
    where $\Theta$ is a constant threshold, representing the requirement for being relevant; $\alpha$ is a value between 0 and 1, representing the requirement for being partially relevant.
The TLS method is modeled after the way of how manual evaluation of relevance is commonly done. Its benefit is that it not only gives higher scores to the occurrence of substrings with more distinct terms, but also considers the order of query terms because the changing of the order of query terms may change the meaning of the phrase.

Let's use a simple example to illustrate how the method works. For query ``distributed computing systems'', there are 1 phrase with 3 terms (distributed computing systems), 3 sub-phrases with 2 terms (distributed computing, computing systems and distributed systems), and 3 sub-phrases with 1 term (distributed, computing and systems). Assume that in a given Web page, the number of occurrence of the exact phrase is $t_3=2$, the total number of occurrence of the sub-phrases with two terms is $t_2=7$, and the total number of occurrence of the sub-phrases with one term is $t_1=18$. If the three parameters in the algorithm are set as $\Theta=1$$\alpha=0.1$, and $k=10$, then we get $ A=(2*10^2+7*10+18)/10^2=2.88$, and $sim_{tls}=2$. In the previous work, it is observed that performance evaluation results were not very sensitive to the parameter values of TLS and the default setting, $\Theta=1$$\alpha=0.1$, and $k=10$, generally worked well [19]. Thus, we used the default setting in the experiments.


nextupprevious
Next:ExperimentsUp:Combining the HITS-based AlgorithmsPrevious:Cover Density Ranking (CDR)
2002-02-18