Next:ExperimentsUp:Combining
the HITS-based AlgorithmsPrevious:Cover
Density Ranking (CDR)
Three-Level Scoring Method (TLS)
The TLS method is developed to take human expectation of search
results into account and is modeled after existing manual approaches. Many
existing manual methods use the following criteria to assign relevance
scores [8,9,17]:
-
Relevant links are links that are related to information needs of a query
or have many links to other Web pages which may be useful to the query.
They get score 2.
-
Slightly relevant links are the pages that are partially related to a query.
They contain too short a definition to be useful, or contain the introduction
of company products that have some technical solutions to a problem relevant
to the query, or contain some description to a related textbook. They are
given score 1.
-
Duplicate links are the pages that appear in the returned links with the
same URL more than one time, and the links that differ in that one has
index.htm or index.html as the suffix and the other does not. They are
given score 0. Mirror sites are not counted as duplicates.
-
Inactive links are the links that give error messages like file not found
(404), forbidden, or server not responding (603) errors. These links are
given score 0.
-
Irrelevant links are the links containing irrelevant information to the
query. They get score 0.
TLS method computes the relevance of a Web page to a query in the
following two steps:
-
Given a query phrase
with
terms and a Web page ,
a raw score is calculated as:
|
(10) |
where
is a constant, corresponding to the weight for longer sub-phrases;
is the number of occurrence of the sub-phrases of length ,
i.e., containing
terms. The order of the terms in the sub-phrases should be exactly the
same as that in the original query phrase .
-
Convert the raw score
to a three-level relevance score
through thresholding, with value 2 for relevant, 1 for partially relevant,
and 0 for irrelevant:
|
(11) |
where
is a constant threshold, representing the requirement for being relevant;
is a value between 0 and 1, representing the requirement for being partially
relevant.
The TLS method is modeled after the way of how manual evaluation
of relevance is commonly done. Its benefit is that it not only gives higher
scores to the occurrence of substrings with more distinct terms, but also
considers the order of query terms because the changing of the order of
query terms may change the meaning of the phrase.
Let's use a simple example to illustrate how the method works. For query
``distributed computing systems'', there are 1 phrase with 3 terms
(distributed computing systems), 3 sub-phrases with 2 terms (distributed
computing, computing systems and
distributed systems),
and 3 sub-phrases with 1 term (distributed, computing and
systems).
Assume that in a given Web page, the number of occurrence of the exact
phrase is ,
the total number of occurrence of the sub-phrases with two terms is ,
and the total number of occurrence of the sub-phrases with one term is .
If the three parameters in the algorithm are set as , ,
and ,
then we get ,
and .
In the previous work, it is observed that performance evaluation results
were not very sensitive to the parameter values of TLS and the default
setting, , ,
and ,
generally worked well [19]. Thus, we used
the default setting in the experiments.
Next:ExperimentsUp:Combining
the HITS-based AlgorithmsPrevious:Cover
Density Ranking (CDR)
2002-02-18