Okapi Similarity Measurement (Okapi)

Okapi similarity measurement is one of the most popular methods in traditional IR. Unlike VSM, the Okapi method considers not only the frequency of the query terms but also the average document length in the collection and the length of the document under evaluation. In the Okapi method, the similarity between a query $q$ and a document $x_i$ can also be described as the inner product of the query vector $Q$ and the document vector $X_i$ as follows [13,23]:

$\displaystyle sim_o(q,x_i) = Q \cdot X_i =\sum_{j=1}^{m} v_j \cdot w_{ij}$

(7)

where $m$ is the number of unique terms in the document collection; $v_j$ is the frequency of term $j$ in the query $q$; and $w_{ij}$ is the document weight:

$\displaystyle w_{ij} = \frac{f_{ij} \cdot \log\left(\frac{N-d_j+0.5}{d_j+0.5}\right)}{2 \cdot \left(0.25+0.75\cdot \frac{dl}{avdl}\right)+f_{ij}}$

(8)

where $f_{ij}$ is the frequency of term $j$ in the document $x_i$; $N$ is the total number of documents in the collection; $d_j$ is the number of documents in the collection that contain the query term $j$; $dl$ is the length of the document $x_i$ (in bytes); and $avdl$ is the average document length in the collection (in bytes).
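Equations (7) and (8) can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the experiments. The function names, the dictionary-based term counts, and the numbers in the usage example are assumptions made for the example, and the natural logarithm is assumed because the base of the logarithm is not stated.

```python
import math


def okapi_weight(f_ij, d_j, n_docs, dl, avdl):
    """Document weight w_ij of Equation (8).

    f_ij: frequency of term j in document x_i
    d_j: number of documents containing term j
    n_docs: total number of documents N
    dl, avdl: document length and average document length (bytes)
    """
    # Natural log is an assumption; the base is not given in the text.
    idf = math.log((n_docs - d_j + 0.5) / (d_j + 0.5))
    length_norm = 2.0 * (0.25 + 0.75 * dl / avdl)
    return f_ij * idf / (length_norm + f_ij)


def okapi_similarity(query_tf, doc_tf, doc_freq, n_docs, dl, avdl):
    """sim_o(q, x_i) of Equation (7).

    Only query terms are visited, because v_j = 0 for every other term.
    """
    score = 0.0
    for term, v_j in query_tf.items():
        f_ij = doc_tf.get(term, 0)
        if f_ij == 0:
            continue  # w_ij is zero when the term is absent from the document
        score += v_j * okapi_weight(f_ij, doc_freq[term], n_docs, dl, avdl)
    return score


# Hypothetical usage: every number below is made up for illustration.
query_tf = {"okapi": 1, "ranking": 1}
doc_tf = {"okapi": 3, "ranking": 1, "web": 7}
doc_freq = {"okapi": 120, "ranking": 4500}
score = okapi_similarity(query_tf, doc_tf, doc_freq,
                         n_docs=1_000_000, dl=8000, avdl=10939)
print(f"sim_o = {score:.4f}")
```

Note that the logarithm in Equation (8) becomes negative when a term occurs in more than half of the documents ($d_j > N/2$). The sketch leaves that behaviour unchanged.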

For reasons similar to the VSM method, the Okapi similarity measurement cannot be applied directly in evaluating the precision of search engines [20]. We need values for $N$ and $d_j$. In our research, we estimate the values of $N$ and $d_j$ in the way described in the previous section for VSM. In addition, the average length of a Web document ($avdl$) is estimated to be 10,939 bytes after removing all HTML tags and JavaScript.
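To illustrate how this estimate enters the length normalization in Equation (8), consider two document lengths chosen purely for illustration (they are not measurements). A document of exactly average length, $dl = avdl = 10{,}939$ bytes, gives the denominator

$\displaystyle 2 \cdot \left(0.25 + 0.75 \cdot \frac{10{,}939}{10{,}939}\right) + f_{ij} = 2 + f_{ij}$

whereas a document twice as long, $dl = 21{,}878$ bytes, gives

$\displaystyle 2 \cdot \left(0.25 + 0.75 \cdot \frac{21{,}878}{10{,}939}\right) + f_{ij} = 3.5 + f_{ij}$

For the same $f_{ij}$, the longer document therefore receives a smaller weight.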

2002-02-18