Determining relevant pages

Given a query q, an n-word document a = {w₁, w₂, ..., w_n}, and a set of n recognized words, one can represent q and a each as a vector of word frequencies, $\vec{q}$ and $\vec{a}$. A common measure of similarity between two word-frequency vectors $\vec{a}$ and $\vec{q}$, weighted by inverse document frequency (idf), is the cosine of the angle between them (often called the cosine distance, although a larger value means the vectors are more similar):

$\displaystyle \mathrm{score}(\mathbf{q}, \mathbf{a}) = \frac{\sum_{w \in q,a} \lambda_{w}^{2} \cdot f_{q}(w) \cdot f_{a}(w)}{\sqrt{\sum_{w \in q} (\lambda_{w} f_{q}(w))^{2} \cdot \sum_{w \in a} (\lambda_{w} f_{a}(w))^{2}}}$ ,

where $f_{d}(w)$ is the number of times word w appears in the document d, and $\lambda_{w}$ is the inverse document frequency of the word w, defined as:

$\displaystyle \lambda_{w} = \log \left( \frac{\vert \mathcal{D} \vert}{\vert \{ d \in \mathcal{D} : f_{d}(w) > 0 \} \vert} \right)$

where $\mathcal{D}$ is the set of documents under consideration.
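
The score and idf definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work. It assumes that q, a, and every member of the document set are already tokenized into lists of words, and that natural logarithms are used. The function names `idf` and `score`, and the choice to give zero weight to a word that occurs in no document, are assumptions made for the example.

```python
import math
from collections import Counter


def idf(word, documents):
    """lambda_w = log(|D| / |{d in D : f_d(w) > 0}|)."""
    containing = sum(1 for d in documents if word in d)
    if containing == 0:
        # Assumption: a word absent from every document gets zero weight,
        # which avoids dividing by zero.
        return 0.0
    return math.log(len(documents) / containing)


def score(q, a, documents):
    """idf-weighted cosine similarity between token lists q and a."""
    f_q, f_a = Counter(q), Counter(a)  # word frequencies f_q(w), f_a(w)
    lam = {w: idf(w, documents) for w in set(f_q) | set(f_a)}

    # Numerator: sum over words that occur in both q and a.
    numerator = sum(lam[w] ** 2 * f_q[w] * f_a[w] for w in set(f_q) & set(f_a))

    # Denominator: square root of the product of the two weighted norms.
    norm_q = sum((lam[w] * f_q[w]) ** 2 for w in f_q)
    norm_a = sum((lam[w] * f_a[w]) ** 2 for w in f_a)
    denominator = math.sqrt(norm_q * norm_a)

    # Assumption: report 0 when either weighted vector is all zeros.
    return numerator / denominator if denominator > 0 else 0.0


# Hypothetical toy document set and query, for illustration only.
docs = [
    ["web", "crawler", "index"],
    ["web", "search", "engine"],
    ["cooking", "recipes"],
]
query = ["web", "crawler"]
ranked = sorted(docs, key=lambda d: score(query, d, docs), reverse=True)
```

Note that a word appearing in every document of $\mathcal{D}$ has $\lambda_{w} = \log 1 = 0$, so it contributes nothing to either the numerator or the denominator.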

Sandeep Pandey 2003-03-05