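Given a text $q$, an $n$-word document $a = \{w_1, w_2, \ldots, w_n\}$, and a set of recognized words, one can represent $q$ and $a$ each as a vector of word frequencies, $\mathbf{q}$ and $\mathbf{a}$. A common measure of similarity between two word frequency vectors $\mathbf{q}$ and $\mathbf{a}$, weighted by inverse document frequency (idf), is the cosine similarity (often loosely called the cosine distance) between them:

\[
\mathrm{sim}(\mathbf{q}, \mathbf{a}) = \frac{\sum_{w} \mathrm{idf}(w)^2 \, f_q(w) \, f_a(w)}{\sqrt{\sum_{w} \bigl(\mathrm{idf}(w) \, f_q(w)\bigr)^2} \; \sqrt{\sum_{w} \bigl(\mathrm{idf}(w) \, f_a(w)\bigr)^2}}
\]

where the sums run over the recognized words, $f_d(w)$ is the number of times word $w$ appears in the document $d$, and $\mathrm{idf}(w)$ is the inverse document frequency of the word $w$, defined in its standard form as:

\[
\mathrm{idf}(w) = \log \frac{|D|}{|\{d \in D : f_d(w) > 0\}|}
\]

with $D$ denoting the document collection.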
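As a minimal sketch of this measure, the following Python computes an idf-weighted cosine similarity between two token lists. It assumes the standard idf, the logarithm of the collection size divided by the number of documents containing the word, and it gives weight 0 to words not seen in the collection. The function names `idf_weights` and `tfidf_cosine`, the whitespace tokenization, and the toy collection are illustrative choices, not taken from the text.

```python
import math
from collections import Counter


def idf_weights(collection):
    """Map each word to log(N / df), where df counts the documents containing it."""
    n_docs = len(collection)
    doc_freq = Counter()
    for doc in collection:
        doc_freq.update(set(doc))  # count each word once per document
    return {w: math.log(n_docs / df) for w, df in doc_freq.items()}


def tfidf_cosine(q, a, idf):
    """Cosine similarity of two token lists under idf-weighted word frequencies."""
    fq, fa = Counter(q), Counter(a)
    # Words missing from idf get weight 0 (an assumption of this sketch).
    dot = sum(fq[w] * fa[w] * idf.get(w, 0.0) ** 2 for w in fq.keys() & fa.keys())
    norm_q = math.sqrt(sum((f * idf.get(w, 0.0)) ** 2 for w, f in fq.items()))
    norm_a = math.sqrt(sum((f * idf.get(w, 0.0)) ** 2 for w, f in fa.items()))
    if norm_q == 0.0 or norm_a == 0.0:
        return 0.0  # cosine is undefined for an all-zero vector; return 0 by convention
    return dot / (norm_q * norm_a)


# Toy collection (hypothetical data, tokenized by whitespace).
collection = [
    "how do i reset my password".split(),
    "click the reset link to choose a new password".split(),
    "our office is closed on public holidays".split(),
]
idf = idf_weights(collection)
score_related = tfidf_cosine(collection[0], collection[1], idf)
score_unrelated = tfidf_cosine(collection[0], collection[2], idf)
score_self = tfidf_cosine(collection[0], collection[0], idf)
```

In this toy collection the first two documents share "reset" and "password", so their score is positive, while the first and third share no words and score zero. A word that occurs in every document has idf zero under this definition and therefore contributes nothing to the similarity.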