Each participating group has its own twist on the above two components. For example, several groups use collection enrichment [10], in which a much larger document collection is used in the first pass (instead of the target collection) to locate documents for use in the query-expansion process. In yet another enhancement, several groups assume that poorly ranked documents from the first pass are not relevant to the query and use this evidence of non-relevance in the query expansion process [20].
We implement an algorithm that is a scaled-down version of the ad-hoc algorithm used by Singhal et al. in [20]. As described in [23], this algorithm was one of the best-performing ad-hoc algorithms at TREC-7. Here are the steps implemented in our algorithm.
Since web collections contain an appreciable number of duplicate documents, we have observed that effective query expansion on these collections requires eliminating duplicates from the top ten documents used for expansion. To do this, we retrieve the top 100 documents in the first pass and, starting from rank 2, test whether each retrieved document is a duplicate of a previously ranked document; if it is, we remove it from the list. We continue until we have ten distinct documents. Two documents are considered duplicates of each other if they share more than 70% of their vocabulary, which we have found to be a reasonable heuristic for web pages.
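As a rough illustration, the following Python sketch implements this filtering step under stated assumptions: tokenization is a simple whitespace split, and the 70% overlap test is taken over the smaller of the two vocabularies (the exact overlap measure is not spelled out above, so this is one plausible reading, not the definitive implementation).

```python
# Hypothetical sketch of the duplicate-filtering step described above.
# The tokenizer and overlap measure are assumptions; the 70% threshold and
# the "top 100 -> first 10 distinct" procedure follow the text.

def vocabulary(text):
    """Return the set of distinct terms in a document (simple whitespace split)."""
    return set(text.lower().split())

def overlap_fraction(vocab_a, vocab_b):
    """Fraction of the smaller vocabulary that is shared with the other document."""
    if not vocab_a or not vocab_b:
        return 0.0
    return len(vocab_a & vocab_b) / min(len(vocab_a), len(vocab_b))

def top_distinct_documents(ranked_docs, wanted=10, threshold=0.70):
    """Scan the first-pass ranking (e.g., the top 100 documents) from rank 1 down,
    dropping any document that shares more than `threshold` of its vocabulary with
    an already accepted document, until `wanted` distinct documents are collected."""
    accepted, accepted_vocabs = [], []
    for doc in ranked_docs:
        vocab = vocabulary(doc)
        if any(overlap_fraction(vocab, v) > threshold for v in accepted_vocabs):
            continue  # near-duplicate of a higher-ranked document
        accepted.append(doc)
        accepted_vocabs.append(vocab)
        if len(accepted) == wanted:
            break
    return accepted
```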
To test that our implementation of this TREC ad-hoc algorithm is not broken and is indeed state-of-the-art, we run our system on two recent TREC ad-hoc tasks and compare its precision to that of the best performing systems at TREC. Since most web queries are short, we want to evaluate system performance for short queries, and therefore use only the 2-3 word title portion of the TREC queries. Because our objective in this study is a precision-oriented evaluation, we compare systems based only on precision at top ranks. We compare the precision of our system at rank 10 and at rank 20 to the corresponding values for the five best performing systems at TREC using title-only queries (these values are available from the detailed results presented in the TREC proceedings; see Appendix A in [21] and [22]). The results are shown in Tables 2 and 3.
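For concreteness, precision at rank k is simply the fraction of the top k retrieved documents that are judged relevant. The minimal sketch below (with illustrative names that are not from the paper) shows how the values reported in Tables 2 and 3 could be computed from a ranked list and the TREC relevance judgments, before averaging over queries.

```python
# Minimal sketch of the precision-at-rank measure used in Tables 2 and 3.
# `relevant` stands in for the set of judged-relevant document ids (qrels) for a
# query; all names here are illustrative assumptions, not the paper's code.

def precision_at_k(ranked_doc_ids, relevant, k):
    """Fraction of the top k retrieved documents that are judged relevant."""
    top_k = ranked_doc_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / k

# Per-query usage, then averaged over all queries in the task:
#   ranking = run_query(title_only_query)      # hypothetical retrieval call
#   p10 = precision_at_k(ranking, qrels[query_id], 10)
#   p20 = precision_at_k(ranking, qrels[query_id], 20)
```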
Tables 2 and 3 show the precision values for the best TREC systems, ordered by decreasing performance, with the corresponding precision value for our system inserted in that order. These results show that our system, motivated by a state-of-the-art TREC ad-hoc algorithm, is quite competitive with the top performing TREC systems. This is especially true considering that the performance gap between the best and the fifth-best system is not very large. For example, Table 2 indicates that for the TREC-7 ad-hoc task, the best performing system, ok7as, retrieves on average 4.86 relevant documents in the top 10 for a query, whereas the fifth-best system retrieves 4.28. That difference is not very large from a user's perspective.
In summary, these results verify that our implementation of a modern TREC ad-hoc algorithm is not broken and is indeed state-of-the-art. It is reasonable to expect that this system, when run over our fresh web collection, would produce results quite comparable to those produced by any other top TREC ad-hoc system.