One of the main aims of the TREC web track has been to answer the question of whether link-based methods are better than keyword-based methods for web search. Most results coming out of the web track indicate that (as measured under TREC) link-based methods do not have any advantage over a state-of-the-art keyword-based TREC ad-hoc algorithm. For example, according to Hawking et al. in [6]:
...results are presented for an effectiveness comparison of six TREC systems ...against five well-known Web search systems .... These (results) suggest that the standard rankings produced by the public web search engines is by no means state-of-the-art.
...all five (public web) search engines performed below the median P@20 for (short) title-only (TREC) VLC2 submissions ...
Also, in [7] Hawking et al. say that:
...Little benefit was derived from the use of link-based methods for standard TREC measures on the WT2g collection. ...One group investigated the use of PageRank scores and found no benefit on standard TREC measures. ...
On a similar note, Savoy and Picard in [17] say that, as implemented in their study:
...Hyper-links do not result in any significant improvement ...
Overall, the sentiment in [5,6,7,17] is that when applied to web search, state-of-the-art keyword-based techniques used in TREC ad-hoc systems are as effective as link-based methods. Hawking et al. do qualify these counter-intuitive results by noting several shortcomings of the TREC environment that might be causing them. For example, in [7] they say:
...The number of inter-server links within WT2g may have been too small or it may be that link-based methods would have worked better with different types of queries and/or with different types of relevance judgments. ...
These caveats to the results presented in [5,6,7] are the main focus of this study. We observe the following shortcomings of the evaluations done in the TREC web track, and design a new evaluation that aims to remove them so that the effectiveness of link-based vs. keyword-based search algorithms can be studied again:
For example, in an in-house pilot study, we found that for the query ``new york city subway'' (posed by one of our users) our TREC ad-hoc algorithm retrieved eighteen of the top twenty pages from the site www.nycsubway.org, and all were judged relevant by our user. Most commercial search engines realize that this is not very desirable from the users' perspective: once on the site www.nycsubway.org, users like to browse the pages on that site themselves. Therefore, most commercial search engines group the results by site. Page-based precision measurement tends to favor TREC ad-hoc algorithms, which can retrieve twenty pages, all relevant, from a single site. On the other hand, the site-based grouping done by most commercial web search engines artificially depresses the precision value for these engines (as measured under TREC), because it groups several relevant pages under one item and fills the remaining ranks with other, possibly non-relevant, sites.
The problem that not all relevant documents are pertinent to a user is a long-standing one in retrieval evaluation [3]. Since pertinence is hard to quantify, most retrieval evaluations simply use document relevance as the evaluation criterion. The web search engines take the view, and in our opinion rightly so, that multiple pages from the same site, even though relevant, are less pertinent than relevant pages from different sites. The TREC evaluations ignore this aspect.
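To make the effect of site grouping on page-based precision concrete, the following is a minimal sketch and not the evaluation procedure used in this study. The two rankings are hypothetical: the first mimics the pilot-study pattern of eighteen relevant pages from www.nycsubway.org in the top twenty, and the second mimics a site-grouped result list in which half of the entries are assumed relevant. The hosts under `.example`, the page paths, the relevance labels, and the helper names `precision_at_k` and `distinct_relevant_sites` are all assumptions introduced for illustration.

```python
from urllib.parse import urlparse


def precision_at_k(ranked, k):
    """Page-based P@k: fraction of the top-k results judged relevant."""
    return sum(1 for _, relevant in ranked[:k] if relevant) / k


def distinct_relevant_sites(ranked, k):
    """Number of distinct hosts with at least one relevant page in the top k."""
    return len({urlparse(url).netloc for url, relevant in ranked[:k] if relevant})


# Hypothetical ad-hoc ranking: 18 relevant pages from one site, plus 2
# relevant pages from two other (made-up) sites.
adhoc = [(f"http://www.nycsubway.org/page{i}.html", True) for i in range(18)]
adhoc += [
    ("http://site-a.example/subway.html", True),
    ("http://site-b.example/map.html", True),
]

# Hypothetical site-grouped ranking: one entry per host; the relevance
# labels (every odd-numbered made-up site is relevant) are assumed.
grouped = [("http://www.nycsubway.org/", True)]
grouped += [(f"http://site-{i}.example/", i % 2 == 1) for i in range(19)]

for name, ranking in (("ad-hoc", adhoc), ("site-grouped", grouped)):
    print(
        name,
        "P@20 =", precision_at_k(ranking, 20),
        "distinct relevant sites =", distinct_relevant_sites(ranking, 20),
    )
```

Under these assumed labels, the ad-hoc ranking scores a higher page-based P@20 while covering far fewer distinct relevant sites than the grouped ranking, which is the mismatch between relevance and pertinence described above.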
These shortcomings of previous work do not give us confidence that the results from these studies will hold in a realistic, more recent web search environment. In the following section, we describe our experimental environment, which aims to remove these shortcomings and to evaluate the effectiveness of current link-based web search engines vs. a state-of-the-art keyword-based TREC ad-hoc algorithm.