This paper gives an overview of the evaluation design of the Web Retrieval Task in the Third NTCIR Workshop, which is currently in progress. In the Web Retrieval Task, we aim to compare the retrieval effectiveness of Web search engine systems using a common data set, and to build a re-usable test collection suitable for evaluating such systems. With these objectives, we have built a 100-Gigabyte and a 10-Gigabyte document set, gathered mainly from the '.jp' domain. Relevance judgments are performed on the retrieved documents, which are written in Japanese or English.
Keywords: Test collection, Evaluation methodology, Web search engines
Several evaluation workshops have recently been held and have attracted keen interest. These workshops aim to compare the effectiveness of various systems for a particular task on a common basis. TREC [1] and the NTCIR Workshop [2] are evaluation workshops for information retrieval (IR) and related areas, and they compare the retrieval effectiveness of IR systems using a common test collection. Here, a test collection means a benchmark for the experimental evaluation of IR systems, and is composed of (1) the document set, (2) the topics and (3) the list of relevance judgments for each topic.
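As an informal illustration of these three components, the following is a minimal sketch of how a test collection might be represented in code; the field names and the graded relevance levels shown here are our own assumptions, not part of the collection specification.

# Minimal sketch of the three components of a test collection; the field
# names and relevance grades are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Topic:
    topic_id: str
    title: str          # short list of query terms
    description: str    # one-sentence description of the information need

@dataclass
class Judgment:
    topic_id: str
    doc_id: str
    relevance: int      # e.g. 0 = non-relevant, 1 = partially relevant, 2 = relevant

@dataclass
class TestCollection:
    documents: dict[str, str]   # (1) the document set: doc_id -> document text
    topics: list[Topic]         # (2) the topics
    judgments: list[Judgment]   # (3) the relevance judgments for each topic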
The objective of the Web Retrieval Task in the Third NTCIR Workshop (hereafter 'NTCIR Web Task') [3] is 'to research the retrieval of Web documents that have a structure with tags and links, and that are written in Japanese or English'. Task design and evaluation methods are considered in light of the characteristic features of Web retrieval. This paper describes the evaluation design of the NTCIR Web Task.
Past TREC Web Tracks [4-5] have used data sets extracted from 'the Internet Archive' [6] as document sets, and the relevance of each document in the submitted results has been judged; these documents are written in English only. In the NTCIR Web Task, we have prepared two types of document sets, gathered mainly from the '.jp' domain: one is over 100 Gigabytes and the other is a selected 10 Gigabytes. Almost all the documents appear to be written in Japanese or English. Participants will only be allowed to use the original document sets inside the National Institute of Informatics (NII), since the data set is too large to handle easily and there are some restrictions on the delivery of the original data. Participants will use the computer resources in the 'Open-Lab' located at NII to perform data processing, e.g. indexing of the original document data, and will then take the resulting data and perform retrieval experiments on it in their own laboratories.
The topic format is basically inherited from that of the past NTCIR workshops, except for the definitions of <TITLE>, <RDOC> and <USER> and the format of <NARR>. The usable fields and mandatory fields vary according to the subtasks described in Section 4. Each field, surrounded by a pair of tags, has the following meaning:
All of the topics are written in Japanese. A topic example translated into English is shown in Figure 1.
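As a rough sketch of how a retrieval system might read these topic fields, the following assumes an SGML-like layout in which each field is enclosed in a pair of tags; the file layout is an assumption on our part, and only the tag names come from the description above.

# Sketch of extracting topic fields such as <TITLE>, <DESC>, <NARR>, <RDOC>
# and <USER> from the text of one topic. The surrounding file layout is an
# assumption; only the field names are taken from the task description.
import re

FIELDS = ("TITLE", "DESC", "NARR", "RDOC", "USER")

def parse_topic(topic_text: str) -> dict:
    fields = {}
    for name in FIELDS:
        match = re.search(rf"<{name}>(.*?)</{name}>", topic_text, re.DOTALL)
        if match:
            fields[name] = match.group(1).strip()
    return fields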
Usually, the assessor who created the topic judges the relevance of each document in the document pool, which is composed of the top-ranked search results from each participant's search engine system. For the dry run, when the assessor judges the relevance of a page, we allow him or her to refer to the contents of the out-linked pages, on condition that they are included in the document pool. We will analyze the dry run results and reconsider the method of relevance judgments for the formal run.
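The pooling step described above can be sketched as follows; the pool depth used in this example is an arbitrary assumption rather than the actual setting of the task.

# Illustrative sketch of building a document pool for one topic by merging
# the top-ranked documents of each submitted run. The pool depth of 100 is
# an assumed value, not the official setting.
def build_pool(runs: list[list[str]], depth: int = 100) -> set[str]:
    pool: set[str] = set()
    for ranked_doc_ids in runs:
        pool.update(ranked_doc_ids[:depth])   # keep the top-ranked documents of this run
    return pool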
The Web Retrieval Task is composed of the following subtasks for the two document sets: 100 Gigabytes and 10 Gigabytes.
'Survey Retrieval' is similar to traditional ad-hoc retrieval of scientific documents or newspaper articles. Both recall and precision are emphasized in the evaluation. The Survey Retrieval Subtask is divided into the following two subtasks: retrieval using topic sentences and terms (hereafter 'Topic Retrieval'), and retrieval using given relevant documents (hereafter 'Similarity Retrieval'). Participants in the Topic Retrieval Subtask must submit run results using <TITLE> and <DESC> respectively, and may optionally use the other topic fields. Participants in the Similarity Retrieval Subtask must use at least the first document appearing in <RDOC>, and may use the other documents indicated in <RDOC> and/or the terms indicated in <TITLE>.
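Because both recall and precision matter here, a rank-based summary such as non-interpolated average precision is a natural way to score a run; choosing this particular measure for the sketch below is our assumption, not a statement of the official evaluation procedure.

# Sketch of non-interpolated average precision for one topic; the use of this
# particular measure is an illustrative assumption.
def average_precision(ranked_doc_ids: list[str], relevant: set[str]) -> float:
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank        # precision at each relevant document found
    return total / len(relevant) if relevant else 0.0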
'Target Retrieval' attempts to evaluate retrieval effectiveness in cases where the user requires just one answer or only a few, so precision should be emphasized. The runs will be evaluated using the 10 top-ranked documents retrieved for each topic. The mandatory runs are the same as for the Topic Retrieval Subtask. Several evaluation measures will be applied.
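Since the measures to be applied are not fixed here, the following shows two plausible precision-oriented examples computed over the 10 top-ranked documents; treating these as the measures of the subtask would be an assumption on our part.

# Two illustrative precision-oriented measures over the top 10 documents;
# whether either will actually be used is not specified, so both are assumptions.
def precision_at_k(ranked_doc_ids: list[str], relevant: set[str], k: int = 10) -> float:
    top = ranked_doc_ids[:k]
    return sum(1 for doc_id in top if doc_id in relevant) / k

def reciprocal_rank(ranked_doc_ids: list[str], relevant: set[str], k: int = 10) -> float:
    for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank           # reward finding a relevant document early
    return 0.0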
Participants can freely submit proposals relating to their own research interests, using the document set of the aforementioned subtasks. The results will be presented as a paper or poster at the workshop meeting. A proposal can be adopted as a subtask and investigated in detail if it involves several participants. As a result, 'Search Results Classification' and 'Speech-Driven Retrieval' have been adopted [3].