Department of Computer Science and Information Engineering,
National Taiwan University, Taipei 106, Taiwan
f2506005@csie.ntu.edu.tw and
CCHsu@csie.ntu.edu.tw
To overcome these problems, researchers have focused on automatic query expansion to help the user formulate what information is really needed. Another research topic is on relevance feedback from the user which gives the relevance of documents to clarify the ambiguity. In fact, these two techniques complement each other. However, the mechanisms of relevance feedback based on words or documents in the past research both have their own deficiencies. Word feedback has its upper bound performance in lexical-semantic expansion [3] and document feedback is sometimes too tiring for the users.
In this paper, we propose the conceptual feedback together with a joined mechanisms for query expansion. This is continuing research from our previous work based on clustering [2]. The idea is to organize the initial documents retrieved by the original query into conceptual groups such that the user could get a quick overview of what the query actually retrieves. Under this designing philosophy, we choose document clustering as our first step toward conceptual feedback. The hypothesis is that similar documents are more related to the same topic than documents that are less similar to each other.
Indeed, providing a concept-based information results as well as an interactive feedback has attracted many researchers in these two years. The dynamic browsing paradigm of Scatter/Gather that clusters documents into topical-coherent groups is applied in conventional similarity search to navigate the retrieved documents by Hearst and Pedersen [4]. On the other hand, static clustering of the database contents has also been exploited by Anick and Vaithyanathan [1]. They discuss the cognitive load required to assess the content of the clusters from the key terms and introduce natural language processing techniques to extract noun phrases for describing cluster contents.
In this paper, the target is how concept-based feedback can be achieved in a personalized Web information search assistant by integrating existing search engines and techniques of query expansion and relevance feedback. We focus on the mechanisms of keyword extraction for both cluster digesting and query expansion. Furthermore, the personalized Web search assistant can be enhanced by automatic discovery agents to search for more information based on the recorded query history.
As we mentioned earlier, relevance feedback has long been suggested as a solution for query modification. Rocchio describes an elegant approach and shows how the optimal vector space query can be derived using vector addition and subtraction given the relevant and non-relevant documents [6]. The probabilistic model proposed by Robertson and Sparck Jones shows how to adjust the individual term weight based on the distribution of the terms in relevant and non-relevant document set [5].
Now, given the cluster or concept as feedback unit, we would expect an approach to join these two models. Once each cluster has been digested as a document vector, the query can be modified by Rocchio's algorithm [6]. Thus, the problem becomes how keywords can be extracted as clusters digest and the weighting of terms in probabilistic model can be adjusted for this purpose.
The basic idea of feature selection for a concept is to highlight those words that have high frequency with respect to some contrast concept. Past research has applied Robertson and Sparck Jones' term weighting to query expansion [3], given the top 1030 documents as relevant and all other documents in the corpus as non-relevant. For keyword extraction from a cluster, a simple application is to divide the initial documents into "belonging" and "not belonging" with respect to the cluster.
However, we find the direct application of the probabilistic weighing has some problems. Since the number of documents in a cluster is not large (about 10), the weighting is useless for words that appear only in the cluster. Hence, modification of the weighting is needed in this scenario. At the time of writing, the best performance is a rewrite form of "cue validity" joined with the majority principle for keyword selection.
The application of the Web search assistant is not only for online search help but also for constructing automatic discovery agents. Given all the queries the user has submitted, they can viewed as an interest profile for the user. Thus, a background domain manager is responsible for organizing the queries into interest groups. Based on the categorization, the Web assistant suggests related queries that the user has submitted and gives hints for query formulation whenever a query is commenced. On the other hand, we could apply the technique of genetic algorithm to spawn new queries. Taking each query as a chromosome with each word viewed as a gene, the words from the queries in the same interest domain are remixed to derive new queries and search results thereafter.
There are a number of advantages in this Web assistant. First, it accelerates the browsing speed by dividing the initial results into similar document groups. Second, relevance feedback is constructed by dichotomy of documents and simply excludes the ones we do not like. Finally, the creation of a search agenda through genetic algorithm grants the property of autonomy and helps to execute the discovery automatically.