Mining Clickthrough Data for Collaborative Web Search

Jian-Tao Sun

Microsoft Research Asia
Beijing, P.R. China

Xuanhui Wang

Department of Computer Science, University of Illinois at Urbana-Champaign
Urbana, Illinois, USA

Dou Shen

Department of Computer Science, Hong Kong University of Science and Technology
HongKang, P.R. China

Hua-Jun Zeng

Microsoft Research Asia
Beijing, P.R. China

Zheng Chen

Microsoft Research Asia
Beijing, P.R. China

ABSTRACT

This paper is to investigate the group behavior patterns of search activities based on Web search history data, i.e., clickthrough data, to boost search performance. We propose a Collaborative Web Search (CWS) framework based on the probabilistic modeling of the co-occurrence relationship among the heterogeneous web objects: users, queries, and Web pages. The CWS framework consists of two steps: (1) a cube-clustering approach is put forward to estimate the semantic cluster structures of the Web objects; (2) Web search activities are conducted by leveraging the probabilistic relations among the estimated cluster structures. Experiments on a real-world clickthrough data set validate the effectiveness of our CWS approach.

Categories & Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval-Search Process; H.3.5 [Information Storage and Retrieval]: Online Information Services-Web based services

General Terms

Algorithms, Experimentation, Performance

Keywords

Clickthrough Data, Cube-Clustering, Collaborative Web Search

1 INTRODUCTION

However, mining clickthrough data is really challenging: (1) the clickthrough data contains heterogeneous objects, including users, queries, and Web pages, and the relationship among these objects are complicated; (2) since users always only click the first few pages for a query, the data is highly sparse. In order to effectively conduct clickthrough data analysis for Web search, it is a key factor to handle the sparseness problem and discover the hidden relations among the heterogeneous data objects.

In this paper, we propose a framework to overcome the above difficulties. This framework contains the following two steps. (1) The hidden cluster structures contained in the clickthrough data, as well as the probabilistic relations among them, are discovered by an unsupervised method: cube-clustering. (2) The relations among the hidden clusters are used to improve the search performance in a collaborative manner. Since we exploit the group structures of the clickthrough data and our approach is motivated by the collaborative filtering idea, we name it Collaborative Web Search (CWS). In [3], the authors also developed a collaborative search system named I-SPY. Their system is a type of meta-search engine and requires users to explicitly select a community before search activities are conducted. In our CWS framework, the clickthrough data is automatically utilized and Web users' manual efforts are not required.

2 Collaborative Web Search

2.1 Cube-Clustering Approach

We treat a cube data of the three dimensions: user, query, and Web page, as a joint probability distribution among three discrete random variables:

. Variables

and

take values from the user set $\{u_1, \cdots,u_l\}$ , query set $\{q_1, \cdots,q_m\}$ , and page set $\{p_1, \cdots,p_n\}$ . Our goal is to cluster the

users,

queries, and

pages into

and

clusters respectively: $\{\hat{u}_1,\cdots,\hat{u}_i\}$ , $\{\hat{q}_1,\cdots,\hat{q}_j\}$ , $\{\hat{p}_1,\cdots,\hat{p}_k\}$ . We use

and

to denote the mapping functions and refer to the triple

as a cube clustering.

Given the mapping functions

, and

are the cluster random variables and we denote them as $\hat{U}$ , $\hat{Q}$ , $\hat{P}$ respectively. Apparently, a fixed cube clustering mapping determines the joint probability of the cluster random variables. We judge the quality of cube clustering by the loss in multi-information. The multi-information is a measurement to capture the amount of information that a set of variables contain about each other. Naturally, a good clustering should preserve the information of the original data as much as possible, thus minimize the loss in multi-information: $I(U,Q,P)-I(\hat{U},\hat{Q},\hat{P})$ . This is also the objective function that an optimal cube clustering minimizes, subject to the constraints on the cluster numbers in each dimension. For a cube clustering, we find the loss in multi-information is equal to the KL divergence between

and $\tilde{Pr}(U, Q, P)$ , where $\tilde{Pr}(U, Q, P)$ is a distribution of the form

where $u\in \hat{u}, q\in \hat{q}, p\in \hat{p}$ . That is, the cube clustering can be approached by minimizing the KL divergence between

and $\tilde{Pr}(U, Q, P)$ . We can also prove that the loss in multi-information can be expressed as a weighted sum of the KL-divergence between two distributions associated with a fixed dimension. Thus the calculation of the objective function can be solely expressed in terms of user clustering, query clustering, or page clustering. Based on this conclusion, we have the Cube-Clustering algorithm. We omit all the proofs for the space reason.

The input of the Cube-Clustering algorithm is the joint probability

. Assume

are the desired number of clusters for each dimension. The output is the partition function

. The Cube-Clustering algorithm is described as follows:

Step 1: Start with some initial partition functions, thus for each , , , we have its corresponding cluster. Compute

Step 2: (2.1) Calculate the distributions $\tilde{Pr}(Q,P\vert\hat{u})$ , $1\leq \hat{u} \leq i$ using

Step 4: Iterate Step 2 and Step 3 until the change in objective function is less than a small value, e.g., $10^{-6}$ ; else go to step 2.

2.2 Search Based on Cube-Clustering

After the cube-clustering step, the Web search problem is converted to recommendation of a ranked page list according to their relevance with the $\left<u,q\right>$ pair, instead of depending only on

. In our CWS framework, we rank Web pages by estimating $\tilde{Pr}(p\vert u,q)$ . Given

and

$\displaystyle \tilde{Pr}(p\vert u,q)$ $\textstyle =$ $\displaystyle \frac{\tilde{Pr}(u,q,p)}{\tilde{Pr} (u,q)}$ $% latex2html id marker 544 $\textstyle \propto$$ $\displaystyle \tilde{Pr} (u,q,p)$ $\textstyle =$ $\displaystyle Pr(\hat{u}, \hat{q}, \hat{p})Pr(u\vert\hat{u})Pr(q\vert\hat{q})Pr(p\vert\hat{p})$ $% latex2html id marker 548 $\textstyle \propto$$ $\displaystyle Pr(\hat{u}, \hat{q}, \hat{p})Pr(p\vert\hat{p})$ (4)

Thus according to Eq (4), the Web pages are ranked by $Pr(\hat{u}, \hat{q}, \hat{p})Pr(p\vert\hat{p})$ .

3 EXPERIMENTS AND CONCLUSIONS

**Table 1:** Search Results of CF and CWS
	(CF)	(CWS)	Relative Improvement
1	0.581887	0.632321	8.66%
2	0.315347	0.339479	7.75%
3	0.214571	0.232106	8.17%
4	0.162961	0.176383	8.23%
5	0.131669	0.141973	7.82%

We use 10 days' clickthrough data for experiments, the first 5 days' data is used for estimating the cluster structures and the rest 5 days' for testing. The search accuracy is evaluated using the

measure, that is, the percentage of correct pages among the top

pages. We compared the result of CWS and Pearson collaborative filtering algorithm (CF) [1]. The results are given in Table 1.

is varied from 1 to 5. We can see the CWS approach leads to better search result compared with CF (around 8% improvement).

This shows the CF algorithm can not effectively exploit the clickthrough data as the data is three-way and highly sparse. However, the cluster structures can be discovered by the Cube-Clustering algorithm and the probabilistic relations among them can be utilized for improving Web search. This validates the effectiveness of our CWS approach: leveraging the clickthrough data for collaborative Web search.

REFERENCES

[1] J. S. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In UAI, pages 43-52. Morgan Kaufman, 1998.

[2] I. S. Dhillon, S. Mallela, and D. S. Modha. Information-theoretic co-clustering. In SIGKDD, pages 89-98, 2003.

[3] B. Smyth, E. Balfe, O. Boydell, K. Bradley, P. Briggs, M. Coyle, and J. Freyne. A live-user evaluation of collaborative web search. In IJCAI, pages 1419-1424, 2005.

Mining Clickthrough Data for Collaborative Web Search

Jian-Tao Sun

Microsoft Research Asia
Beijing, P.R. China

jtsun@microsoft.com

Xuanhui Wang

Department of Computer Science, University of Illinois at Urbana-Champaign
Urbana, Illinois, USA

xwang20@cs.uiuc.edu

Dou Shen

Department of Computer Science, Hong Kong University of Science and Technology
HongKang, P.R. China

dshen@cs.ust.hk

Hua-Jun Zeng

Microsoft Research Asia
Beijing, P.R. China

hjzeng@microsoft.com

Zheng Chen

Microsoft Research Asia
Beijing, P.R. China

zhengc@microsoft.com

ABSTRACT

Categories & Subject Descriptors

General Terms

Keywords

1 INTRODUCTION

2 Collaborative Web Search

2.1 Cube-Clustering Approach

2.2 Search Based on Cube-Clustering

3 EXPERIMENTS AND CONCLUSIONS

REFERENCES

Mining Clickthrough Data for Collaborative Web Search

Jian-Tao Sun

Microsoft Research Asia Beijing, P.R. China

jtsun@microsoft.com

Xuanhui Wang

Department of Computer Science, University of Illinois at Urbana-Champaign Urbana, Illinois, USA

xwang20@cs.uiuc.edu

Dou Shen

Department of Computer Science, Hong Kong University of Science and Technology HongKang, P.R. China

dshen@cs.ust.hk

Hua-Jun Zeng

Microsoft Research Asia Beijing, P.R. China

hjzeng@microsoft.com

Zheng Chen

Microsoft Research Asia Beijing, P.R. China

zhengc@microsoft.com

ABSTRACT

Categories & Subject Descriptors

General Terms

Keywords

1 INTRODUCTION

2 Collaborative Web Search

2.1 Cube-Clustering Approach

2.2 Search Based on Cube-Clustering

3 EXPERIMENTS AND CONCLUSIONS

REFERENCES

Microsoft Research Asia
Beijing, P.R. China

Department of Computer Science, University of Illinois at Urbana-Champaign
Urbana, Illinois, USA

Department of Computer Science, Hong Kong University of Science and Technology
HongKang, P.R. China

Microsoft Research Asia
Beijing, P.R. China

Microsoft Research Asia
Beijing, P.R. China