Web search engines have become the most popular means of finding information relevant to a topic on the web. However, search engine users often have difficulty organizing and expressing their information needs as simple queries. It is therefore desirable for search engines to suggest similar and related queries for a user's input query. In addition, the discovered related queries can be further used for query expansion or search optimization. Some recent works [1, 2] have made good attempts at mining related queries from search engine query logs, and some of the results were promising. In this work, we propose an improved association rule mining model that mines related queries from query transactions in query logs. We also propose a simple but effective segmentation algorithm that segments user sessions into query transactions.
We present the definitions of the key terminology in this section.
Query Record: A query record represents the submission of a single query from a user to the search engine at a certain time. It is represented as a triplet I_i = (q_i, ip_i, t_i), where q_i is the submitted query (i.e. its terms), ip_i is the IP address of the host from which the user issues the query, and t_i is the timestamp at which the user submits the query.
Query Transaction: A query transaction is a search process that 1) focuses on the same topic or strongly related topics, 2) takes place within a bounded and consecutive period, and 3) is issued by the same user. It is represented as a series of query records in temporal order, i.e. T_j = {I_j1, I_j2, …, I_jm} = {(q_j1, ip_j1, t_j1), (q_j2, ip_j2, t_j2), …, (q_jm, ip_jm, t_jm)}, where ip_j1 = ip_j2 = … = ip_jm and t_j1 ≤ t_j2 ≤ … ≤ t_jm.
User Session: A user session contains the history of all query records in the query log that belong to the same user. It is represented as a series of query records in temporal order, i.e. S_k = {I_k1, I_k2, …, I_kn} = {(q_k1, ip_k1, t_k1), (q_k2, ip_k2, t_k2), …, (q_kn, ip_kn, t_kn)}, where ip_k1 = ip_k2 = … = ip_kn, t_k1 ≤ t_k2 ≤ … ≤ t_kn and n ≥ m.
Given these definitions, we have the following constraints:
(1)
(2)
(3)
(4)
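To make the data model concrete, the following is a minimal Python sketch of these structures; the class and field names (QueryRecord, query, ip, timestamp) are our own illustrative choices and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class QueryRecord:
    """One query submission I_i = (q_i, ip_i, t_i)."""
    query: str        # q_i: the submitted query terms
    ip: str           # ip_i: IP address of the issuing host
    timestamp: float  # t_i: submission time (seconds since epoch)

# A query transaction and a user session are both temporally ordered lists of
# QueryRecord objects sharing the same IP address; a transaction is additionally
# confined to one topic and a bounded, consecutive period.
Transaction = List[QueryRecord]
Session = List[QueryRecord]
```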
Because search engine users often reformulate their input queries by adding, deleting or changing some words of the original query string, we use the Levenshtein distance [3], a form of edit distance, to measure the degree of matching between query strings. It defines a set of edit operations, such as the insertion or deletion of a word, together with a cost for each operation. The distance between two query strings is then defined as the sum of the costs in the cheapest chain of edit operations transforming one query string into the other. For example, the Levenshtein distance between "adobe photoshop" and "photoshop" is 1.
Hence the similarity between two queries can be measured by a Levenshtein distance similarity, defined as:
(5)
where wn(.) is the number of words (or characters for Chinese queries) in a query.
The Levenshtein distance similarity alone is seldom applied to finding related queries, because it retrieves only closely matching queries and thus fails to discover related queries that share few or no terms, e.g. "search engine" and "google".
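Since Eq. (5) is not reproduced above, the following Python sketch assumes a common normalization, dividing the word-level Levenshtein distance by the word count of the longer query; the exact form used in the paper may differ.

```python
def levenshtein_distance(a: list, b: list) -> int:
    """Word-level Levenshtein distance: minimum number of word insertions,
    deletions and substitutions (unit cost each) turning sequence a into b."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        curr = [i]
        for j, wb in enumerate(b, 1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1,          # delete wa
                            curr[j - 1] + 1,      # insert wb
                            prev[j - 1] + cost))  # substitute wa with wb
        prev = curr
    return prev[-1]

def levenshtein_similarity(q1: str, q2: str) -> float:
    """Similarity in [0, 1]; the max-length normalization is an assumption
    standing in for Eq. (5)."""
    w1, w2 = q1.split(), q2.split()  # use list(q) instead for Chinese queries
    if not w1 and not w2:
        return 1.0
    dist = levenshtein_distance(w1, w2)
    return 1.0 - dist / max(len(w1), len(w2))

# e.g. levenshtein_similarity("adobe photoshop", "photoshop") == 0.5
```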
Our proposed model is based on the traditional association rule mining technique. For mining association rules of queries, we need to statistically measure the co-occurrences between queries in query transactions; so the quality of segmenting user sessions into query transactions is critical for mining related queries.
We developed a dynamic sliding window segmentation algorithm that adopts three time interval constraints: 1) the maximum interval allowed between adjacent query records within the same query transaction (α), 2) the maximum interval during which the user is allowed to be inactive (β), and 3) the maximum length of the time window a query transaction is allowed to span (γ), with α ≤ γ ≤ β. It also sets a lower bound θ on the Levenshtein distance similarity between adjacent queries to decide the borders of query transactions. We empirically set α, β, γ and θ to 5 minutes, 24 hours, 60 minutes and 0.4, respectively, in our experiments. The complexity of this algorithm is O(n). Figure 1 shows the pseudo-code of the segmentation algorithm.
```
Input:  A set of user sessions S = {S_1, S_2, S_3, …, S_n}, where S_k is a series of
        query records in temporal order, i.e. {I_k1, I_k2, …, I_kn} =
        {(q_k1, ip_k1, t_k1), (q_k2, ip_k2, t_k2), …, (q_kn, ip_kn, t_kn)}
Output: A set of query transactions T = {T_1, T_2, T_3, …, T_n}

Procedure SEGMENT
    transaction set T ← ∅
    sort S in temporally ascending order
    for each S_k in S
        transaction t ← new_empty_transaction
        append t to T
        t_pre ← t_k1              // timestamp of the previous query record
        t_cur_trans ← t_k1        // start time of the current transaction t
        for each I_ki in S_k
            if t_ki - t_pre ≤ α and t_ki - t_cur_trans ≤ γ then
                append I_ki to t
            else if t_ki - t_pre > β then
                t ← new_empty_transaction
                append t to T
                append I_ki to t
                t_cur_trans ← t_ki
            else
                find the last query record I_last_in_t in t, i.e. the one closest to I_ki
                compare the query q_last_in_t of I_last_in_t with the query q_ki of I_ki
                if q_last_in_t ≠ q_ki then
                    calculate similarity_Levenshtein(q_last_in_t, q_ki)
                    if similarity_Levenshtein(q_last_in_t, q_ki) < θ then
                        t ← new_empty_transaction
                        append t to T
                        t_cur_trans ← t_ki
                    end
                end
                append I_ki to t
            end
            t_pre ← t_ki
        end
    end
    return T
```

Figure 1. Dynamic Sliding Window Segmentation Algorithm
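For concreteness, here is a minimal runnable Python sketch of the pseudo-code in Figure 1. It assumes the QueryRecord class and levenshtein_similarity function from the earlier sketches, uses the empirical thresholds given above, and orders sessions by their first timestamp; it is an illustration, not the authors' implementation.

```python
# Thresholds from the paper's empirical settings, in seconds
ALPHA = 5 * 60        # max gap between adjacent records in one transaction
BETA = 24 * 60 * 60   # max inactivity gap before a forced new transaction
GAMMA = 60 * 60       # max time span of one transaction
THETA = 0.4           # min Levenshtein similarity at a tentative border

def segment(sessions):
    """Split each user session (a time-ordered list of QueryRecord)
    into query transactions, following Figure 1."""
    transactions = []
    # "sort S in temporally ascending order": here, by each session's first record
    for session in sorted(sessions, key=lambda s: s[0].timestamp):
        t = []
        transactions.append(t)
        t_pre = session[0].timestamp        # timestamp of the previous record
        t_cur_trans = session[0].timestamp  # start time of the current transaction
        for rec in session:
            if rec.timestamp - t_pre <= ALPHA and rec.timestamp - t_cur_trans <= GAMMA:
                t.append(rec)
            elif rec.timestamp - t_pre > BETA:
                t = [rec]                   # long inactivity: start a new transaction
                transactions.append(t)
                t_cur_trans = rec.timestamp
            else:
                # Tentative border: keep extending the transaction only if the new
                # query is identical or similar enough to the last query in it.
                last = t[-1]
                if (last.query != rec.query and
                        levenshtein_similarity(last.query, rec.query) < THETA):
                    t = []
                    transactions.append(t)
                    t_cur_trans = rec.timestamp
                t.append(rec)
            t_pre = rec.timestamp
    return transactions
```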
Our model is a modified-confidence version of the traditional association rule mining approach. Here we define Q = {q_1, q_2, q_3, …, q_n} as the set of unique queries in the query log, and T as the set of query transactions t. For each t there is a binary vector t[k] such that t[k] = 1 if transaction t contains a query record I_i that searched for query q_k, and t[k] = 0 otherwise. Let X be a non-empty subset of Q. A transaction t satisfies X if t[k] = 1 for all queries q_k in X.
The association rule is redefined as an implication X ⇒ q_j, where X ⊆ Q and q_j ∉ X. As we are interested only in finding related queries for a given initial input query, the set X contains only the initial input query q_i, i.e. X = {q_i}. The association rule in this problem therefore becomes q_i ⇒ q_j, where q_i ∈ Q, q_j ∈ Q and i ≠ j. Mining related queries is hence simplified to finding the statistical associations between the input query and all other queries.
The association rule q_i ⇒ q_j has a support factor of s if s% of the transactions in T satisfy both {q_i} and {q_j}, denoted q_i ⇒ q_j | s. We define the raw confidence factor of the association rule q_i ⇒ q_j to be rc if rc% of the transactions in T' satisfy {q_j}, where T' is the set of all transactions in T that satisfy {q_i}, denoted q_i ⇒ q_j | rc. We then combine the raw confidence factor with the Levenshtein distance similarity between q_i and q_j. The final confidence factor of q_i ⇒ q_j is calculated as:
(6)
Given an input query q_i, we calculate the support factor q_i ⇒ q_j | s and confidence factor q_i ⇒ q_j | c of every hypothesized association rule q_i ⇒ q_j (q_j ∈ Q, i ≠ j). We first apply a threshold min_support to the support factors to filter out association rules that are not statistically strong enough. Next, we rank the remaining association rules by their confidence factors. Finally, we select the top K queries (if available) in the ranked list and return them as the queries most related to the input query q_i.
The Levenshtein distance similarity is introduced in (6) as a non-linear, non-penalizing decaying factor. We found that the traditional association rule mining model favors frequent queries and often fails to retrieve infrequent queries that are highly similar to the input query. The non-linear, non-penalizing decaying factor promotes such queries in the ranked list without significantly penalizing the others.
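To illustrate the mining procedure, here is a minimal Python sketch under stated assumptions: the transactions are the output of segment() above, the min_support and k values are illustrative only, and, because Eq. (6) is not reproduced here, the combine() function is a hypothetical placeholder for the paper's confidence formula, not the actual one.

```python
from collections import Counter

def related_queries(input_query, transactions, min_support=0.001, k=20):
    """Rank queries related to input_query via association rules q_i => q_j
    mined from query transactions."""
    # Reduce each transaction to its set of distinct query strings
    tx_sets = [{rec.query for rec in t} for t in transactions]
    n = len(tx_sets)
    with_qi = [s for s in tx_sets if input_query in s]

    co_counts = Counter()
    for s in with_qi:
        for q in s:
            if q != input_query:
                co_counts[q] += 1

    scored = []
    for q, co in co_counts.items():
        support = co / n                 # share of all transactions containing both queries
        if support < min_support:
            continue                     # filter out statistically weak rules
        raw_conf = co / len(with_qi)     # share of q_i-transactions also containing q_j
        conf = combine(raw_conf, levenshtein_similarity(input_query, q))
        scored.append((conf, q))

    scored.sort(reverse=True)
    return [q for _, q in scored[:k]]

def combine(raw_conf, sim):
    """Placeholder for Eq. (6): boost raw confidence by the Levenshtein
    similarity without penalizing dissimilar queries (factor 1 when sim = 0).
    The exact non-linear form is an assumption, not the paper's formula."""
    return raw_conf * (1.0 + sim ** 2)
```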
We tested our method on a dataset collected from the query logs of the Tianwang (www.tianwang.com) search engine. It covers four months, from March 2003 to June 2003, and about 80% of its queries contain Chinese words. Approximately 14 million query records and 3 million distinct queries were identified.
We randomly selected 100 test input queries according to the overall frequency distribution; their frequencies range evenly from 50 to 75,975. For each input query we selected the top 20 returned queries, if available, for experimental evaluation. Overall precision rates were calculated after the relatedness of the retrieved queries was judged by a group of three human annotators. We compare three models: 1) the temporal correlation model [2] (TCM) as the baseline, 2) the traditional association rule mining model [1] (ARM), and 3) our improved association rule mining model (ARM_LDS). We also compare our dynamic sliding window segmentation algorithm (DSW SA) with the naïve segmentation algorithm (Naïve SA) proposed by Fonseca, et al. [1]. The experimental results are presented in Table 1 below.

Table 1. The Precision Rates (%) of Our Experimental Results
| Top K Queries | TCM | Naïve SA: ARM | Naïve SA: ARM_LDS | DSW SA: ARM | DSW SA: ARM_LDS |
|---|---|---|---|---|---|
| 1 | 56.65 | 91.86 | 94.65 | 95.35 | 97.65 |
| 5 | 60.47 | 85.60 | 89.73 | 90.88 | 93.64 |
| 10 | 54.88 | 81.11 | 85.44 | 88.45 | 90.59 |
| 15 | 50.63 | 75.76 | 80.88 | 86.05 | 89.88 |
| 20 | 44.32 | 71.66 | 76.29 | 83.29 | 88.44 |
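As a reference for how precision at rank K can be computed from the annotators' judgments, here is a minimal sketch; the judgments structure (one list of booleans per test query, marking each returned query as related or not) and the per-query averaging are our own assumptions, since the paper does not specify how the rates were aggregated.

```python
def precision_at_k(judgments, k):
    """Average precision over all test queries at cutoff k, as a percentage.

    judgments: one entry per test input query, each a list of booleans for
    the returned queries in ranked order (True if judged related)."""
    precisions = []
    for judged in judgments:
        top = judged[:k]
        if top:  # skip queries with no returned results
            precisions.append(sum(top) / len(top))
    return 100.0 * sum(precisions) / len(precisions)
```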
[1]. B. M. Fonseca, P. Golgher, B. Pôssas, et al. Concept-based interactive query expansion. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management (CIKM05), Bremen, Germany, 2005.
[2]. S. Chien, and N. Immorlica. Semantic similarity between search engine queries using temporal correlation. In Proceedings of the 14th International Conference on World Wide Web (WWW05), Chiba, Japan, 2005.
[3]. M. Gilleland. Levenshtein Distance, in Three Flavors. URL: http://www.merriampark.com/ld.htm.