We present in this work a genetic search strategy called GeniMiner for a search engine in the context of strategic watch. We highlight the relations that exist between Web statistical studies, search engines and optimization techniques. The user query is used to mathematically define a fitness function of Web pages. The genetic algorithm (GA) manages a population of pages and aims at maximizing this fitness function. We define a creation operator which uses the results given by standard search engines. The mutation operator consists in exploring links. We present an empirical evaluation.
search engines, genetic algorithms, meta-search, optimization
Searching the web is very similar to optimization problems: the web is a search space S structured as a graph with a neighborhood relation (Albert et al. 1999) (Broder et al. 2000). A fitness function f: S ® R+ can evaluate web pages (like for instance in Google (Brin and Page 1998)). A search engine tries to output pages which maximize f. The web can be explored with operators which are related to those used in optimization: creation operators that generate pages in S from scratch either randomly like for instance random IP generation (Lawrence and Giles 1999) or with heuristic principles like in meta-search engines which use themselves standard index-based search engines. Local search operators are also used. They consist in exploring the neighborhood of a given point in S by choosing links going out of that point, as local search operators do in optimization.
We propose in this work a search engine that is complementary to standard ones. We suppose that the user can wait for his results during several hours but provided that he spends a shorter time in manual analysis of the results. This is a crucial point in strategic watch. We are especially interested in the following problem: given a fixed number of pages evaluations (i.e. pages downloading, content analysis, etc), what strategy can hopefully maximize the user's gain (i.e. which pages/links should be explored in order to find interesting documents that will minimize the time devoted to manual analysis)? We use the fact that the GA solves (almost) optimally such a decision problem as demonstrated in (Holland 1975) on the 2-armed bandit problem.
Define the evaluation function f from the user query,
Pop ¬ Æ (Po is progressively filled in through steps 3, 4, 5)
Generate an offspring page O:
With probability (1-Pmut) (or if |Pop| < 2) Then O ¬ next page from standard search engines (creation operator)
With probability Pmut Then Select one parent page P from Pop and let O ¬ page indicated by a link in P (local exploration with a mutation operator),
Evaluate f(O)
Insert O in Pop if (|Pop| < PopMax) or if f(O) is greater than the fitness of the worst page in Pop which is deleted,
Go to step 3 or Stop (Pop is the output given to the user).
Table 1: the GA used in GeniMiner
The GA that we propose for Web mining is represented in table 1. An individual in the population is a Web page which can be numerically evaluated with the fitness function f. This function represents the user query and requires to download the page and analyze its content. It can take into account for instance keywords, regular expressions, links, size of document, referenced files (images, mp3, etc).
The creation operator queries standard search engines (Altavista, Google, Lycos, Voila, Yahoo) with the user's keywords. The mutation of a parent page P consists in 1) selecting P in the population (choosing the best between two randomly chosen individuals), 2) choosing one link l going out of P, and 3) proposing the so-pointed page as an offspring. We have sorted P's links according to their distance to the keywords in the text so that the most promising links are explored first.
Num | Keywords (K1, K2, ...) | Should (S1, S2, ...) | Should not (Sn1, Sn2, ...) |
1 | telecom crisis analysis | mobile future | France |
2 | fiber optic technology information | network | |
3 | text poet flower wind | Baudelaire | Rimbaud |
4 | wine excellent price good | Bourgueil | Bordeaux |
5 | buy cd music michael jackson | download mp3 | |
6 | mouse disney movie animation | DVD | |
7 | artificial ant algorithm | ||
8 | genetic algorithm artificial ant | experimental comparison | |
9 | javascript window opener | tutorial free | |
10 | dll export class template | code example |
Table 2: the queries used in our tests
We have compared GeniMiner with Pmut = 0.5 (GA dominates the search) to GeniMiner with Pmut = 0 (a meta-search where pages are generated with standard search engines only). We have tested also two evaluation functions for a page P: f1 is closer to the evaluation functions used in standard search engines (presence of all keywords (K1, K2, ...) + maximization of the number of keywords). f2 is more complex and takes into account the keywords but also the number of links going out of P (with a more important weight when they are close to keywords), additional words (S1, S2, ...) and (Sn1, Sn2, ...) which should (resp. should not) be present in P, the fact that the proportions of keywords are uniform. Both algorithms are given 1000 pages evaluations for each tested query (see table 2) and PopMax = 100 . A run represents from one to two hours of downloading and computing time on a very standard computer. We have performed a parameter study (Popmax, Pmut) which is not presented here.
Query | GeniMiner (Pmut = 0.5) | Meta-search (Pmut = 0) | ||||
Min | Max | Mean | Min | Max | Mean | |
1 | 18 | 177 | 35.4 | 20 | 177 | 39.5 |
2 | 37 | 856 | 119.0 | 42 | 856 | 131.0 |
3 | 33 | 922 | 144.4 | 46 | 8011 | 196.8 |
4 | 68 | 908 | 163.7 | 81 | 908 | 170.0 |
5 | 45 | 10006 | 324.8 | 27 | 10006 | 267.6 |
6 | 66 | 819 | 137.9 | 90 | 869 | 188.9 |
7 | 33 | 317 | 76.1 | 32 | 715 | 73.5 |
8 | 52 | 932 | 145.1 | 55 | 1017 | 171.1 |
9 | 68 | 672 | 132.3 | 68 | 672 | 140.1 |
10 | 42 | 489 | 94.9 | 54 | 2504 | 156.6 |
Table 3: comparative results for function f1 (last population)
Query | GeniMiner (Pmut = 0.5) | GeniMiner (Pmut = 0.3) | Meta-search (Pmut = 0) | ||||||
Min | Max | Mean | Min | Max | Mean | Min | Max | Mean | |
1 | 37.2 | 1034.4 | 83.2 | 39.5 | 1034.4 | 88.2 | 41.2 | 1034.4 | 91.3 |
2 | 102.7 | 810 | 212.9 | 109.8 | 825.5 | 222.5 | 99.3 | 794.5 | 205.4 |
3 | 48.8 | 1223.7 | 195.7 | 53.4 | 1839.1 | 182.1 | 54.6 | 1102.7 | 147.4 |
4 | 100.6 | 977.4 | 234.1 | 108.2 | 977.4 | 204.0 | 115.4 | 992.0 | 243.3 |
5 | 277.4 | 2229.7 | 439.6 | 219.8 | 2107.7 | 418.5 | 172.3 | 2107.7 | 375.2 |
6 | 150.3 | 841.4 | 256.7 | 151.8 | 1037.2 | 284.3 | 155.2 | 1037.2 | 281.7 |
7 | 63.7 | 575.3 | 129.3 | 59.7 | 575.3 | 120.3 | 51.4 | 575.3 | 96.6 |
8 | 122.7 | 1462.3 | 271.6 | 107.7 | 1510.8 | 273.5 | 129.3 | 1510.8 | 310.3 |
9 | 87.3 | 1097.0 | 198.2 | 78.8 | 1097.0 | 177.6 | 88.5 | 1097.0 | 198.4 |
10 | 99.2 | 2091.7 | 258.7 | 130.4 | 2091.7 | 328.8 | 141.5 | 3371.7 | 430.3 |
Table 4: comparative results for function f2
Results are presented in tables 3 and 4. Our conclusions are the following: when the evaluation function (like f1) is close to those used in standard search engines, then the results obtained by the GA are identical to those obtained with a meta-search. But when the evaluation function is different (like f2), then the results obtained by the GA can be more relevant compared to a meta-search. This means that the genetic search can be valuable when 1) the user can wait for a longer time than in standard search engines, 2) queries are more complex or more precise than a list of keywords. We argue that strategic watch is such a domain.
The main perspectives that can be derived from this work are the followings: the comparison to other agent-based search like (Menczer et al. 1995), the use of a parallel GA to speed up the search, the self-adaptation of parameters which will improve the GA performances, the clustering of the results, the interactive personalization of the search, the use of a large disk cache, the study of a real world application in the context of strategic watch.