Inquirus, The NECI Meta Search Engine

The principle motivation behind the Inquirus meta search engine was the poor precision, limited coverage, limited availability, limited user interfaces, and out of date databases of the major Web search engines. Expanding on these points:

Poor precision. The diverse nature of the Web, and the focus of the Web search engines on handling relatively simple queries very quickly, leads to search results often having poor precision. Additionally, the practice of ``search engine spamming'' has become popular, whereby users add possibly unrelated keywords to their pages in order to alter the ranking of their pages. Our experience indicates that the relevance of a particular page is often obvious only after waiting for the page to load and finding the query term(s) in the page.

Limited coverage. Our experience with using different search engines suggested that the coverage of the individual engines was relatively low, i.e. searching with a second engine would often return several documents which were not returned by the first engine. The results of Selberg and Etzioni [5] suggest that the coverage of any one engine is limited.

Limited availability. Due to search engine and/or network difficulties, we have observed that the engine which responds the quickest varies over time.

Limited user interfaces. It is possible to add a number of features which enhance the usability of the search engines.

Out of date databases. Centralized search engine databases are always out of date. There is a time lag between the time when new information is made available and the time that it is indexed.

3. Previous work

The idea of querying and collating results from multiple databases is not new. Companies like PLS (http://www.pls.com), Lexis-Nexis (http://www.lexis-nexis.com), DIALOG (http://www.dialog.com), and Verity (http://www.verity.com) have long since created systems which integrate the results of multiple heterogeneous databases [5]. Many Web meta search services exist such as the popular MetaCrawler and SavvySearch services [6,1].

4. The Inquirus meta search engine

One of the fundamental features of the Inquirus meta search engine is that it analyzes each document and displays local context around the query terms. The benefit of displaying the local context, rather than an abstract or query-insensitive summary of the document, is that the user may be able to more readily determine if the document answers his or her specific query. A user can therefore find documents of high relevance by quickly scanning the local context of the query terms. This technique is simple, but can be very effective, especially in the case of Web search where the database is very large, diverse, and poorly organized. Users indicate that the page summaries generated using local context allow them to assess the relevance of documents more easily and more rapidly. Recent work by Tombros (1997) agrees: Tombros considered the use of query biased summaries and performed a user study which showed that users working with the query biased summaries had a higher success rate. The query biased summaries allowed users to perform relevance judgments more accurately and more rapidly, and greatly reduced the need to refer to the full text of documents.

The display of local context does not require the use of multiple search engines and can be very useful even if only one engine is used. However, as with other meta search engines, Inquirus makes parallel queries to multiple search engines. The major features of the Inquirus meta search engine include displaying the context of the query terms, advanced duplicate detection, progressive display of results, highlighting query terms in the pages when viewed, insertion of quick jump links for finding the query terms in large pages, dramatically improved precision for certain queries by using specific expressive forms, and improved relevancy ranking. A more complete list follows:

5. Operation

Figures 1 and 2 show a sample response of the Inquirus meta search engine for the query "image watermarking". The search form can be seen at the top, followed by links to the individual engine responses and a tip which may be query sensitive. Results which contain all of the query terms are then displayed as they are retrieved and analyzed (if none of the first few pages contain all of the query terms then the engine initially displays results which contain the maximum number of query terms found in a page so far). The bars to the left of the document titles indicate how close the query terms are in the documents (or how close they are to the start of the document for a single term) — longer bars indicate that the query terms are closer together. The engine which found the document (e.g. A = AltaVista), the age of the document (e.g. 1m = 1 month), the size of the document, and the URL follow the document title.

After all pages have been retrieved, the engine then displays the top 20 pages ranked using term proximity information. The engine then displays those pages which contain fewer query terms, those pages which contain none of the query terms, those pages which contain duplicate context strings, and those pages which could not be downloaded. Links to the search engine pages which were used are then provided. Finally, the engine displays a summary box with information on the number of documents found from each individual engine, the number retrieved and processed, and the number of duplicates. Options for Inquirus include which set of search engines to use (e.g. Web search engines or Usenet search engines), the maximum number of hits, the amount of local context to display, etc.

Figure 3 shows a sample of how the individual pages are processed when viewed. The links at the top jump to the first occurrence of the query terms in the document, and indicate the number of occurrences. The [Track Page] link activates tracking for this page - the user will be informed when and how the document changes.

Searching for: "image watermarking " using: HotBot Infoseek AltaVista Excite Lycos Northern Light Yahoo WebCrawler.

Tip: The query term links in the "Searching for" line lead to the Webster dictionary definitions.

Digital Image Watermarking A 1m 3k http://www.thomtech.com/tlab/projects/digh20.htm
...Digital Image Watermarking ... /... Digital Image Watermarking Digital Image Watermarking As publishers make digital images, audio, and video available on CDROM or online, unauthorized use is a g... /... Digital Image Watermarking Digital Image Watermarking As publishers make digital images, audio, and video available on CDROM or online, unauthorized use is a greater threat. Image watermarking - a process that encodes a signal or patter...

Related Technologies H 3m 1k http://image.hp.com/htdocs_present/iwhpoverview/tsld018.htm
... Related Technologies Related Technologies Image Content-Based Retrieval Research activities at HP Laboratories Early experiments on the WWW Image Watermarking Visible and Invisible systems For security, attribution, inventory tracking Smart Cards, Security and E-commerce Secure, Internet-based Commerce HP Imagine Card, Praes...

Watermark Example 1 A 1y 1k http://www.thomtech.com/mmedia/becker/wmark.htm
... Watermark Example 1 Multimedia Lab Digital Image Watermarking This site is still under construction..... Original Image Watermarked Image Here's the watermark Click here to see another example ... /... Watermark Example 1 Multimedia Lab Digital Image Watermarking This site is still under construction..... Original Image Watermarked Image Here's the watermark Click here to see another example ...

PC WEEK: A lasting way for artists to leave their mark I 11m 8k http://www8.zdnet.com/pcweek/reviews/1209/09mark.html
...ir mark Digimarc's watermark technology embeds ``invisible'' digital information in computer-generated images By Herb Bethoney Using Digimarc Corp.'s PictureMarc image watermarking technology, illustrators and photographers will be able to copyright their work with a persistent "watermark" that is virtually imperceptible until read by Digimarc's software reader. Although the ...

Master Thesis Project: Analysis of Image Watermarking Methods H 5m 2k http://www.it.isy.liu.se/students/master/marking.html.en
...Master Thesis Project: Analysis of Image Watermarking Methods... /... Master Thesis Project: Analysis of Image Watermarking Methods Link ping University , Department of Electrical Engineering , Information Theory Division , Student Information Master Thesis Project in Information Theory Ana... /... Link ping University , Department of Electrical Engineering , Information Theory Division , udent Information Master Thesis Project in Information Theory Analysis of Image Watermarking Methods Svensk version Examiner: Niclas Wiberg, nicwi@isy.liu.se Location: Link ping University, Department of Electrical Engineering Other...

[ ... section deleted ... ]

Ranked pages:

978

Digital Image Watermarking A 1m 3k http://www.thomtech.com/tlab/projects/digh20.htm
...Digital Image Watermarking ... /... Digital Image Watermarking Digital Image Watermarking As publishers make digital images, audio, and video available on CDROM or online, unauthorized use is a g... /... Digital Image W ermarking Digital Image Watermarking As publishers make digital images, audio, and video available on CDROM or online, unauthorized use is a greater threat. Image watermarking - a process that encodes a signal or patter...

976

Steganography - Image watermarking I 0d 4k http://www.cl.cam.ac.uk/users/fapp2/steganography/image_watermarking/index.html
...Steganography - Image watermarking... ...http://www.cl.cam.ac.uk/users/fapp2/steganography/image_watermarking/index.html... /... Steganography - Image watermarking Image Watermarking Weakness of Existing Schemes 2Mosaic S ome companies have proposed an automatic system for...

957

Abstract... A 2m 17k http://www.comm.toronto.edu/~deepa/abstracts.html
... Abstract... A Robust Digital Image Watermarking Scheme using Wavelet-Based Fusion D. Kundur and D. Hatzinakos 1997 IEEE International Conference on Image Processing Abstract We present an approach for still image watermarking in whi... /...Digital Image Watermarking Scheme using Wavelet-Based Fusion D. Kundur and D. Hatzinakos 1997 IEEE nternational Conference on Image Processing Abstract We present an approach for still image watermarking in which the watermark embedding process employs multiresolution fusion techniques and incorporates a model of the human visual system (HVS). The original unmarked image is required to extract the ...

[ ... section deleted ... ]

No search terms were found in these documents:

USENET: sci.research.postdoc : PhD Scholarship L - 2k http://mineral.umd.edu/usenet/sci.research.postdoc/0042.html
USENET: sci.research.postdoc : PhD Scholarship in Operational Research - Imperial College PhD Scholarship

[ ... section deleted ... ]

Pages with duplicate context strings to a page above:

UCR/CMP - Sponsors E - 3k http://www.cmp.ucr.edu/sponsors/

UCR/CMP - Sponsors H - 3k http://138.23.124.101/sponsors/sponsors.html
...al airline of the Museum! Adobe Systems - imaging software Digimarc Corporation , the digital image watermarking company Micropolis, Inc. - file storage Pacific Lightwave - DS1 connectivity Pacific Coast...

[ ... section deleted ... ]

These documents no longer exist:

Error 404 Not found Computer Software Vendors D http://guide.sbanetweb.com/softD.html

[ ... section deleted ... ]

Engine	Response	Total	Retrieved	Processed	Duplicates
AltaVista	Yes	139	50	36	14
Excite	No	-	-	-	-
HotBot	Yes	84	25	23	15
Infoseek	Yes	72	62	39	9
Lycos	Yes	6	6	4	0
Northern Light	Yes	88	25	22	9
WebCrawler	Yes	4	4	4	1
Yahoo	Yes	0	0	0	0
Total		393	172	128	48
More documents were found but the maximum number of hits was reached.

The engine consists of two main logical parts: the meta search code and a parallel page retrieval daemon. Pseudocode for (a simplified version of) the search code is as follows:

Figure 4 shows a simplified control flow diagram of the meta search engine. The page retrieval engine is relatively simple but does incorporate features such as queuing requests and balancing the load from multiple search processes, and delaying requests to the same site to prevent overloading a site. The page retrieval engine consists of a dispatch daemon and a number of client retrieval processes. The client processes simply retrieve the relevant pages, handling errors and timeouts, and return the pages directly to the appropriate search process.

6. Specific expressive forms

Accurate information retrieval is difficult due to the possibility of information being represented in many ways — requiring an optimal retrieval system to incorporate semantics and understand natural language. Research in information retrieval often considers techniques aimed at improving recall, e.g. word stemming and query expansion. It is possible for these techniques to decrease precision, especially in a database as diverse as the Web. The World Wide Web contains a lot of redundancy. Information is often contained multiple times and expressed in different forms across the Web. In the limit where all information is expressed in all possible ways, high precision information retrieval would be relatively simple — one would only need to search for one particular way of expressing the information. While such a goal will never be reached for all information, our experiments indicate that the Web is already sufficient for an approach based on searching for specific ways of expressing information to be effective for certain retrieval tasks.

Our proposed method is to transform queries in the form of a question, into specific forms for expressing the answer. For example, the query What does NASDAQ stand for? is transformed into the query "NASDAQ stands for" "NASDAQ is an abbreviation" "NASDAQ means". Clearly the information may be contained in a different form to these three possibilities, however if the information does exist in one of these forms, then there is a high likelihood that finding these phrases will provide the answer to the query. The technique thus trades recall for precision. The Inquirus meta search engine currently uses the specific expressive forms (SEF) technique for a number of queries, e.g. What [is|are] x?, What [causes|creates|produces] x?, What does x [stand for|mean]?, [Why|how] [is|are] (a|the) x y?, etc. As an example of the transformations, What does x [stand for|mean]? is currently converted to "x stands for" "x is an abbreviation" "x means", and What [causes|creates|produces] x? is currently transformed to "x is caused" "x is created" "causes x" "produces x" "makes x" "creates x". Different search engines use different stop words (common words that are not indexed, e.g. "the") and relevance measures, and this tends to result in some engines returning many pages not containing the SEFs. We therefore filter out the offending phrases from the queries for the relevant engines.

Figure 5 shows the response of the Inquirus meta search engine for the query What does NASDAQ stand for?. The answer to the query is contained in the local context displayed for 9 out of the first 10 pages. In contrast, the response of standard search engines often does not contain the answer to the query in any of the documents listed on the first page, even for engines which list support for natural language queries.

It is reasonable to expect that the amount of easily accessible information will increase over time, and therefore that the viability of the specific expressive forms technique will improve over time. An extension which we have not currently implemented is to define an order over the various SEFs, e.g. "x stands for" may result in higher precision for the query What does x stand for? than the phrase "x means". If none of the SEFs are found then the engine could fall back to a standard query.

[ ... section deleted ... ]
Ref:...ex Search Previous Next Subject: Exchanges - The NASDAQ Last-Revised: 25 Oct 1996 From: billmanr@aol.com , jeffwben@aol.com , lott@invest-faq.com NASDAQ is an abbreviation for the National Association of Securities Dealers Automated Quotation system. It is also commonly, and confusingly, called the OTC market. Visit their home page: http://www.nasdaq.com...
Ref:... - Subject: Exchanges - The NASDAQ Last-Revised: 25 Oct 1996 From: billmanr@aol.com , jeffwben@aol.com , cml@cs.umd.edu NASDAQ is an abbreviation for the National Association of Securities Dealers Automated Quotation system. It is also commonly, and confusingly, called the OTC market. Visit their home page: http://www.nasdaq.com The NASD...
Ref:...on of Securities Dealers, Inc., the selfregulatory organization of the securities industry responsible for the operation and regulation of the NASDAQ stock market and overthecounter markets. NASDAQ Stands for the National Association of Securities Dealers Automated Quotation System. A nationwide computerized quotation system for current bid and asked quotations on over 5,500 over-the-counter stocks. ...

[ ... section deleted ... ]

7. Efficiency

A simple analysis of page retrieval times leads to some interesting conclusions. Table 2 shows the median time for each of six major search engines to respond, along with the median time for the first of the six engines to respond when queries are made simultaneously to all engines, and the median time for the Inquirus search engine to display the first result. It can be seen that, on average, the parallel architecture of Inquirus allows it to find, download and analyze the first page faster than the standard search engines can produce a result, even though the standard engines do not download and analyze the pages. The Inquirus engine is surprisingly fast, with the only user comments regarding speed currently being that the engine is fast. These results are from 1,000 queries performed during September 1997, and we note that the relative speed of the search engines varies significantly over time, and depends on the location of the accessing site.

One potential drawback of the Inquirus search engine is that it uses significantly more bandwidth than other search engines. Although none of the current users have expressed concern, we expect that some Internet users may be concerned. The additional bandwidth requirements could limit the number of users which can simultaneously use a server based implementation, or present a disadvantage if Internet access is charged according to the volume of data transmitted. We note simply that the bandwidth requirements are not great compared to the increasing use of audio and video on the Web, and that, even if bandwidth requirements are important now, they will be less important in the future. Certainly, Inquirus is far more efficient than brute force search of the Web.

Table 2. Median time for a response from the search engines and the meta engine; the meta engine displays the first result faster than the average time taken by a standard search engine
Search engine	Median time for response (seconds)
AltaVista	0.9
Infoseek	1.3
HotBot	2.6
Excite	5.2
Lycos	2.8
Northern Light	7.5
All engines (average)	2.7
First of 6 search engines	0.8
First result from the Inquirus meta search engine	1.3

8. Conclusions

The Inquirus meta search engine demonstrates that real-time analysis of documents returned from Web search engines is feasible. In fact, calling the Web search engines and downloading Web pages in parallel allows the Inquirus meta search engine to, on average, display the first result quicker than using a standard search engine. User feedback indicates that the display of real-time local context around query terms, and the highlighting of query terms in the documents when viewed, significantly improves the efficiency of searching the Web.

Inquirus, the NECI meta search engine

1. Introduction

2. Motivation

3. Previous work

4. The Inquirus meta search engine

5. Operation

6. Specific expressive forms

7. Efficiency

8. Conclusions

References

Vitae