Steve Lawrence and C. Lee Giles
NEC Research Institute,
4 Independence Way, Princeton, NJ 08540, U.S.A.
lawrence@research.nj.nec.com and
giles@research.nj.nec.com
The principal motivation behind the Inquirus meta search engine was the poor precision, limited coverage, limited availability, limited user interfaces, and out-of-date databases of the major Web search engines. Expanding on these points:
Poor precision. The diverse nature of the Web, and the focus of the Web search engines on handling relatively simple queries very quickly, lead to search results often having poor precision. Additionally, the practice of "search engine spamming" has become popular, whereby users add possibly unrelated keywords to their pages in order to alter the ranking of their pages. Our experience indicates that the relevance of a particular page is often obvious only after waiting for the page to load and finding the query term(s) in the page.
Limited coverage. Our experience with using different search engines suggested that the coverage of the individual engines was relatively low, i.e. searching with a second engine would often return several documents which were not returned by the first engine. The results of Selberg and Etzioni [5] suggest that the coverage of any one engine is limited.
Limited availability. Due to search engine and/or network difficulties, we have observed that the engine which responds the quickest varies over time.
Limited user interfaces. It is possible to add a number of features which enhance the usability of the search engines.
Out of date databases. Centralized search engine databases are always out of date. There is a time lag between the time when new information is made available and the time that it is indexed.
The idea of querying and collating results from multiple databases is not new. Companies like PLS (http://www.pls.com), Lexis-Nexis (http://www.lexis-nexis.com), DIALOG (http://www.dialog.com), and Verity (http://www.verity.com) have long since created systems which integrate the results of multiple heterogeneous databases [5]. Many Web meta search services exist such as the popular MetaCrawler and SavvySearch services [6,1].
One of the fundamental features of the Inquirus meta search engine is that it analyzes each document and displays local context around the query terms. The benefit of displaying the local context, rather than an abstract or query-insensitive summary of the document, is that the user may be able to more readily determine if the document answers his or her specific query. A user can therefore find documents of high relevance by quickly scanning the local context of the query terms. This technique is simple, but can be very effective, especially in the case of Web search where the database is very large, diverse, and poorly organized. Users indicate that the page summaries generated using local context allow them to assess the relevance of documents more easily and more rapidly. Recent work by Tombros (1997) agrees: Tombros considered the use of query biased summaries and performed a user study which showed that users working with the query biased summaries had a higher success rate. The query biased summaries allowed users to perform relevance judgments more accurately and more rapidly, and greatly reduced the need to refer to the full text of documents.
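The local context idea can be illustrated with a minimal Python sketch. The function name `local_context`, the `width` parameter, and the case-insensitive matching are assumptions made for this illustration; the text does not specify how Inquirus chooses the amount of context or matches terms.

```python
import re

def local_context(text, terms, width=40):
    """Return one snippet per query-term occurrence.

    Each snippet holds the matched term plus up to `width` characters on
    either side. `width` is a hypothetical parameter, not taken from the paper.
    """
    snippets = []
    for term in terms:
        # Escape the term so it is matched literally, ignoring case.
        pattern = re.compile(re.escape(term), re.IGNORECASE)
        for match in pattern.finditer(text):
            start = max(0, match.start() - width)
            end = min(len(text), match.end() + width)
            # Mark truncation with an ellipsis, as in the sample results.
            prefix = "..." if start > 0 else ""
            suffix = "..." if end < len(text) else ""
            snippets.append(prefix + text[start:end] + suffix)
    return snippets

page = "Digital image watermarking embeds invisible information in images."
print(local_context(page, ["watermarking"], width=6))
```

A query-insensitive summary would show the same text for every query, whereas the snippets here change with the query terms.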
The display of local context does not require the use of multiple search engines and can be very useful even if only one engine is used. However, as with other meta search engines, Inquirus makes parallel queries to multiple search engines. The major features of the Inquirus meta search engine include displaying the context of the query terms, advanced duplicate detection, progressive display of results, highlighting query terms in the pages when viewed, insertion of quick jump links for finding the query terms in large pages, dramatically improved precision for certain queries by using specific expressive forms, and improved relevancy ranking. A more complete list follows:
where Np is the number of query terms that are present in the document (each term is counted only once), Nt is the total number of occurrences of query terms in the document (each term is counted as many times as it appears), d(i, j) is the minimum distance between the ith and jth of the query terms which are present in the document (currently measured in characters), c1 is a constant which controls the overall magnitude of R, c2 is a constant specifying the maximum distance between query terms which is considered useful, and c3 is a constant specifying the importance of term frequency (currently c1 = 100, c2 = 5000, and c3 = 10 c1). When there is only one query term we currently use the distance from the start of the page to the first occurrence of the term as an indicator of relevance.
We have found that this ranking criterion can be particularly useful with Web searches. A query for multiple terms on the Web often returns documents which contain all terms, but the terms are far apart in the document and may be in unrelated sections of the page, e.g. in separate Usenet messages archived on a single Web page, or in separate bookmarks on a page containing a list of bookmarks.
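The ranking expression itself is not reproduced in this text, so the following Python sketch assumes one form that is consistent with the definitions above: R = c1 Np + (c2 - mean over term pairs of min(d(i, j), c2)) / (c2 / c1) + Nt / c3. This form, the helper names, the use of start offsets for distances, and the handling of the single-term case are all assumptions for illustration and may differ from the authors' exact criterion.

```python
import re
from itertools import combinations

C1 = 100.0      # overall magnitude (value quoted in the text)
C2 = 5000.0     # maximum useful distance between terms, in characters
C3 = 10 * C1    # importance of term frequency

def term_positions(text, terms):
    """Map each query term found in `text` to the start offsets of its occurrences."""
    lowered = text.lower()
    found = {}
    for term in terms:
        hits = [m.start() for m in re.finditer(re.escape(term.lower()), lowered)]
        if hits:
            found[term] = hits
    return found

def min_distance(a, b):
    """Smallest character distance between any occurrence in a and any in b."""
    return min(abs(x - y) for x in a for y in b)

def relevance(text, terms, c1=C1, c2=C2, c3=C3):
    """Assumed proximity score; higher means the terms are present and close together."""
    found = term_positions(text, terms)
    n_p = len(found)                                  # distinct terms present
    n_t = sum(len(hits) for hits in found.values())   # total occurrences
    if n_p == 0:
        return 0.0
    if n_p == 1:
        # Assumption: use the offset of the first occurrence as the "distance".
        mean_d = min(min(next(iter(found.values()))), c2)
    else:
        pairs = list(combinations(found.values(), 2))
        mean_d = sum(min(min_distance(a, b), c2) for a, b in pairs) / len(pairs)
    return c1 * n_p + (c2 - mean_d) / (c2 / c1) + n_t / c3

near = "image watermarking is useful"
far = "image" + " x" * 3000 + " watermarking"
print(relevance(near, ["image", "watermarking"]) > relevance(far, ["image", "watermarking"]))
```

Under this assumed form, a page with both terms adjacent scores close to c1 higher than a page where the same terms are more than c2 characters apart, and term frequency acts only as a small tie-breaker.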
Figures 1 and 2 show a sample response of the Inquirus meta search engine for the query "image watermarking". The search form can be seen at the top, followed by links to the individual engine responses and a tip which may be query sensitive. Results which contain all of the query terms are then displayed as they are retrieved and analyzed (if none of the first few pages contain all of the query terms then the engine initially displays results which contain the maximum number of query terms found in a page so far). The bars to the left of the document titles indicate how close the query terms are in the documents (or how close they are to the start of the document for a single term); longer bars indicate that the query terms are closer together. The engine which found the document (e.g. A = AltaVista), the age of the document (e.g. 1m = 1 month), the size of the document, and the URL follow the document title.
After all pages have been retrieved, the engine then displays the top 20 pages ranked using term proximity information. The engine then displays those pages which contain fewer query terms, those pages which contain none of the query terms, those pages which contain duplicate context strings, and those pages which could not be downloaded. Links to the search engine pages which were used are then provided. Finally, the engine displays a summary box with information on the number of documents found from each individual engine, the number retrieved and processed, and the number of duplicates. Options for Inquirus include which set of search engines to use (e.g. Web search engines or Usenet search engines), the maximum number of hits, the amount of local context to display, etc.
Figure 3 shows a sample of how the individual pages are processed when viewed. The links at the top jump to the first occurrence of the query terms in the document, and indicate the number of occurrences. The [Track Page] link activates tracking for this page - the user will be informed when and how the document changes.
[Figures 1 and 2: sample response for the query "image watermarking". Recovered result text:]

Related Technologies    H 3m 1k    http://image.hp.com/htdocs_present/iwhpoverview/tsld018.htm
... Related Technologies Related Technologies Image Content-Based Retrieval Research activities at HP Laboratories Early experiments on the WWW Image Watermarking Visible and Invisible systems For security, attribution, inventory tracking Smart Cards, Security and E-commerce Secure, Internet-based Commerce HP Imagine Card, Praes...

Watermark Example 1    A 1y 1k    http://www.thomtech.com/mmedia/becker/wmark.htm
... Watermark Example 1 Multimedia Lab Digital Image Watermarking This site is still under construction..... Original Image Watermarked Image Here's the watermark Click here to see another example ... /... Watermark Example 1 Multimedia Lab Digital Image Watermarking This site is still under construction..... Original Image Watermarked Image Here's the watermark Click here to see another example ...

PC WEEK: A lasting way for artists to leave their mark    I 11m 8k    http://www8.zdnet.com/pcweek/reviews/1209/09mark.html
...ir mark Digimarc's watermark technology embeds "invisible" digital information in computer-generated images By Herb Bethoney Using Digimarc Corp.'s PictureMarc image watermarking technology, illustrators and photographers will be able to copyright their work with a persistent "watermark" that is virtually imperceptible until read by Digimarc's software reader. Although the ...

Error 404 Not found Computer Software Vendors    D    http://guide.sbanetweb.com/softD.html
The engine consists of two main logical parts: the meta search code and a parallel page retrieval daemon. Pseudocode for (a simplified version of) the search code is as follows:
Process the request to check syntax and create regular expressions which are used to match query terms
Send requests (modified appropriately) to all relevant search engines
Loop for each page retrieved until maximum number of results or all pages retrieved
    If page is from a search engine
        Parse search engine response extracting hits and any link for the next set of results
        Send requests for all of the hits
        Send request for the next set of results if applicable
    Else
        Check page for query terms and create context strings if found
        Print page information and context strings if all query terms are found and duplicate context strings have not been encountered before
    Endif
End loop
Re-rank pages using proximity and term frequency information
Print page information and context strings for pages which contained some but not all query terms
Print page information for pages which contained no query terms
Print page information and context strings for pages which contain duplicate context strings
Print page information for pages which could not be downloaded
Print summary statistics
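The duplicate check on context strings used in the loop can be sketched in Python as follows. The function name `split_duplicates` and the normalization step (lowercasing and collapsing whitespace) are assumptions for this illustration; the text does not say how context strings are compared.

```python
def split_duplicates(results):
    """Separate results whose context string has already been seen.

    `results` is a list of (url, context) pairs in retrieval order.
    Returns (unique_urls, duplicate_urls).
    """
    seen = set()
    unique, duplicates = [], []
    for url, context in results:
        # Assumed normalization: ignore case and whitespace differences.
        key = " ".join(context.lower().split())
        if key in seen:
            duplicates.append(url)
        else:
            seen.add(key)
            unique.append(url)
    return unique, duplicates

results = [
    ("http://a.example/page", "Digital  Image Watermarking"),
    ("http://mirror.example/page", "digital image watermarking"),
    ("http://b.example/other", "Visible and invisible systems"),
]
print(split_duplicates(results))
```

Comparing context strings rather than URLs lets mirrored copies of the same page, which have different addresses, be grouped together.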
Figure 4 shows a simplified control flow diagram of the meta search engine. The page retrieval engine is relatively simple but does incorporate features such as queuing requests and balancing the load from multiple search processes, and delaying requests to the same site to prevent overloading a site. The page retrieval engine consists of a dispatch daemon and a number of client retrieval processes. The client processes simply retrieve the relevant pages, handling errors and timeouts, and return the pages directly to the appropriate search process.
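The idea of delaying requests to the same site can be illustrated with a small Python sketch that plans dispatch times rather than performing any network access. The function name `schedule`, the `per_host_delay` parameter, and its value are hypothetical; the text does not describe the dispatch daemon's actual policy.

```python
from urllib.parse import urlsplit

def schedule(urls, now=0.0, per_host_delay=1.0):
    """Assign each URL the earliest dispatch time such that requests to the
    same host are at least `per_host_delay` seconds apart.

    Returns a list of (time, url) pairs sorted by dispatch time.
    """
    next_free = {}   # host -> earliest time the next request may be sent
    plan = []
    for url in urls:
        host = urlsplit(url).netloc.lower()
        when = max(now, next_free.get(host, now))
        plan.append((when, url))
        next_free[host] = when + per_host_delay
    # The sort is stable, so URLs with equal times keep their queue order.
    plan.sort(key=lambda item: item[0])
    return plan

queue = ["http://a.example/1", "http://a.example/2", "http://b.example/1"]
for when, url in schedule(queue, per_host_delay=2.0):
    print(when, url)
```

Requests to different hosts are dispatched immediately and in parallel, while the second request to the same host waits, which is the behavior the paragraph above attributes to the retrieval engine.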
Accurate information retrieval is difficult because information can be represented in many ways; an optimal retrieval system would need to incorporate semantics and understand natural language. Research in information retrieval often considers techniques aimed at improving recall, e.g. word stemming and query expansion. These techniques can decrease precision, especially in a database as diverse as the Web. The World Wide Web contains a lot of redundancy: information is often contained multiple times and expressed in different forms across the Web. In the limit where all information is expressed in all possible ways, high precision information retrieval would be relatively simple: one would only need to search for one particular way of expressing the information. While such a goal will never be reached for all information, our experiments indicate that the Web is already redundant enough for an approach based on searching for specific ways of expressing information to be effective for certain retrieval tasks.
Our proposed method is to transform queries in the form of a question into specific forms for expressing the answer. For example, the query What does NASDAQ stand for? is transformed into the query "NASDAQ stands for" "NASDAQ is an abbreviation" "NASDAQ means". Clearly the information may be contained in a different form from these three possibilities; however, if the information does exist in one of these forms, then there is a high likelihood that finding these phrases will provide the answer to the query. The technique thus trades recall for precision. The Inquirus meta search engine currently uses the specific expressive forms (SEF) technique for a number of queries, e.g. What [is|are] x?, What [causes|creates|produces] x?, What does x [stand for|mean]?, [Why|how] [is|are] (a|the) x y?, etc. As an example of the transformations, What does x [stand for|mean]? is currently converted to "x stands for" "x is an abbreviation" "x means", and What [causes|creates|produces] x? is currently transformed to "x is caused" "x is created" "causes x" "produces x" "makes x" "creates x". Different search engines use different stop words (common words that are not indexed, e.g. "the") and relevance measures, and this tends to result in some engines returning many pages not containing the SEFs. We therefore filter out the offending phrases from the queries for the relevant engines.
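The two transformations quoted above can be sketched in Python as follows. The regular expressions, the rule table `SEF_RULES`, and the function name `to_sef` are assumptions made for this illustration; only the output phrases are taken from the text, and returning non-matching queries unchanged is likewise an assumption.

```python
import re

# Each rule pairs a question pattern with the phrase templates quoted in the text.
SEF_RULES = [
    (re.compile(r"^what does (?P<x>.+?) (?:stand for|mean)\?*$", re.IGNORECASE),
     ['"{x} stands for"', '"{x} is an abbreviation"', '"{x} means"']),
    (re.compile(r"^what (?:causes|creates|produces) (?P<x>.+?)\?*$", re.IGNORECASE),
     ['"{x} is caused"', '"{x} is created"', '"causes {x}"',
      '"produces {x}"', '"makes {x}"', '"creates {x}"']),
]

def to_sef(query):
    """Rewrite a question as quoted answer phrases, or return it unchanged."""
    q = query.strip()
    for pattern, templates in SEF_RULES:
        match = pattern.match(q)
        if match:
            x = match.group("x")
            return " ".join(t.format(x=x) for t in templates)
    return q  # assumption: queries matching no rule are sent as-is

print(to_sef("What does NASDAQ stand for?"))
print(to_sef("What causes rain?"))
```

Keeping the rules in a table makes it straightforward to drop individual phrases for engines whose stop words would break them, in the spirit of the per-engine filtering described above.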
Figure 5 shows the response of the Inquirus meta search engine for the query What does NASDAQ stand for? The answer to the query is contained in the local context displayed for 9 out of the first 10 pages. In contrast, the response of standard search engines often does not contain the answer to the query in any of the documents listed on the first page, even for engines which list support for natural language queries.
It is reasonable to expect that the amount of easily accessible information will increase over time, and therefore that the viability of the specific expressive forms technique will improve over time. An extension which we have not currently implemented is to define an order over the various SEFs, e.g. "x stands for" may result in higher precision for the query What does x stand for? than the phrase "x means". If none of the SEFs are found then the engine could fall back to a standard query.
[Figure 5: response for the query What does NASDAQ stand for? Recovered context strings:]

Ref:...ex Search Previous Next Subject: Exchanges - The NASDAQ Last-Revised: 25 Oct 1996 From: billmanr@aol.com , jeffwben@aol.com , lott@invest-faq.com NASDAQ is an abbreviation for the National Association of Securities Dealers Automated Quotation system. It is also commonly, and confusingly, called the OTC market. Visit their home page: http://www.nasdaq.com...

Ref:... - Subject: Exchanges - The NASDAQ Last-Revised: 25 Oct 1996 From: billmanr@aol.com , jeffwben@aol.com , cml@cs.umd.edu NASDAQ is an abbreviation for the National Association of Securities Dealers Automated Quotation system. It is also commonly, and confusingly, called the OTC market. Visit their home page: http://www.nasdaq.com The NASD...

Ref:...on of Securities Dealers, Inc., the selfregulatory organization of the securities industry responsible for the operation and regulation of the NASDAQ stock market and overthecounter markets. NASDAQ Stands for the National Association of Securities Dealers Automated Quotation System. A nationwide computerized quotation system for current bid and asked quotations on over 5,500 over-the-counter stocks. ...
A simple analysis of page retrieval times leads to some interesting conclusions. Table 2 shows the median time for each of six major search engines to respond, along with the median time for the first of the six engines to respond when queries are made simultaneously to all engines, and the median time for the Inquirus search engine to display the first result. It can be seen that, on average, the parallel architecture of Inquirus allows it to find, download and analyze the first page faster than the standard search engines can produce a result, even though the standard engines do not download and analyze the pages. The Inquirus engine is surprisingly fast, with the only user comments regarding speed currently being that the engine is fast. These results are from 1,000 queries performed during September 1997, and we note that the relative speed of the search engines varies significantly over time, and depends on the location of the accessing site.
Search engine | Median time for response (seconds) |
AltaVista | 0.9 |
Infoseek | 1.3 |
HotBot | 2.6 |
Excite | 5.2 |
Lycos | 2.8 |
Northern Light | 7.5 |
All engines (average) | 2.7 |
First of 6 search engines | 0.8 |
First result from the Inquirus meta search engine | 1.3 |
One potential drawback of the Inquirus search engine is that it uses significantly more bandwidth than other search engines. Although none of the current users have expressed concern, we expect that some Internet users may be concerned. The additional bandwidth requirements could limit the number of users who can simultaneously use a server-based implementation, or present a disadvantage if Internet access is charged according to the volume of data transmitted. We note simply that the bandwidth requirements are not great compared to the increasing use of audio and video on the Web, and that, even if bandwidth requirements are important now, they will be less important in the future. Certainly, Inquirus is far more efficient than brute force search of the Web.
The Inquirus meta search engine demonstrates that real-time analysis of documents returned from Web search engines is feasible. In fact, calling the Web search engines and downloading Web pages in parallel allows the Inquirus meta search engine to, on average, display the first result more quickly than a standard search engine. User feedback indicates that the display of real-time local context around query terms, and the highlighting of query terms in the documents when viewed, significantly improve the efficiency of searching the Web.