Dynamic Reference Sifting: A Case Study In The Homepage Domain

Jonathan Shakes, Marc Langheinrich & Oren Etzioni
Department of Computer Science and Engineering
University of Washington
Seattle, Washington 98195-2350, USA
jshakes@cs.washington.edu, marclang@cs.washington.edu, etzioni@cs.washington.edu

Abstract

Robot-generated Web indices such as AltaVista are comprehensive but imprecise; manually generated directories such as Yahoo! are precise but cannot keep up with large, rapidly growing categories such as personal homepages or news stories on the American economy. Thus, if a user is searching for a particular page that is not catalogued in a directory, she is forced to query a Web index and manually sift through a large number of responses. Furthermore, if the page is not yet indexed, then the user is stymied. This paper presents Dynamic Reference Sifting -- a novel architecture that attempts to provide both maximally comprehensive coverage and highly precise responses in real time, for specific page categories.
To demonstrate our approach, we describe Ahoy! The Homepage Finder (http://www.cs.washington.edu/research/ahoy), a fielded Web service that embodies Dynamic Reference Sifting for the domain of personal homepages. Given a person's name and institution, Ahoy! filters the output of multiple Web indices to extract one or two references that are most likely to point to the person's homepage. If it finds no likely candidates, Ahoy! uses knowledge of homepage placement conventions, which it has accumulated from previous experience, to "guess" the URL for the desired homepage. The search process takes 9 seconds on average. On 74% of queries from our primary test sample, Ahoy! finds the target homepage and ranks it as the top reference; 9% of the targets are found by guessing the URL. In comparison, AltaVista can find 58% of the targets and ranks only 23% of these as the top reference.

1. Introduction

Information sources that are both comprehensive and precise (i.e., point exclusively to relevant Web pages) are a holy grail for Web information retrieval. Manually generated directories such as Yahoo! [Yahoo!, 1994] or The Web Developer's Virtual Library [CyberWeb, 1997] can be both precise and comprehensive, but only for categories that are relatively small and static (e.g., British Universities); a human editorial staff cannot keep up with large, rapidly growing categories (e.g., personal homepages, news stories on the American economy, academic papers on a topic, etc.). Thus, if a user is searching for a particular page that has yet to be cataloged in a directory, she is forced to query robot-generated Web indices such as AltaVista [Digital Equipment Corporation, 1995] or Lycos [Lycos, 1995]. These automatically generated indices are more comprehensive, but their output is notoriously imprecise. As a result, the user is forced to sift manually through a large number of Web index responses to find the desired reference.

Even such a laborious approach may not work. The automatically generated indices are not completely comprehensive [Selberg and Etzioni, 1995] for three reasons. First, each index has its own strategy for selecting which pages to include and which to ignore. Second, some time passes before recently minted pages are pointed to and subsequently indexed. Third, as the Web continues to grow, automatic indexers begin to reach their resource limitations.

This paper introduces a new architecture for information retrieval tools designed to address the above problems. We call this architecture Dynamic Reference Sifting (DRS). It includes the following key elements:

Reference source: A comprehensive source of candidate references (e.g., a Web index such as AltaVista).
Cross filter: A component that filters candidate references based on information from a second, orthogonal information source (e.g., a database of e-mail addresses).
Heuristic-based Filter: A component that increases precision by analyzing the candidates' textual content using domain-specific heuristics.
Buckets: A component that categorizes candidate references into ranked and labeled buckets of matches and near misses.
URL Generator: A component that synthesizes candidate URLs when steps 1 through 4 fail to yield viable candidates.
URL Pattern Extractor: A mechanism for learning about the general patterns found in URLs based on previous, successful searches. The patterns are used by the URL Generator.

DRS is by no means appropriate for all Web searches. It works best for classes of pages with the following characteristics:

They are too large or too dynamic to be exhaustively indexed by hand.
Their members are easily identifiable as belonging to the class.
Many, but not necessarily all, of their members are available using a more traditional reference source (e.g., keyword queries to AltaVista).
During a given search, a user is interested in very few, and often only one, members of the class, and the user can pose a query specific enough to exclude other members.

Examples of these classes include individuals' homepages; popular articles or academic papers on a single topic; product reviews; price lists; and transportation schedules.

Our fundamental claim is that for these classes of pages, DRS offers significant advantages over the currently popular approaches typified by Yahoo! and AltaVista. To support this claim we developed a DRS search tool for the personal homepage category, which we call Ahoy! The Homepage Finder. Ahoy! was first tested on the Web in February, 1996. The most recent version was deployed in July, 1996 and now fields over 2,000 queries per day.

The remainder of this paper is organized as follows: Sections 2 and 3 contrast current methods of finding personal homepages with the DRS approach to this problem. Section 4 evaluates the performance of DRS techniques in the homepage domain. We describe related work in Section 5, discuss future work in Section 6, and conclude in Section 7.

2. Current Methods of Finding Homepages

Many Web users have established personal homepages that contain information such as their address, phone numbers, schedules, directions for visitors, and so on. Unfortunately, homepages can be difficult to find. Most people use one of three methods.

Method 1: Directories.

Some Web services such as Yahoo! have attempted to create directories of homepages by relying on users to register their own pages, but such efforts have failed so far. As of November 1996, Yahoo! contains about 50,000 personal homepages. It is difficult to say how many personal homepages are on the Web, but it is clear that Yahoo!'s list represents only a small fraction of the total. For example, it contains only one percent of the roughly 30,000 personal homepages created by Netcom subscribers, and it contains between one and ten percent of homepages in other samples used to test Ahoy!.

Method 2: General-Purpose Indices.

AltaVista, Hotbot [Hotbot, 1996], and other general-purpose indices make query syntax available that is tuned to find people. This approach to finding personal homepages avoids the problems of manually creating a list, but the output of such searches frequently contains an inconveniently large number of references. For example, searching AltaVista for one of our authors using the query "Oren NEAR Etzioni" returns about 400 references. A similar search using Hotbot produces over 800 matches. A separate problem is that many search tool users do not bother to learn the specialized query syntax and thus request an even less precise search.

Method 3: Manual Search.

When you know enough about a person, you can find his homepage by first finding the Web site of the person's institution, then possibly searching down to the person's department, and finally locating a list or index of homepages for people at that site. Unfortunately, this method can be slow. If, for example, you were looking for a biologist named Peter Underhill at Stanford University, you might spend several minutes looking through Web pages of dozens of departments that might reasonably employ a biologist.

3. Using DRS to Find Homepages

**Figure 1. Ahoy!'s Web interface (simplified).**
Search for a Personal Homepage Given Name Family Name Organization (optional) Email address (optional) Country (optional)

Ahoy! represents a fourth approach to finding people's homepages. (Figure 1 shows the fields in Ahoy!'s Web interface.) We believe that Ahoy!'s DRS architecture makes it the most effective tool currently available on the Web for this task. Ahoy! combines the advantages of manually-generated directories -- their relevance and reliability -- with the advantage of general-purpose search engines like AltaVista -- their enormous pool of indexed pages. In fact, due to the DRS URL Generator, Ahoy! is able to find and return homepages that are not listed in any search index. Finally, Ahoy! provides the advantage of speed: when it returns a negative result (i.e., it reports that it cannot find a given homepage), it can save its user from scanning through tens or hundreds of "falsely positive" references returned by a general-purpose search engine. In addition, Ahoy! returns a result much faster than a manual search would.

The general design of Ahoy! is shown in Figure 2. Although the behavior of each component is specific to the homepage domain, the general structure of the system could be applied to other domains as well.

Figure 2. Architecture of Ahoy!

To find the search target, a DRS system begins by forwarding the user's input to a reference source and to other, information-providing sources whose output is orthogonal to that of the reference source. In the case of Ahoy!, user input includes the name of a person and, optionally, other descriptors such as his institution and country. Ahoy!'s reference source is the MetaCrawler parallel Web search service [Selberg and Etzioni, 1995]; Ahoy! gives the person's name to MetaCrawler and receives a long list of candidate Web pages in return. It simultaneously submits the name to two e-mail directory services [WhoWhere?], [IAF], and determines the URL of the target person's institution using an internal database, if the user has completed the appropriate input field. These combined sources of information serve as the input to the next step, filtering.

In the filtering step, a DRS system uses two types of filters to sift out irrelevant references and rank the remaining ones: cross-filtering and heuristic-based filtering. In the homepage domain, cross-filtering helps Ahoy! reject references based on information about the target person's institution and e-mail address. Heuristic-based filtering uses heuristics that deal with people's names and the way most homepages are structured.

Institutional Cross-Filter. If the user specifies the name of the target person's institution, Ahoy! attempts to determine the institution's URL by querying an internal database. Using this information, many irrelevant references can be discarded. The institutional cross-filter helps distinguish, for example, between pages related to the Eric Anderson who is at the University of Washington and those related to the Eric Anderson at Stanford. The filter tolerates ambiguous or incorrect information; for example, if the target person's institution is input as "UW," the filter favors URLs located at University of Washington, University of Wisconsin, University of Waterloo, and so on. Ahoy!'s ability to analyze URLs in a different way than it analyzes a page's text is one of the keys to its high precision. When Ahoy! searches for pages from our "Researchers sample" (described in Section 4), providing a correct institution enables Ahoy! to return 74% of the target pages as the top-ranked reference.
E-mail Cross-Filter. The results of searching e-mail directories can provide clues about the target person's login name. E-mail addresses are especially useful when someone's login name is radically different from her real name.
Name Database and Heuristics. Ahoy!'s filters contain rules about the various forms that names take, plus a database of common nicknames. This knowledge allows a search for Eric (sic) Selberg to find the homepage for Erik Selberg. It also accepts variations such as "Erik 'Speed' Selberg" and "Selberg, Erik", if encountered within the title or text of a Web page.
Homepage Heuristics. Ahoy!'s filters look for patterns found in typical homepages. During a search for Nancy Leveson, this filter would use pattern matching to prefer a page called "About Nancy Leveson" over one called "UVC Index of Nancy Leveson Videos." Ahoy!'s decision whether to categorize a given reference as a homepage is based entirely on the reference's title, URL, and a short textual extract (if the latter is supplied by the search engine). We conjecture that it is unnecessary to download the entire text of a reference or use natural language processing to assist categorization. This design decision also helps keep Ahoy!'s mean search time to 9 seconds, most of which is spent waiting for data from the reference source.

After filtering out obviously incorrect references, the DRS system ranks the remaining references into classes or categories called buckets. For the homepage domain, each bucket has three attributes that describe

the degree of match between the target person's name and the most similar name in the reference,
whether the reference's URL matches the specified institution and country, and
whether the page appears to be a homepage.

Clearly, a perfect match on all three attributes is best; the buckets corresponding to partial matches are ranked using domain-specific criteria that are somewhat subjective. Most important, the buckets are labeled for easy perusal. For example, in response to the query "Eric Anderson at M.I.T.," which returns no perfect matches, we present a bucket labeled "Pages bearing the right name at another institution."

By default, a DRS provides the user with the contents of only the most relevant buckets. In the majority of homepage searches, these relevant buckets contain one or two references. In searches where the system finds several pages that are equally likely to be relevant, however, it displays them all. If the system makes an incorrect ranking or users are curious, the contents of undisplayed, non-empty buckets are and available for viewing via hypertext link.

3.1 Pattern Extraction and Generation.

A key feature of the DRS architecture is its URL Pattern Extractor, which extracts general patterns from URLs found in successful searches, and records them for future reference. The URL Generator relies on the extracted patterns to automatically synthesize URLs when a search of its references source fails. We now describe how Ahoy! implements these features.

After Ahoy! filters and buckets references, there are two possibilities:

1) Success -- After filtering, references remain in one or more buckets that we consider highly relevant to the target. Alternatively,

2) Partial Success or Failure -- Either all references have been placed in relatively low-ranked buckets, or they have all been filtered out.

We examine each possibility separately below.

Success. If one or more references returned by MetaCrawler are placed in a highly-ranked bucket, Ahoy! declares success and writes the references immediately to user output. Experience with Ahoy! indicates that roughly half of queries fall under this category.

Ahoy! learns from this type of success: a URL Pattern Extractor analyzes the references that Ahoy! considers to be successful and extracts knowledge about their URLs. Each success is used as a positive learning example in the form of a triplet [name, institution, URL]. From such an example, Ahoy! learns 1) the exact identity of a server at some institution, and 2) the pathname of a homepage on that server. Ahoy! takes note of any substrings within the pathname that correspond to the person's name. The two pieces of knowledge are added to Ahoy!'s URL Pattern Database and may be used by the URL Generator in subsequent searches. We show an example of such use below.

3.2 Partial Success or Failure.

In less successful searches, the cross-filters and heuristic-based filters may remove all references from MetaCrawler output ("Failure"), or the bucketing algorithm may not place any references in a promising bucket ("Partial Success"). Under these circumstances, control passes to the URL Generator. This module speculatively generates URLs that, if they turn out to exist, are likely to be the target homepage.

For example, if Rob Jacob is at Tufts but his homepage cannot be successfully found from the references returned by MetaCrawler, then Ahoy! consults its URL Pattern Database. During previous searches, the URL Pattern Extractor recorded the existence of servers at Tufts called www.cs.tufts.edu and emerald.tufts.edu. It also noted that www.cs.tufts.edu contained a homepage under the pathname ~[username] and that emerald.tufts.edu contained one under people/[username]/. These two observations, combined with an informed guess as to what Rob Jacob's username might be, are sufficient to generate a set of speculative URLs for his homepage. If the e-mail services have been successful, the username can be extracted from an e-mail address. Otherwise the URL Generator guesses a series of usernames for Rob Jacobs using common username patterns (e.g., rjacob, robjacob, rob, etc.), combines them with server and pathname information, and attempts to retrieve pages from:


       www.cs.tufts.edu/~rjacob/ 
       www.cs.tufts.edu/~robjacob/ 
       www.cs.tufts.edu/~rob/ 
       www.cs.tufts.edu/~rj/ 
       www.cs.tufts.edu/~robj/ 
       www.cs.tufts.edu/~jacob/ 
       emerald.tufts.edu/people/rjacob/ 
       emerald.tufts.edu/people/robjacob/ 
       emerald.tufts.edu/people/rob/ 
       emerald.tufts.edu/people/rj/ 
       emerald.tufts.edu/people/robj/ 
       emerald.tufts.edu/people/jacob/

Any generated references that do in fact exist are passed through the filters, bucketed, and presented to the user if appropriate. We discuss the effectiveness of the URL Pattern Extractor and Generator in Section 4, Evaluation.

4. Evaluation of Ahoy!

Judging the performance of a Web search tool is difficult. However, any evaluation of an information retrieval technique is based in part on a measure of its precision and recall. Precision measures the ratio of relevant to irrelevant information returned during a search; recall measures the ratio of relevant information returned to relevant information in existence. Typically, efforts to increase recall have come at the expense of precision, and vice-versa.

Luckily, aspects of the DRS approach provide hope that in specific domains, high precision and high recall are not mutually exclusive. The domain-specific filtering and bucketing strategy can dramatically improve precision. At the same time, successfully generating URLs that are unavailable in standard Web indices can improve recall. The manner in which URLs are generated and tested makes it likely that generated references will be precise.

To demonstrate the success of Ahoy! in these areas, we present experiments that separately address precision and recall. We show that the URL-generating mechanism contributes significantly to recall. Finally, we present statistics that characterize the performance of the Ahoy! service as it is used by the Web community.

4.1 Notes about the experiments

To evaluate Ahoy!'s relative performance, we ask it and four other search services -- MetaCrawler, AltaVista, Hotbot, and Yahoo! -- to find a set of known homepages. MetaCrawler is an appropriate service for comparison because Ahoy! uses MetaCrawler as its reference source. Yahoo! represents the approach of hand-compiling a directory of personal homepages. We include AltaVista and Hotbot because their indices cover a large fraction of the Web, and each have query methods tailored to locating people.

Ideally, the sample of test queries would be taken from actual user queries to the fielded Ahoy! system. Unfortunately, sampling user queries is impractical. In many cases, it is difficult to determine objectively whether an obscure person mentioned by a query has a homepage or not. Users are also fond of searching for the homepages of celebrities: although the concept "celebrity" is difficult to define precisely, informal log analysis reveals that five to ten percent of searches are for celebrities. Such searches pose an unsolved testing challenge: which page(s) of the dozens devoted to an Oprah Winfrey should be considered her correct homepage(s)? Because we cannot determine "correct" answers to actual queries, we cannot judge Ahoy!'s performance on a representative sample of queries. (In Table 2, however, we present statistics that partially characterize Ahoy!'s performance on actual user queries.)

Instead of using usage logs as a source of test queries, we use two Web sites that list people along with the URLs of their homepages. One site, David Aha's list of Machine Learning and Case-Based Reasoning Home Pages, contains 582 valid URLs. We call these homepages the "Researchers sample." Another site, the netAddress Book of Transportation Professionals, contains 53 URLs for individuals' homepages. We call these homepages the "Transportation sample." We have chosen two independently-generated samples from disparate fields in order to evaluate Ahoy!'s scope and breadth.

As we test each competing service on the sample set, we use query syntax that maximizes the performance of the service in finding personal homepages. For example, if the target homepage belongs to John Cleary, we use AltaVista's "advanced syntax" to search for "John NEAR Cleary" and rank the results using the string "John Cleary." Queries to HotBot invoke its person-specific search option. In addition, in the tests for recall, each service returns a large number of references -- 200 each for AltaVista and HotBot, 20 for Yahoo!, and approximately 50 for MetaCrawler. Ahoy! typically returns one or two references.

Experiment I: Precision

Chart demonstrating precision Figure 3. Percent of homepage targets returned as the top-ranked reference by each search service. At this level of precision, Ahoy!'s performance is better than that of its nearest competitor by a factor of three.

The results in Figure 3 show that Ahoy!'s precision in the homepage domain is markedly stronger than that of general-purpose search services. The data demonstrate precision by showing the percent of test queries for which the target URL is at the top of the search results.

Another way of capturing precision is to calculate the average rank of the target URL within the output of successful searches (Table 1). A low rank is good, because that means the target reference will be displayed prominently near the top of the output. Conversely, a high rank suggests that irrelevant pages have been erroneously ranked higher than the target. A very high rank also lessens the likelihood that the user will bother to scroll down and find the target.

Average Rank
of Targets Standard Deviation
of Average Rank

Ahoy! 1.2 0.8

AltaVista 7.4 19.8

Hotbot 8.0 16.9

MetaCrawler 7.0 8.1

Yahoo! 1.3 0.5
Table 1. Precision of search for target homepages. The "Average Rank of Targets" column shows how many references a user will typically have to read through before finding the desired homepage, if the homepage is found. Ahoy! and Yahoo! place the average target page at the top of their output, while the other services tend to place the target page under several non-target references. The rank statistics are based on the Researchers sample. Statistics for the Transportation sample are comparable.

Under the average rank metric, Ahoy!'s precision is significantly higher than that of every search service except Yahoo!, a manually compiled directory of the Web. Yahoo!'s precision comes at the expense of recall, however, as demonstrated by Figure 4.

Experiment II: Recall

Chart demonstrating recall Figure 4. Percent of targets returned by each service. Despite returning only 1 or 2 references, Ahoy! returns a higher percentage of targets than its competitors. When Ahoy! returns a target not located by MetaCrawler, this reference comes from the URL Generator.

Figure 4 shows the number of targets found by various services, regardless of the targets' location in the output. On the Researchers sample, Ahoy! finds 9% more targets than its strongest competitor, MetaCrawler (This increase in recall is due to Ahoy!'s URL Generator). Furthermore, Ahoy!'s recall is significantly higher than that of AltaVista, Yahoo!, and Hotbot on both samples.

4.2 Contribution of the URL Generator.

Although the bulk of references returned by Ahoy! are extracted from its reference source, MetaCrawler, Ahoy! has higher recall than MetaCrawler due to its URL generator.

Data from the fielded Ahoy! system help to explain why the URL Generator is so effective. As of November, 1996, Ahoy!'s URL Pattern Extractor has learned 23,000 patterns from servers at 6,000 institutions. Although the number of institutions is small compared to the number of institutions with a Web presence, the institutions represent precisely those at which users have found homepages in past searches. In other words, information in the URL Pattern Database is skewed in a way that resembles the skew in Ahoy! user queries. As a result, the URL Generator has sufficient information to generate URLs for 34% of user queries in which the user provides a target institution. One or more of the URLs generated returns what Ahoy! considers to be the target homepage on 27% of attempts at generation. The significance of this 27% figure is remarkable when one considers that Ahoy! attempts generation only when it believes the target is missing from references returned by MetaCrawler. URL Generation is applied only to "tough" queries, and it finds pages that are unlikely to be found using any other method.

Experiment III: Cumulative Results

Searches over 3 months 53,000

Unique target names seen 42,000

Clients originating queries 20,000

Clients originating over 20 queries 200

Institutions in Ahoy!'s URL pattern DB 6000

Patterns in Ahoy!'s URL pattern DB 23,000

Searches for which user follows at least one link 26%

Searches considered "Unambiguously Successful" by Ahoy! 50%

Searches in which result is a generated URL unavailable from reference sources 2%

References provided in each search

Median 2

Mode 1

Mean 3.5

Standard Deviation 5.0

Search speed (seconds)

Median 9

Mode 6

Mean 13

Standard Deviation 23

Table 2. Cumulative Results. These data suggest that many people search with Ahoy! regularly. Users search for a wide variety of people at a wide range of institutions. In most cases, Ahoy! returns one or two references, but in some cases -- notably, searches for celebrities with dozens of "homepages" created by fans -- it returns many more. Data are based on Ahoy! operations during a 3-month period ending late October, 1996.

4.3 Discussion of Experiments

The design of experiments to compare the performance of different kinds of Web search tools is difficult. Below, we analyze our experiments and consider experimental biases both in favor and against Ahoy!.

The discrepancy between the small number of hits returned by Ahoy! and the larger number returned by the other services biases comparisons of recall against Ahoy!: the other services are, in effect, provided many more opportunities to "get the right answer". Furthermore, note that references displayed deep in the list are of questionable value; many users are unwilling to scroll past the first ten or twenty references returned. Thus, pure recall figures understate the value added by Ahoy!.

On the other hand, the experiments contain two biases in Ahoy!'s favor. First, Ahoy! has the advantage of being domain-specific. Second, queries to Ahoy! contain the name of the target institution, information not provided to the other services. Ahoy!'s domain-specific knowledge, used in the framework of its DRS architecture, is what makes it effective, so we cannot remove the first bias without changing the fundamental nature of Ahoy!. We might, however, remove the second bias by withholding institutional information from Ahoy! or by providing such information to the other services.

By withholding institutional information, we would prevent Ahoy! from using its full set of features, including the institutional cross-filter and the URL Generator. Even so, Ahoy!'s precision under these conditions remains over fifty percent higher than that of the other services. Without benefit of the URL Generator, Ahoy!'s recall is bounded by that of MetaCrawler, its reference source. However, both MetaCrawler and the "handicapped" Ahoy! have higher recall on both test samples than the other search services.

When we provide institutional information to the general-purpose search engines, the services' precision increases slightly, and their recall decreases significantly. Why does such additional information decrease performance? We speculate:

The name of the institution may not appear on the target homepage and thus may confuse the search engines' ranking algorithms.
The general query syntax appropriate for an institutional search often makes it impossible or impractical to take simultaneous advantage of a service's person-specific query syntax.
User-provided institutional information is often ambiguous or incorrect. Ahoy!'s institutional cross-filter is designed to be forgiving of imperfect institutional queries (a quality not demonstrated in this paper); general-purpose ranking algorithms do not have this luxury. For example, a user may input the query "Oren Etzioni at UW" to Ahoy!, but adding "UW" to an AltaVista query would only serve to confuse it.

Recall.
Precision.
Scope: What is the range of subjects about which the system may be queried?
Automation: How much human effort is required to create and maintain the system?
Time of Collection: Is the information gathered statically (in anticipation of queries) or dynamically (at the time of query)?

Services like AltaVista, Hotbot, Lycos [Mauldin and Leavitt, 1994], Infoseek [Randall, 1995] and OpenText [Open Text Corporation, 1995] use keyword search. In our five-dimensional categorization, such services are comprehensive, broad in scope, highly automated, but static and imprecise. To increase precision, some services use features like phrase searching, field-based searches, and case-sensitive searching. Features relevant to finding personal homepages include nearness operators in AltaVista, specialized "person-specific" queries in Hotbot, and automatic name recognition in Infoseek's Ultraseek [Infoseek, 1996]). Nevertheless, as our experimental results demonstrate, DRS can exceed both the precision and recall of such services in carefully circumscribed categories such as personal homepages. More efficient techniques of creating and distributing such indices have been explored by the Harvest Project [Brown et al, 1994], but improved precision and recall have not yet been demonstrated. The Harvest techniques have been used to build a demonstration search service, the WWW Home Pages Harvest Broker. The broker indexes only 110,000 homepages, few of which are personal homepages.

To increase precision, directories like Yahoo! and A2Z [Lycos A2Z, 1996] provide manually-created, categorized lists of information sources. Directories improve precision at the cost of reducing automation. This tradeoff is very effective for categories that are relatively small and static (e.g., British Universities) but a human editorial staff cannot keep up with large, rapidly growing categories (e.g., personal homepages, news stories on the American economy, academic papers on a topic, etc.). This paper measures Yahoo!'s success at maintaining a complete list of personal homepages. Similar performance is exhibited by other manually-created indices that are specialized on listing homepages, including People Pages, Net Citizens, Housernet, and many other sites.

E-mail address directory services are typically static lists, automatically generated by consulting a highly restricted set of information sources. Ahoy! uses two E-mail directories for cross-filtering: Internet Address Finder and WhoWhere?. Techniques used to recognize e-mail addresses (e.g., extracting author headers on USENET postings) rely on the addresses' highly standardized form. Such techniques are not sufficient to distinguish personal homepages, unfortunately, since personal homepage URLs and contents vary widely from site to site.

Some services resemble Ahoy! in that they locate information dynamically and provide information whose scope is limited to individual people. One such service, Netfind, [Schwartz, 1991] generates the information dynamically using the Internet protocols SMTP and finger as information sources. Similar to Netfind, Ahoy! dynamically gathers information about people, but the similarity ends there. Ahoy! uses a a radically different architecture and set of information sources. Furthermore, Netfind is restricted to e-mail addresses whereas Ahoy! locates homepages.

Unlike the techniques mentioned so far, Ahoy! uses aspects of machine learning to improve its performance over time. Many machine learning systems deployed on the Web ([Knoblock and Levy, 1995]) learn about their user's tastes or activities. Examples include WebWatcher [Armstrong et al, 1995] and Syskill & Webert [Pazzani, 1995]. In contrast, Ahoy! learns about the Web itself, pooling experience gained during multiple searches by multiple users. Ahoy! is also unlike many machine learning systems in that it performs well even without learned knowledge. As a result, it can attract users immediately and use their queries to bootstrap its automatically-generated knowledge base.

Ahoy!'s learned knowledge about the Web is restricted to patterns within URLs of static Web pages. Increasing amounts of Web information are only available from dynamic Web pages generated by CGI scripts; future work is necessary to build learning agents that automatically navigate dynamic Web pages. ShopBot [Doorenbos et al, 1996] is one example that automatically learns to extract information from various on-line vendors.

Ahoy! is a member of a large family of software agents investigated as part of the University of Washington's Internet Softbots project [Etzioni, 1993 and Etzioni and Weld, 1994]. Softbots (software robots) are intelligent agents that use software tools and services on a person's behalf. A person states what he wants accomplished and the softbot determines how and where to satisfy the request; the person is shielded from these details. In the case at hand, a person specifies an individual's name and affiliation --- the Ahoy! softbot is responsible for using Web resources (and filtering their output!) to locate the individual's homepage.

6. Future Work

In the homepage domain, we believe it is possible to gain better recall and faster speeds by further developing our URL pattern extraction and URL generation algorithms. For example, we would like to have Ahoy! "learn" the canonical form of login names at various institutions. Ahoy! might learn, for example, that the homepage URL of everyone at Brown University contains two or three of the person's initials. We plan to experiment with these ideas and to test Ahoy! more broadly against a wider variety of search engines, and using larger sets of test samples.

Finally, we would like to develop DRS systems for other domains to demonstrate the general nature of the DRS architecture and more clearly define its scope. To streamline this process, we are investigating partially automating the creation of a DRS system. In particular, we are exploring techniques that would enable a DRS to automatically create its heuristic-based filter by generalizing from examples.

7. Conclusions

The most popular Web search services fall into two categories: a Yahoo!-like, manually generated directory, or an AltaVista-like, automatically generated index. The former category is precise, and the latter is fairly comprehensive, but neither combines the two qualities. Dynamic Reference Sifting (DRS) represents a novel architecture for search that is highly effective for the classes of searches characterized in the introduction. By using domain-specific filtering and bucketing techniques, a DRS system can achieve all the precision of a manually-built directory. At the same time, automatic URL generation can make DRS even more comprehensive than an automatically compiled index. We have demonstrated the effectiveness of DRS using Ahoy!, a service that finds personal homepages, but we believe DRS is applicable to many other Web domains.

Acknowledgments

Ahoy! would not exist without help from many people, some of whom are mentioned below. Ed Lazowska provided the original idea to build a homepage-finding agent. David Salesin inspired us to filter the output of a general-purpose search engine. Erik Selberg provided red-carpet, back-door access to MetaCrawler. UW CSE computing staff provided the infrastructure to keep Ahoy! alive. Yahoo!, IAF, OKRA, WhoWhere, Christina DeMello, Chris Redmond, and Mike Conlon generously allowed use of their data. Ellen Spertus and our fellow softboticists helped test and improve early versions of Ahoy!. Keith Golden, Richard Golding, Nicholas Kushmerick, Tessa Lau, and others provided valuable comments on earlier drafts of this paper. This research was funded in part by Office of Naval Research grant 92-J-1946, by ARPA / Rome Labs grant F30602-95-1-0024, by a gift from Rockwell International Palo Alto Research, and by National Science Foundation grant IRI-9357772.

References

[Armstrong et al, 1995]: Robert Armstrong, Dayne Freitag, Thorsten Joachims, and Tom Mitchell. Web Watcher: A learning apprentice for the World Wide Web. In Proceedings of the AAAI Spring Symposium on Information Gathering from Heterogenous, Distributed Resources, Stanford, CA, March 1995.

Return to Top of Page
Return to Technical Papers Index