Introduction
The Design of the WebCrawler
Observations and Lessons
Future Work
Conclusion
References
In a large distributed hypertext system like the World-Wide Web, users find
resources by following hypertext links. As the size of the system increases,
users must traverse more and more links to find what they are looking
for, until precise navigation becomes impractical. The
WebCrawler
is a tool
that solves these problems by indexing and automatically navigating the Web.
It creates an index by an incomplete breadth-first traversal, then relies on an
automatic navigation mechanism to fill in the holes. The WebCrawler indexes
both document titles and document content using a vector space model. Users
can issue queries directly to the pre-computed index or to a search program
that explores new documents in real time. The database the WebCrawler builds
is available through a search page on the Web. Experience with this index
suggests that the WebCrawler is collecting enough information from the Web, and
that while the current query interface is adequate, more sophisticated
retrieval tools are necessary. Real-time searches, although not available to
the public, are fast and seem well targeted towards what the user wants.
Imagine trying to find a book in a library without a card catalog. It is not
an easy task! Finding specific documents in a distributed hypertext system
like the World-Wide Web can be just as difficult. To get from one document to
another, users follow links that identify the documents. A user who wants to
find a specific document in such a system must choose among the links, each
time following links that may take her further from her desired goal. In a
small, unchanging environment, it might be possible to design documents that
did not have these problems. But the World-Wide Web is decentralized, dynamic,
and diverse; navigation is difficult, and finding information can be a
challenge. This problem is called the resource discovery problem.
The WebCrawler is a tool that solves the resource discovery problem in the
specific context of the World-Wide Web. It provides a fast way of finding
resources by maintaining an index of the Web that can be queried for documents
about a specific topic. Because the Web is constantly changing and indexing is
done periodically, the WebCrawler includes a second searching component that
automatically navigates the Web on demand. The WebCrawler is best described as
a "Web robot." It accesses the Web one document at a time, making local
decisions about how best to proceed. In contrast to more centralized
approaches to indexing and resource discovery, such as
ALIWEB and
Harvest
[Koster, Bowman], the WebCrawler operates using only the infrastructure that
makes the Web work in the first place: the ability of clients to retrieve
documents from servers.
Although the WebCrawler is a single piece of software, it has two different
functions: building indexes of the Web and automatically navigating on demand.
For Internet users, these two solutions correspond to different ways of finding
information. Users can access the centrally maintained WebCrawler Index using
a Web browser like Mosaic, or they can run the WebCrawler client itself,
automatically searching the Web on their own.
As an experiment in Internet resource discovery, the WebCrawler Index is
available on the World-Wide Web. The index covers nearly 50,000 documents
distributed on over 9000 different servers, answers over 6000 queries per day
and is updated weekly. The success of this service shows that non-navigational
resource discovery is important and validates some of the design choices made
in the WebCrawler. Three of these choices stand out: content-based indexing to
provide a high-quality index, breadth-first searching to create a broad index,
and robot-like behavior to include as many servers as possible.
The first part of this paper describes the architecture of the WebCrawler
itself, and some of the tradeoffs made in its design. The second part of the
paper discusses some of the experience I have gained in operating the
WebCrawler Index.
The most important feature of the World-Wide Web today is its decentralized
nature. Anyone can add servers, documents, and hypertext links. For search
tools like the WebCrawler, such a dynamic organization means that discovering
new documents is an important piece of the design. For the WebCrawler itself,
"discovering documents" means learning their identities; on the Web, those
identities currently take the form of
Uniform Resource Locators (URLs). The
URLs can refer to nearly any kind of retrievable network resource; once the
WebCrawler has a URL it can decide whether the document is worth searching and
retrieve it if appropriate.
After retrieving a document, the WebCrawler performs three actions: it marks
the document as having been retrieved, deciphers any outbound links (href's),
and indexes the content of the document. All of these steps involve storing
information in a database.
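
As a rough illustration, the following Python sketch shows this per-document
step. The db and index objects are hypothetical stand-ins for the WebCrawler's
database and full-text indexer, not its actual interfaces.

    from html.parser import HTMLParser

    class LinkExtractor(HTMLParser):
        """Collect the href targets of anchor tags in an HTML document."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def process_document(url, html, db, index):
        """The three per-document actions described above."""
        db.mark_retrieved(url)             # hypothetical database call
        parser = LinkExtractor()
        parser.feed(html)
        db.store_links(url, parser.links)  # outbound links become future candidates
        index.add(url, html)               # full-text indexing of the content
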
Figure 1: Software components of the WebCrawler
The WebCrawler has a software component for each of the activities described
above; the components fit together as illustrated in Figure 1. In this
architecture, the search engine directs the WebCrawler's activities. It is
responsible for deciding which new documents to explore and for initiating
their retrieval. The database handles the persistent storage of the document
metadata, the links between documents, and the full-text index. The "agents"
are responsible for retrieving the documents from the network at the direction
of the search engine. Finally, the Query Server implements the query service
provided to the Internet. Each of these components is described in detail
below.
The WebCrawler discovers new documents by starting with a known set of
documents, examining the outbound links from them, following one of the links
that leads to a new document, and then repeating the whole process. Another
way to think of it is that the Web is a large directed graph and that the
WebCrawler is simply exploring the graph using a graph traversal algorithm.
Figure 2 shows an example of the Web as a graph. Imagine that the WebCrawler
has already visited document A on Server 1 and document E on Server 3 and is
now deciding which new documents to visit. Document A has links to documents
B, C, and E, while document E has links to documents D and F. The WebCrawler
will select one of documents B, C or D to visit next, based on the details of
the search it is executing.
The search engine is responsible not only for determining which documents to
visit, but which types of documents to visit. Files that the WebCrawler cannot
index, such as pictures, sounds, PostScript, or binary data, are not retrieved.
If they are erroneously retrieved, they are ignored during the indexing step.
This file type discrimination is applied during both kinds of searching. The
only difference between running the WebCrawler in indexing mode and running it
in real-time search mode is the discovery strategy employed by the search
engine.
Figure 2: The Web as a graph.
Indexing mode: In indexing mode, the goal is to build an index of as
much of the Web as possible. If the WebCrawler Index has enough space for
50,000 documents, which documents should those be? For a Web index, one
solution is that those documents should come from as many different servers as
possible. The WebCrawler takes the following approach: it uses a modified
breadth-first algorithm to ensure that every server has at least one document
represented in the index. The strategy is effective. The most frequent
feedback about the WebCrawler is that it has great coverage and that nearly
every server is represented!
In detail, a WebCrawler indexing run proceeds as follows: every time a document
on a new server is found, that server is placed on a list of servers to be
visited right away. Before any other documents are visited, a document on each
of the new servers is retrieved and indexed. When all known servers have been
visited, indexing proceeds sequentially through a list of all servers until a
new one is found, at which point the process repeats. While indexing, the
WebCrawler runs either for a certain amount of time, or until it has retrieved
some number of documents. In the current implementation, the WebCrawler builds
an index at the rate of about 1000 documents an hour on a 486-based PC running
NEXTSTEP.
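
The following Python sketch illustrates one way such a server-first, modified
breadth-first policy could be written. The retrieve and index_doc functions
are assumed stand-ins for the agents and the indexer, and the details differ
from the WebCrawler's actual implementation.

    from collections import deque
    from urllib.parse import urlsplit

    def server_of(url):
        return urlsplit(url).netloc

    def index_run(seed_urls, retrieve, index_doc, limit):
        """Give every newly discovered server one indexed document before
        returning to servers that are already represented in the index."""
        new_servers = deque()        # servers discovered but not yet visited
        pending = {}                 # server -> unvisited URLs on that server
        visited_servers = set()
        seen_urls = set()
        indexed = 0

        def note(url):
            if url in seen_urls:
                return
            seen_urls.add(url)
            server = server_of(url)
            if server not in pending and server not in visited_servers:
                new_servers.append(server)   # visit this server right away
            pending.setdefault(server, []).append(url)

        for url in seed_urls:
            note(url)

        while indexed < limit and pending:
            # New servers take priority; otherwise pick a document from any
            # server that still has unvisited URLs.
            server = new_servers.popleft() if new_servers else next(iter(pending))
            url = pending[server].pop()
            if not pending[server]:
                del pending[server]
            visited_servers.add(server)
            for link in retrieve(url):   # fetch the document; returns its links
                note(link)
            index_doc(url)               # add the document to the full-text index
            indexed += 1
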
Real-time search mode: In real-time search mode, where the goal is to
find documents that are most similar to a user's query, the WebCrawler uses a
different search algorithm. The intuition behind the algorithm is that
following links from documents that are similar to what the user wants is more
likely to lead to relevant documents than following any link from any document.
This intuition roughly captures the way people navigate the Web: they find a
document about a topic related to what they are looking for and follow links
from there.
To find an initial list of similar documents, the WebCrawler runs the user's
query against its index. A single document can also be used as a starting
point, but using the index is much more efficient. From the list, the most
relevant documents are noted, and any unexplored links from those documents are
followed. As new documents are retrieved, they are added to the index, and the
query is re-run. The results of the query are sorted by relevance, and new
documents near the top of the list become candidates for further exploration.
The process is iterated either until the WebCrawler has found enough similar
documents to satisfy the user or until a time limit is reached.
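
A minimal sketch of this best-first loop follows, assuming a hypothetical
index object whose query method returns documents (with url and links
attributes) sorted by relevance; the real WebCrawler's interfaces differ.

    import time

    def realtime_search(query, index, retrieve, enough=10, time_limit=60.0):
        """Expand links from the documents that currently score highest
        against the query, re-ranking as new pages join the index."""
        deadline = time.time() + time_limit
        expanded = set()   # documents whose links have already been followed
        fetched = set()    # URLs retrieved during this search

        while time.time() < deadline:
            ranked = index.query(query)        # assumed: best matches first
            frontier = [d for d in ranked[:enough] if d.url not in expanded]
            if not frontier:
                break                          # nothing promising left to expand
            doc = frontier[0]
            expanded.add(doc.url)
            for url in doc.links:
                if url not in fetched:
                    fetched.add(url)
                    index.add(url, retrieve(url))   # new documents compete in
                                                    # the next re-ranking
        return index.query(query)[:enough]
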
One problem with this approach is that the WebCrawler blindly follows links
from documents, possibly leading it down an irrelevant path. For instance, if
the WebCrawler is searching for pages about molecular biology it may come
across a page that has links to both molecular biology resources and unrelated
network resources. When people navigate, they choose links based on the anchor
text, the words that describe a link to another document, and tend to follow a
directed path to their destination. Ideally, the WebCrawler should choose
among several of these links, preferring the one that makes the most sense.
Although the WebCrawler's reasoning ability is somewhat less than that of a
human, it does have a basis for evaluating each link: the similarity of the
anchor text to the user's query. To evaluate this similarity, the WebCrawler
makes a small full-text index from the anchor text in a document and applies
the user's query to select the most relevant link. Searching the anchor text in
this way works well, but anchor texts are usually short and full-text indexing
does not work as well as it could. More sophisticated full-text tools would
help greatly, particularly the application of a thesaurus to expand the anchor
text.
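
As a simplified illustration, the sketch below scores each candidate link's
anchor text against the query by plain word overlap; the WebCrawler's actual
mechanism is the small full-text index over anchor text described above.

    def pick_link(query, anchors):
        """Choose the outbound link whose anchor text best matches the query.
        `anchors` maps each candidate URL to its anchor text."""
        query_words = set(query.lower().split())

        def score(text):
            words = text.lower().split()
            return sum(1 for w in words if w in query_words) / (len(words) or 1)

        return max(anchors, key=lambda url: score(anchors[url]))

    # For a query about molecular biology, the first link wins:
    print(pick_link("molecular biology",
                    {"/bio.html": "Molecular biology resources",
                     "/misc.html": "Other network resources"}))
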
The idea of finding documents by following links from other, similar ones was
first demonstrated in the
Fish search [DeBra]. The WebCrawler extends that concept to
initiate the search using the index, and to follow links in an intelligent
order.
To actually retrieve documents from the Web, the search engine invokes
"agents." The interface to an agent is simple: "retrieve this URL." The
response from the agent to the search engine is either an object containing the
document content or an explanation of why the document could not be retrieved.
The agent uses the
CERN WWW library (libWWW), which gives it the ability to
access several types of content with several different protocols, including
HTTP, FTP and Gopher [Berners-Lee].
Since waiting for servers and the network is the bottleneck in searching,
agents run in separate processes and the WebCrawler employs up to 15 in
parallel. The search engine decides on a new URL, finds a free agent, and asks
the agent to retrieve that URL. When the agent responds, it is given new work
to do. As a practical matter, running agents in separate processes helps
isolate the main WebCrawler process from memory leaks and errors in the agent
and in libWWW.
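
A minimal sketch of this arrangement using separate worker processes is shown
below; the URLs are placeholders, and the retrieval code is deliberately much
simpler than an agent built on libWWW.

    from multiprocessing import Pool
    from urllib.request import urlopen

    def agent(url):
        """Agent interface: retrieve this URL. Returns the content, or an
        explanation of why the document could not be retrieved."""
        try:
            with urlopen(url, timeout=30) as response:
                return url, response.read(), None
        except Exception as error:
            return url, None, str(error)

    if __name__ == "__main__":
        urls = ["http://example.com/", "http://example.org/"]   # placeholders
        # Separate processes isolate the search engine from failures and
        # memory leaks in the retrieval code, as described above.
        with Pool(processes=15) as agents:
            for url, content, error in agents.imap_unordered(agent, urls):
                if error:
                    print(url, "failed:", error)
                else:
                    print(url, len(content), "bytes")
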
The WebCrawler's database consists of two separate pieces: a full-text
index and a representation of the Web as a graph. The database is stored on
disk, and is updated as documents are added. To protect the database from
system crashes, updates are made under the scope of transactions that are
committed every few hundred documents.
The full-text index is currently based on NEXTSTEP's IndexingKit [NeXT]. The
index is inverted to make queries fast: looking up a word produces a list of
pointers to documents that contain that word. More complex queries are handled
by combining the document lists for several words with conventional set
operations. The index uses a vector-space model for handling queries
[Salton].
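
The sketch below shows the basic structure of such an inverted index; it
omits the per-word weights that the vector-space model adds on top of these
document lists.

    from collections import defaultdict

    class InvertedIndex:
        """Each word maps to the set of documents that contain it."""
        def __init__(self):
            self.postings = defaultdict(set)   # word -> document ids

        def add(self, doc_id, words):
            for word in words:
                self.postings[word].add(doc_id)

        def lookup(self, word):
            return self.postings.get(word, set())

        def query_all(self, words):
            """Documents containing every query word (set intersection)."""
            sets = [self.lookup(w) for w in words]
            return set.intersection(*sets) if sets else set()

        def query_any(self, words):
            """Documents containing at least one query word (set union)."""
            return set().union(*(self.lookup(w) for w in words))
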
To prepare a document for indexing, a lexical analyzer breaks it down into a
stream of words that includes tokens from both the title and the body of the
document. The words are run through a "stop list" to prevent common words from
being indexed, and they are weighted by their frequency in the document divided
by their frequency in a reference domain. Words that appear frequently in the
document, and infrequently in the reference domain are weighted most highly,
while words that appear infrequently in either are given lower weights. This
type of weighting is commonly called peculiarity weighting.
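
A small sketch of this weighting scheme follows; the reference_freq table,
mapping words to their relative frequency in a reference collection, is an
assumed input rather than part of the WebCrawler itself.

    def term_weights(doc_words, reference_freq):
        """Weight = frequency in the document / frequency in a reference
        domain; words missing from the reference table are treated as rare."""
        counts = {}
        for word in doc_words:
            counts[word] = counts.get(word, 0) + 1
        total = len(doc_words)
        return {word: (count / total) / reference_freq.get(word, 1e-6)
                for word, count in counts.items()}
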
The other part of the database stores data about documents, links, and servers.
Entire URLs are not stored; instead, they are broken down into objects that
describe the server and the document. A link in a document is simply a pointer
to another document. Each type of object is stored in a separate btree on disk:
documents in one, servers in another, and links in the last. Separating the
data in this way allows the WebCrawler to scan the list of servers quickly to
select unexplored servers or the least recently accessed server.
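
As an illustration of this separation, the sketch below keeps servers,
documents, and links in separate in-memory tables standing in for the
on-disk btrees.

    from urllib.parse import urlsplit

    class WebDatabase:
        """Servers, documents, and links kept apart, so the server list can
        be scanned without touching any document or link records."""
        def __init__(self):
            self.servers = {}     # host -> {"explored": bool, ...}
            self.documents = {}   # (host, path) -> metadata
            self.links = {}       # (host, path) -> list of (host, path)

        def add_document(self, url, title=None):
            parts = urlsplit(url)
            key = (parts.netloc, parts.path)
            self.servers.setdefault(parts.netloc, {"explored": False})
            self.documents[key] = {"title": title, "retrieved": False}
            return key

        def unexplored_servers(self):
            return [host for host, info in self.servers.items()
                    if not info["explored"]]
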
The Query Server implements the WebCrawler Index, a search service available
via a document on the Web [Pinkerton]. It covers nearly 50,000 documents
distributed across 9000 different servers, and answers over 6000 queries each
day. The query model it presents is a simple vector-space query model on the
full-text database described above. Users enter keywords as their query, and
the titles and URLs of documents containing some or all of those words are
retrieved from the index and presented to the user as an ordered list sorted by
relevance. In this model, relevance is the sum, over all words in the query,
of the product of the word's weight in the document and its weight in the
query, all divided by the number of words in the query. Although simple, this
interface is powerful and can find related documents with ease. Some sample
query output is shown below in Figure 3.
Search results for the query "molecular biology human genome project":
1000 Guide to Molecular Biology Databases
0754 Guide to Molecular Biology Databases
0562 Biology Links
0469 bionet newsgroups
0412 Motif BioInformatics Server
0405 LBL Catalog of Research Abstracts (1993)
0329 Molecular Biology Databases
0324 GenomeNet WWW server
0317 DOE White Paper on Bio-Informatics
0292 Molecular biology
0277 Oxford University Molecular Biology Data Centre Home Page
0262 Biology Internet Connections
0244 Harvard Biological Laboratories - Genome Research
0223 About the GenomeNet
0207 Biology Index
Figure 3: Sample WebCrawler results. For brevity, only the first 15 of
50 results are shown. The numbers down the left side of the results indicate
the relevance of the document to the query, normalized to 1000 for the most
relevant document. At the top of the list, two documents have the same title.
These are actually two different documents, with different URLs and different
weights.
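
A sketch of the relevance measure described above, together with the
normalization to 1000 used in Figure 3, is shown below; the per-word weights
are assumed to come from the indexing step.

    def relevance(query_weights, doc_weights):
        """Sum over query words of the product of the word's weight in the
        document and in the query, divided by the number of query words."""
        if not query_weights:
            return 0.0
        total = sum(weight * doc_weights.get(word, 0.0)
                    for word, weight in query_weights.items())
        return total / len(query_weights)

    def normalize(scores):
        """Scale scores so the most relevant document is 1000, as in Figure 3."""
        top = max(scores.values(), default=0.0)
        return {doc: round(1000 * s / top) if top else 0
                for doc, s in scores.items()}
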
The time required for the Query Server to respond to a query averages about an
eighth of a second. That interval includes the time it takes to parse the
query, call the database server, perform the query, receive the answer, format
the results in HTML and return them to the HTTP server. Of that time, about
half is spent actually performing the query.
The WebCrawler has been active on the Web for the last six months. During that
time, it has indexed thousands of documents, and answered nearly a quarter of a
million queries issued by over 23,000 different users. This experience has put
me in touch with many users and taught me valuable lessons. Some of the more
important ones are summarized here. My conclusions are based on direct
interaction with users, and on the results of an on-line survey of WebCrawler
users that has been answered by about 500 people.
The first lesson I learned is that content-based indexing is necessary, and
works very well. At the time the WebCrawler was unleashed on the Web, only the
RBSE Spider
indexed documents by content [Eichmann]. Other indexes like the
Jumpstation and
WWWW did not
index the content of documents, relying instead on
the document title inside the document, and anchor text outside the document
[Fletcher, McBryan].
Indexing by titles is a problem: titles are an optional part of an HTML
document, and 20% of the documents that the WebCrawler visits do not have them.
Even if that figure is an overestimate of the missing titles in the network as
a whole, basing an index only on titles would omit a significant fraction of
documents from the index. Furthermore, titles don't always reflect the content
of a document.
By indexing titles and content, the WebCrawler captures more of what people
want to know. Most people who use the WebCrawler find what they are looking
for, although it takes them several tries. The most common request for new
features in the WebCrawler is for more advanced query functionality based on
the content index.
Information retrieval systems are usually measured in two ways:
precision and recall [Salton]. Precision measures how well the
retrieved documents match the query, while recall indicates what fraction of
the relevant documents are retrieved by the query. For example, if an index
contained ten documents, five of which were about dolphins, then a query for
`dolphin-safe tuna' that retrieved four documents about dolphins and two others
would have a precision of 0.66 and a recall of 0.80.
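
The arithmetic of that example can be checked directly:

    def precision_recall(retrieved, relevant):
        hits = len(retrieved & relevant)
        return hits / len(retrieved), hits / len(relevant)

    # Ten-document index, five relevant; six retrieved, four of them relevant.
    relevant = {"d1", "d2", "d3", "d4", "d5"}
    retrieved = {"d1", "d2", "d3", "d4", "x1", "x2"}
    print(precision_recall(retrieved, relevant))   # (0.666..., 0.8)
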
Recall is adequate in the WebCrawler, and in the other indexing systems
available on the Internet today: finding enough relevant documents is not the
problem. Instead, precision suffers because these systems give many false
positives. Documents returned in response to a keyword search need only
contain the requested keywords and may not be what the user is looking for.
Weighting the documents returned by a query helps, but does not completely
eliminate irrelevant documents.
Another factor limiting the precision of queries is that users do not submit
well-focused queries. In general, queries get more precise as more words are
added to them. Unfortunately, the average number of words in a query submitted
to the WebCrawler is 1.5, barely enough to narrow in on a precise set of
documents. I am currently investigating new ways to refine general searches
and to give users the ability to issue more precise queries.
The broad index built by the breadth-first search ensures that every server
with useful content has several pages represented in the index. This coverage
is important to users, as they can usually navigate within a server more easily
than navigating across servers. If a search tool identifies a server as having
some relevant information, users will probably be able to find what they are
looking for.
Using this strategy has another beneficial effect: indexing documents from one
server at a time spreads the indexing load among servers. In a run over the
entire Web, each server might see an access every few hours, at worst. This
load is hardly excessive, and for most servers is lost in the background of
everyday use. When the WebCrawler is run in more restricted domains (for
instance, here at the University of Washington), each server will see more
frequent accesses, but the load is still less than that of other search
strategies.
The main advantage of the WebCrawler is that it is able to operate on the
Web today. It uses only the protocols that make the Web operate and does not
require any more organizational infrastructure to be adopted by service
providers.
Still, Web robots are often criticized for being inefficient and for wasting
valuable Internet resources because of their uninformed, blind indexing.
Systems like ALIWEB and Harvest are better in two ways. First, they use less
network bandwidth than robots by locally indexing and transferring summaries
only once per server, instead of once per document. Second, they allow server
administrators to determine which documents are worth indexing, so that the
indexing tools need not bother with documents that are not useful.
Although Web robots are many times less efficient than they could be, the
indexes they build save users from following long chains before they find a
relevant document, thus saving the Internet bandwidth. For example, if the
WebCrawler indexes 40,000 documents and gets 5000 queries a day, and if each
query means the user will retrieve just three fewer documents than she
otherwise would have, then the index saves about 15,000 retrievals a day and
the WebCrawler's investment in bandwidth is paid back in about three days.
Another advantage of the document-by-document behavior of Web robots is that
they can apply any indexing algorithm because they have the exact content of
the document to work from. Systems that operate on summaries of the actual
content are dependent on the quality of those summaries, and to a lesser
extent, on the quality of their own indexing algorithm. One of the most
frequent requests for new WebCrawler features is a more complex query engine.
This requires a more complex indexing engine, for which original document
content may be necessary. Putting such indexing software on every server in
the world may not be feasible.
To help robots avoid indexing useless or unintelligible documents, Martijn
Koster has proposed a
standard that helps guide robots
in their choice of documents
[Koster2]. Server administrators can advise robots by specifying
which documents are worth indexing in a special "robots.txt" document on their
server. This kind of advice is valuable to Web robots, and increases the
quality of their indexes. In fact, some administrators have gone so far as to
create special overview pages for Web robots to retrieve and include.
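
A present-day illustration of honoring this convention uses Python's standard
robots.txt parser; the host and path here are placeholders, not actual
WebCrawler targets.

    from urllib.robotparser import RobotFileParser

    robots = RobotFileParser()
    robots.set_url("http://example.com/robots.txt")   # placeholder host
    robots.read()                                     # fetch the server's advice

    url = "http://example.com/private/report.html"
    if robots.can_fetch("WebCrawler", url):
        print("OK to retrieve and index", url)
    else:
        print("The administrator has asked robots to skip", url)
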
The future of client-initiated searching, like running the WebCrawler in
on-demand mode, is still in doubt because of the load it places on the network.
It is one thing for a few robots to be building indexes and running
experiments; it is quite another for tens of thousands of users to be
unleashing multiple agents under robotic control. Fortunately, most robot
writers seem to know this and have declined to release their software
publicly.
The key difference between running a private robot and running an
index-building one is that the results of the private robot are only presented
to one person. They are not stored and made available for others, so the cost
of building them is not amortized across many queries.
In the short term, the WebCrawler will evolve in three ways. First, I am
exploring better ways to do client-based searching. This work centers on more
efficient ways to determine which documents are worth retrieving and strategies
that allow clients to perform searches themselves and send the results of their
explorations back to the main WebCrawler index. The rationale behind the
second idea is that if the results of these explorations are shared, then
searching by clients may be more feasible.
Second, I am exploring possibilities for including a more powerful full-text
engine in the WebCrawler. Several new features are desirable; the two most
important are proximity searching and query expansion and refinement using a
thesaurus. Proximity searching allows several words to be considered as a
phrase, instead of as independent keywords. Using a thesaurus is one way to
help target short, broad queries: a query for biology may match many documents
that merely contain the word, but when expanded by a thesaurus to include
related terms, it will more precisely identify documents that are actually
about biology. Finally, I plan on working with the Harvest group to see if the
WebCrawler can fit into their system as a broker [Bowman].
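
A small sketch of the thesaurus-based expansion idea follows; the thesaurus
itself is an assumed input, not something the WebCrawler currently has.

    def expand_query(query, thesaurus):
        """Add related terms from a thesaurus to a short, broad query."""
        words = query.lower().split()
        expanded = list(words)
        for word in words:
            expanded.extend(thesaurus.get(word, []))
        return " ".join(dict.fromkeys(expanded))   # de-duplicate, keep order

    # With a toy thesaurus:
    print(expand_query("biology", {"biology": ["genetics", "molecular", "genome"]}))
    # -> "biology genetics molecular genome"
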
The WebCrawler is a useful Web searching tool. It does not place an
undue burden on individual servers while building its index. In fact, the
frequent use of its index by thousands of people saves the Internet bandwidth.
The real-time search component of the WebCrawler is effective at finding
relevant documents quickly, but it has the aforementioned problem of not
scaling well given the limited bandwidth of the Internet.
In building the WebCrawler, I have learned several lessons. First, full-text
indexing is worth the extra time and space it takes to create and maintain the
index. In a Web robot, there is no additional load network load imposed by
full-text index; the load occurs only at the server. Second, when large
numbers of diverse documents are present together in an index, achieving good
query precision is difficult. Third, using a breadth-first search strategy is
a good way to build a Web-wide index.
The most important lesson I have learned from the WebCrawler and other robots
like it is that they are completely consistent with the character of the
decentralized Web. They require no centralized decision making and no
participation from individual site administrators, other than their compliance
with the protocols that make the Web operate in the first place. This
separation of mechanism and policy allows each robot to pick its policy and
offer its own unique features to users. With several choices available, users
are more likely to find a search tool that matches their immediate needs, and
the technology will ultimately mature more quickly.
I would like to thank
Dan Weld for his support and many insightful
discussions. I would also like to thank the thousands of WebCrawler users on
the Internet: they have provided invaluable feedback, and put up with many
problems in the server. I'm grateful to Tad Pinkerton and Rhea Tombropoulos
for their editing assistance.
[Bowman] Bowman, C. M. et al. The Harvest Information Discovery and
Access System. Second International World-Wide Web Conference, Chicago,
1994.
[Berners-Lee] Berners-Lee, T., Groff, J., Frystyk, H. Status of the World-Wide
Web Library of Common Code.
[DeBra] DeBra, P. and Post, R. Information Retrieval in the World-Wide Web:
Making Client-based Searching Feasible. First International World-Wide Web
Conference, R. Cailliau, O. Nierstrasz, M. Ruggier (eds.), Geneva, 1994.
[Eichmann] Eichmann, D. The RBSE Spider: Balancing Effective Search Against
Web Load.
[Fletcher] Fletcher, J. The Jumpstation.
[McBryan] McBryan, O. GENVL and WWWW: Tools for Taming the Web. First
International World-Wide Web Conference, R. Cailliau, O. Nierstrasz,
M. Ruggier (eds.), Geneva, 1994.
[Koster] Koster, M. ALIWEB: Archie-Like Indexing in the Web.
[Koster2] Koster, M. Guidelines for Robot Writers.
[NeXT] NeXT Computer, Inc. NEXTSTEP General Reference, Volume 2.
Addison-Wesley, 1992.
[Pinkerton] Pinkerton, C. B. The WebCrawler Index.
[Salton] Salton, G. Automatic Text Processing: The Transformation, Analysis,
and Retrieval of Information by Computer. Addison-Wesley, 1989.
Brian Pinkerton
is a Ph.D. student in the Department of Computer Science and
Engineering at the University of Washington in Seattle. He is the author and
operator of the WebCrawler, a popular Internet search tool. His professional
interests include network information discovery, information retrieval, and
applications of computer science to molecular biology. He recently returned
from a four-year leave of absence, during which he worked as a software
engineer at NeXT Computer.