Lessons Learned Implementing a
    Navigation Server for the Web

    David Glazer
    Senior Architect
    Verity Inc.

    Abstract

      The World-Wide Web provides an excellent solution to the problem of making information available to a large user community, including users who have very little familiarity with the information being provided. This solution has created a new need - tools for helping users find their way through the plethora of available content.

      The problem of information navigation is discussed. Some available navigation technologies, both on and off the Web, are surveyed. The issues are looked at from both a publishing point of view ("How can I make my information useful to the world?") and a consumer point of view ("How can I access information useful to me?").

      A case study of a navigation-centric server is presented. Highlights of the development process are given from requirements definition, through architecture design and implementation, all the way to user reaction and feedback.


    Information Navigation

    Information navigation is all about making connections between people and information. The more raw data people have access to, the more important the quality of their navigation tools becomes. Navigation, done right, tames the seas of data into useful sources of information.

    Navigation Systems

    We have found it useful to break the information navigation problem down as follows. Navigation is required when a person has an information need. Navigation systems satisfy that need by combining three types of tools.

    • tools to provide easy access to the right raw data
    • tools to transform the data into relevant information by filtering out noise and adding organizational structure
    • tools to deliver the information to the consumer

    An example of a navigation system built on manual technology is the daily newspaper. Newspapers have access to many sources of data, including externally provided services and internally directed reporters. They create relevant information by employing skilled content experts - editors - who both filter and organize the raw data. They deliver information by laying it out in an accessible format, putting it on paper, and using fleets of trucks and bicycles to move that paper to the consumers' doorsteps.

    As with all connection and communication technologies, the value of a navigation system depends on the number of possible connections it can provide. A system that can connect one person to one data source intelligently is valuable, just as a fax machine that could only connect one field office to one home office would be valuable. Every new fax machine installed would make the existing fax machines more valuable. Similarly, every new data source handled, and every new delivery method supported, makes an existing navigation system more valuable.

    This self-reinforcing dynamic means that developers of navigation systems are always on the lookout for ways to extend their reach. The World-Wide Web does an excellent job of giving many people easy access to huge amounts of data. In terms of the above three categories, it provides both an important new source of data and an important new means of information delivery. It therefore creates an opportunity to make existing navigation systems even more valuable.

    Computer-based Navigation Tools

    • Data access tools

      In order for a computer to work with data, it must first be able to physically access the data. The providers of disk and network technology continue to put more and more megabytes within reach of the machines on our desks.

      More than simple physical access is required, however. Data comes in an ever-increasing number of formats. Tools are needed to translate from one format to another, to extract the meaningful fields from structured repositories, to maintain proper handshaking with providers of feeds such as newswires, and in general to act as interpreters on the Tower of Babel construction site that our computers have become.

      Tools that provide both physical and logical access to data are required for a navigation system to work at all. For it to work well, the tools must be invisible. They should remove the artificial barriers between different types of data. Information-hungry users should be able to focus on the content being provided and not have to think about the mechanical details of transport or representation.

    • Filtering and organization tools

      Once data is accessible, filtering tools are needed to separate the wheat from the chaff. The job of these tools is to predict how well each particular piece of data would satisfy a user's information need. Note that these tools can be manual, providing the user with a summary of the data (e.g. its filename and file size) but leaving the actual relevance decision to a person.

      The relevance decision can also be made automatically. Software-based relevance calculation tools demonstrate the familiar strengths and weaknesses of computers. It's relatively easy to make them do a wonderful job of automating drudgery (e.g. "find all the files in this directory that contain the word 'aardvark'"), and much more difficult to get them to replace common sense (e.g. "let me know when something exciting gets posted").
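      The kind of drudgery that computers automate well can be seen in a short sketch. The following function (name and approach are illustrative, not any particular product's code) performs the 'aardvark' task from the example above:

```python
import os

def files_containing(directory, word):
    """Return the names of the files in `directory` whose text
    contains `word` -- the 'find all files containing aardvark'
    task, automated."""
    matches = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        # Read each file's text, skipping undecodable bytes.
        with open(path, errors="ignore") as f:
            if word in f.read():
                matches.append(name)
    return matches
```

      The "common sense" tasks mentioned above resist this treatment precisely because they cannot be reduced to a mechanical test like the one in the loop.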

      Different tools work well with different amounts of data. If you only have a few megabytes of data to filter, the Unix string-search utility 'grep' does an acceptable job. However, its usefulness breaks down as the data volume increases. The following table gives an idea of the kind of tools available and the volumes at which they are useful.

      Data Volume        Simplest Useful Tool
         <10 Meg         grep
         <100 Meg        Boolean search; hierarchical directory trees
         <1 Gig          results ranking; manual keywording
         <10 Gig         true relevance ranking; automatic categorization
         <100 Gig        semantic libraries
         Terabytes       ???
      
    • Information delivery tools

      Once the data has been turned into relevant information, tools are needed to deliver it to the user. These tools include information servers, graphical user interfaces, and gateways to other applications such as email.

      To be most useful, these tools need to go to wherever the users are. Since many users use email, there are tools to deliver relevant information by email. Since many users use fax machines, there are tools to deliver information by fax. Since many users use large software environments like Lotus Notes and the World-Wide Web, there are tools to deliver information into those environments.
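
      Delivery by email is the simplest of these gateways to sketch. The function below packages a list of search results as a mail message (the addresses and subject line are illustrative placeholders, not a real deployment's):

```python
from email.message import EmailMessage

def build_digest(recipient, results):
    """Package a list of (title, url) search results as an email
    message, ready to hand off to an SMTP connection."""
    msg = EmailMessage()
    msg["To"] = recipient
    msg["From"] = "navigator@example.com"   # placeholder sender
    msg["Subject"] = "New documents matching your interest profile"
    body = "\n".join("%s\n  %s" % (title, url) for title, url in results)
    msg.set_content(body)
    return msg  # e.g. smtplib.SMTP(host).send_message(msg) to deliver
```

      Fax, Lotus Notes, and Web delivery each need their own such gateway; the filtering machinery behind them stays the same.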

    A Navigation Server for the Web

    We have been building navigation systems made up of data access, filtering and organization, and information delivery tools for several years. The rapidly increasing prominence of the World-Wide Web made it clear that we needed to add Web capabilities to our arsenal of tools. We set out to do so last spring.

    This section discusses some of the design decisions we were faced with. It describes the choices we made and the reactions our Alpha and Beta users had to those choices.

    • Web as data source vs. Web as delivery vehicle

      The Web provides both a huge new source of data and an easy-to-use, wide-area information distribution channel. We decided to take advantage of both capabilities.

      Some of our users have large existing Web sites, so our server comes with tools (such as an indexing spider) and features (such as automatically ignoring HTML tags during indexing) to make it easy to search those sites. Other users were new to the Web, but had large amounts of ASCII and word-processing data that they wanted to distribute to remote locations. Our server supports them with features such as the ability to translate the text from many standard document formats into HTML.

      For the most part, our customers like what we've done. The biggest request in this area is for access to more of the data in the existing Web documents. People want to be able to get at document meta-information (such as <TITLE> tags) in addition to the text contents of the documents.
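
      The two indexing features mentioned above can be sketched together. The function below strips HTML tags from a page and also pulls out the <TITLE> meta-information that users have been asking for; a regular expression is adequate for this illustration, though a production indexer would want a proper HTML parser:

```python
import re

def index_fields(html):
    """Separate a Web page into indexable text and meta-information:
    the <TITLE> contents and the tag-free body text."""
    match = re.search(r"<title>(.*?)</title>", html,
                      re.IGNORECASE | re.DOTALL)
    title = match.group(1).strip() if match else ""
    text = re.sub(r"<[^>]+>", " ", html)   # drop markup
    text = " ".join(text.split())          # collapse whitespace
    return {"title": title, "text": text}
```
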

    • Client vs. server

      It was tempting to work on both sides of the wire. Adding code to the client would have allowed us to build a much friendlier UI, especially in the area of query building - allowing the user to express an information need. It would also have allowed us to build a more responsive delivery mechanism by letting the server push new information to the client as it became available.

      However, the vast size of the installed base of clients, coupled with the advantages of conforming to existing standards, led us to a more limited but more deployable choice. We decided to write only server-side code, and to use only features that are supported by existing Web browsers. (We do require forms support on the browser.)

      The feedback is mostly positive - our customers like being able to use their existing browsers. However, we do get requests for features that can only be implemented with client-side code. We are currently investigating ways to add extensions to existing browsers for users who want these features.

    • CGI vs. standalone server vs. cooperating server

      Our first inclination was to implement our server using CGI. Customers would then be able to use the administration, access control, and logging features they were used to. We wouldn't have to deal with HTTP, either today or as it evolves. However, the performance implications of spawning a new process on every interaction were unattractive.

      Our next thought was to implement our own HTTP server, probably by modifying one of the publicly available ones. The problem with this approach is that it needlessly bundles capabilities. Users should be able to choose the server that provides normal browsing services independently from the server that provides search and navigation.

      Our decision was to implement a lightweight navigation-only HTTP server from scratch. It would typically run on the same machine as an 'ordinary' server such as httpd, but advertise on a different port. Links to the navigation server would be provided from the pages served by the ordinary server.
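
      The shape of such a navigation-only server can be sketched in a few lines. This is a modern illustrative mock-up, not our implementation: it answers only search requests, on a port of its own, and refuses everything else, leaving ordinary page serving to the regular httpd:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

class NavigationHandler(BaseHTTPRequestHandler):
    """Handles only /search requests; all other paths get a 404,
    since ordinary browsing belongs to the regular Web server."""
    def do_GET(self):
        url = urlparse(self.path)
        if url.path != "/search":
            self.send_error(404, "this server only handles searches")
            return
        query = parse_qs(url.query).get("q", [""])[0]
        body = ("<html><body><h1>Results for '%s'</h1></body></html>"
                % query).encode("ascii", "ignore")
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To run on a port distinct from the ordinary server's port 80:
#   HTTPServer(("", 8080), NavigationHandler).serve_forever()
```

      Links on the pages served by the ordinary server then simply point at the navigation server's port.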

      The early feedback is mixed. The independent server performs well, and users who are already running a Web server are happy to not have to reconfigure it. However, the need to administer two servers is unfortunate, especially for users who are new to the Web.

    • Stateless vs. stateful

      HTTP is, by design, a connectionless protocol. This has led most HTTP servers to be stateless. However, it is possible to implement a stateful server that is fully HTTP compliant. The advantages of statelessness are simplicity and robustness - if you don't remember anything from request to request, there's less that can go wrong. The disadvantage is functionality - certain features need server-side state to be implemented efficiently.

      We decided to explicitly maintain state on our server. We use it to keep track of user information such as what they're interested in, what they're allowed to see, and what they're in the middle of looking for. One particularly important use of state is in searches over large databases (i.e. on the order of a million documents or more). Our server can return the first segment of the results to users, allowing them to start using the information, while it continues to search the balance of the data for more information.
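
      The segmented-results idea can be sketched with a session table keyed by a per-user token. This is a minimal illustration of server-side state, under the assumption that each request carries its session token; the names and segment size are invented for the example:

```python
import itertools

# In-memory session table: one in-progress search per session token.
sessions = {}

def start_search(token, matching_documents, segment_size=2):
    """Begin a search; return the first segment of results at once
    while the remainder stays available under the session token."""
    results = iter(matching_documents)  # stands in for an ongoing search
    sessions[token] = results
    return list(itertools.islice(results, segment_size))

def next_segment(token, segment_size=2):
    """Return the next batch of results for an in-progress search."""
    results = sessions.get(token)
    if results is None:
        return []
    batch = list(itertools.islice(results, segment_size))
    if not batch:
        del sessions[token]             # search exhausted; drop the state
    return batch
```

      A stateless server would have to rerun the whole search on every request for the next page of results; the session token makes continuation cheap, at the cost of remembering something between requests.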

      While some of our users were initially skeptical about our use of state, most of them ended up liking the benefits. There were a number of requests for adding even more information (e.g. letting a user choose their favorite query form) to the state we keep on the server.

    We are continuing to gather feedback and refine the capabilities of our server. We expect to eventually offer a family of products to satisfy the varying needs of our users.

    Conclusion

    The popularity of the World-Wide Web provides a challenge and an opportunity for all researchers and developers working on information navigation systems. We were pleased to discover how valuable good navigation tools are on this new electronic frontier. However, there's a long way left to go. Currently available tools, including our navigation server, are only a good start. Many of the hard and interesting problems presented by the distributed, multimedia, multi-author, rapidly evolving world that is the Web have yet to be addressed.


    About the Author

    David Glazer is a founder and senior architect at Verity, Inc., one of the leading suppliers of text retrieval and information delivery technology and products. He has 17 years of industry experience, including authoring the Lotus Manuscript word processing program. Most recently, he has been working on redesigning Verity's concept retrieval technology and applying it to the world of the Internet. Mr. Glazer has a bachelor's degree in physics from the Massachusetts Institute of Technology.

    He can be reached by email at dglazer@verity.com, or by phone at 415-960-7600.