Lessons Learned Implementing a
    Navigation Server for the Web

    David Glazer
    Senior Architect
    Verity Inc.

    Abstract

      The World-Wide Web provides an excellent solution to the problem of making information available to a large user community, including users who have very little familiarity with the information being provided. This solution has created a new need - tools for helping users find their way through the plethora of available content.

      The problem of information navigation is discussed. Some available navigation technologies, both on and off the Web, are surveyed. The issues are looked at from both a publishing point of view ("How can I make my information useful to the world?") and a consumer point of view ("How can I access information useful to me?").

      A case study of a navigation-centric server is presented. Highlights of the development process are given from requirements definition, through architecture design and implementation, all the way to user reaction and feedback.


    Information Navigation

    Information navigation is all about making connections between people and information. The more raw data people have access to, the more important the quality of their navigation tools becomes. Navigation, done right, tames the seas of data into useful sources of information.

    Navigation Systems

    We have found it useful to break the information navigation problem down as follows. Navigation is required when a person has an information need. Navigation systems satisfy that need by combining three types of tools.

    • tools to provide easy access to the right raw data
    • tools to transform the data into relevant information by filtering out noise and adding organizational structure
    • tools to deliver the information to the consumer

    An example of a navigation system built on manual technology is the daily newspaper. Newspapers have access to many sources of data, including externally provided services and internally directed reporters. They create relevant information by employing skilled content experts - editors - who both filter and organize the raw data. They deliver information by laying it out in an accessible format, putting it on paper, and using fleets of trucks and bicycles to move that paper to the consumers' doorsteps.

    As with all connection and communication technologies, the value of a navigation system depends on the number of possible connections it can provide. A system that can connect one person to one data source intelligently is valuable, just as a fax machine that could only connect one field office to one home office would be valuable. Every new fax machine installed would make the existing fax machines more valuable. Similarly, every new data source handled, and every new delivery method supported, makes an existing navigation system more valuable.

    This self-reinforcing dynamic means that developers of navigation systems are always on the lookout for ways to extend their reach. The World-Wide Web does an excellent job of giving many people easy access to huge amounts of data. In terms of the above three categories, it provides both an important new source of data and an important new means of information delivery. It therefore creates an opportunity to make existing navigation systems even more valuable.

    Computer-based Navigation Tools

    • Data access tools

      In order for a computer to work with data, it must first be able to physically access the data. The providers of disk and network technology continue to put more and more megabytes within reach of the machines on our desks.

      More than simple physical access is required, however. Data comes in an ever-increasing number of formats. Tools are needed to translate from one format to another, to extract the meaningful fields from structured repositories, to maintain proper handshaking with providers of feeds such as newswires, and in general to act as interpreters on the Tower of Babel construction site that our computers have become.

      Tools that provide both physical and logical access to data are required for a navigation system to work at all. For it to work well, the tools must be invisible. They should remove the artificial barriers between different types of data. Information-hungry users should be able to focus on the content being provided and not have to think about the mechanical details of transport or representation.

    • Filtering and organization tools

      Once data is accessible, filtering tools are needed to separate the wheat from the chaff. The job of these tools is to predict how well each particular piece of data would satisfy a user's information need. Note that these tools can be manual, providing the user with a summary of the data (e.g. its filename and file size) but leaving the actual relevance decision to a person.

      The relevance decision can also be made automatically. Software-based relevance calculation tools demonstrate the familiar strengths and weaknesses of computers. It's relatively easy to make them do a wonderful job of automating drudgery (e.g. "find all the files in this directory that contain the word 'aardvark'"), and much more difficult to get them to replace common sense (e.g. "let me know when something exciting gets posted").
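      The kind of drudgery that computers automate well can be seen in a short sketch. The following function (name and approach are illustrative, not any particular product's code) performs the 'aardvark' task from the example above:

```python
import os

def files_containing(directory, word):
    """Return the names of the files in `directory` whose text
    contains `word` -- the 'find all files containing aardvark'
    task, automated."""
    matches = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        # Read each file's text, skipping undecodable bytes.
        with open(path, errors="ignore") as f:
            if word in f.read():
                matches.append(name)
    return matches
```

      The "common sense" tasks mentioned above resist this treatment precisely because they cannot be reduced to a mechanical test like the one in the loop.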

      Different tools work well with different amounts of data. If you only have a few megabytes of data to filter, the Unix string-search utility 'grep' does an acceptable job. However, its usefulness breaks down as the data volume increases. The following table gives an idea of the kind of tools available and the volumes at which they are useful.

      Data Volume        Simplest Useful Tool
         <10 Meg         grep
         <100 Meg        Boolean search; hierarchical directory trees
         <1 Gig          results ranking; manual keywording
         <10 Gig         true relevance ranking; automatic categorization
         <100 Gig        semantic libraries
         Terabytes       ???
      
    • Information delivery tools

      Once the data has been turned into relevant information, tools are needed to deliver it to the user. These tools include information servers, graphical user interfaces, and gateways to other applications such as email.

      To be most useful, these tools need to go to wherever the users are. Since many users use email, there are tools to deliver relevant information by email. Since many users use fax machines, there are tools to deliver information by fax. Since many users use large software environments like Lotus Notes and the World-Wide Web, there are tools to deliver information into those environments.
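
      Delivery by email is the simplest of these gateways to sketch. The function below packages a list of search results as a mail message (the addresses and subject line are illustrative placeholders, not a real deployment's):

```python
from email.message import EmailMessage

def build_digest(recipient, results):
    """Package a list of (title, url) search results as an email
    message, ready to hand off to an SMTP connection."""
    msg = EmailMessage()
    msg["To"] = recipient
    msg["From"] = "navigator@example.com"   # placeholder sender
    msg["Subject"] = "New documents matching your interest profile"
    body = "\n".join("%s\n  %s" % (title, url) for title, url in results)
    msg.set_content(body)
    return msg  # e.g. smtplib.SMTP(host).send_message(msg) to deliver
```

      Fax, Lotus Notes, and Web delivery each need their own such gateway; the filtering machinery behind them stays the same.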

    A Navigation Server for the Web

    We have been building navigation systems made up of data access, filtering and organization, and information delivery tools for several years. The rapidly increasing prominence of the World-Wide Web made it clear that we needed to add Web capabilities to our arsenal of tools. We set out to do so last spring.

    This section discusses some of the design decisions we were faced with. It describes the choices we made and the reactions our Alpha and Beta users had to those choices.

    • Web as data source vs. Web as delivery vehicle

      The Web provides both a huge new source of data and an easy-to-use, wide-area information distribution channel. We decided to take advantage of both capabilities.

      Some of our users have large existing Web sites, so our server comes with tools (such as an indexing spider) and features (such as automatically ignoring HTML tags during indexing) to make it easy to search those sites. Other users were new to the Web, but had large amounts of ASCII and word-processing data that they wanted to distribute to remote locations. Our server supports them with features such as the ability to translate the text from many standard document formats into HTML.

      For the most part, our customers like what we've done. The biggest request in this area is for access to more of the data in the existing Web documents. People want to be able to get at document meta-information (such as <TITLE> tags) in addition to the text contents of the documents.
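
      The two indexing features mentioned above can be sketched together. The function below strips HTML tags from a page and also pulls out the <TITLE> meta-information that users have been asking for; a regular expression is adequate for this illustration, though a production indexer would want a proper HTML parser:

```python
import re

def index_fields(html):
    """Separate a Web page into indexable text and meta-information:
    the <TITLE> contents and the tag-free body text."""
    match = re.search(r"<title>(.*?)</title>", html,
                      re.IGNORECASE | re.DOTALL)
    title = match.group(1).strip() if match else ""
    text = re.sub(r"<[^>]+>", " ", html)   # drop markup
    text = " ".join(text.split())          # collapse whitespace
    return {"title": title, "text": text}
```
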

    • Client vs. server

      It was tempting to work on both sides of the wire. Adding code to the client would have allowed us to build a much friendlier UI, especially in the area of query building - allowing the user to express an information need. It would also have allowed us to build a more responsive delivery mechanism by letting the server push new information to the client as it became available.

      However, the vast size of the installed base of clients, coupled with the advantages of conforming to existing standards, led us to a more limited but more deployable choice. We decided to write only server-side code, and to use only features that are supported by existing Web browsers. (We do require forms support on the browser.)

      The feedback is mostly positive - our customers like being able to use their existing browsers. However, we do get requests for features that can only be implemented with client-side code. We are currently investigating ways to add extensions to existing browsers for users who want these features.

    • CGI vs. standalone server vs. cooperating server

      Our first inclination was to implement our server using CGI. Customers would then be able to use the administration, access control, and logging features they were used to. We wouldn't have to deal with HTTP, either today or as it evolves. However, the performance implications of spawning a new process on every interaction were unattractive.

      Our next thought was to implement our own HTTP server, probably by modifying one of the publicly available ones. The problem with this approach is that it needlessly bundles capabilities. Users should be able to choose the server that provides normal browsing services independently from the server that provides search and navigation.

      Our decision was to implement a lightweight navigation-only HTTP server from scratch. It would typically run on the same machine as an 'ordinary' server such as httpd, but advertise on a different port. Links to the navigation server would be provided from the pages served by the ordinary server.
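
      The shape of such a navigation-only server can be sketched in a few lines. This is a modern illustrative mock-up, not our implementation: it answers only search requests, on a port of its own, and refuses everything else, leaving ordinary page serving to the regular httpd:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

class NavigationHandler(BaseHTTPRequestHandler):
    """Handles only /search requests; all other paths get a 404,
    since ordinary browsing belongs to the regular Web server."""
    def do_GET(self):
        url = urlparse(self.path)
        if url.path != "/search":
            self.send_error(404, "this server only handles searches")
            return
        query = parse_qs(url.query).get("q", [""])[0]
        body = ("<html><body><h1>Results for '%s'</h1></body></html>"
                % query).encode("ascii", "ignore")
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To run on a port distinct from the ordinary server's port 80:
#   HTTPServer(("", 8080), NavigationHandler).serve_forever()
```

      Links on the pages served by the ordinary server then simply point at the navigation server's port.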

      The early feedback is mixed. The independent server performs well, and users who are already running a Web server are happy to not have to reconfigure it. However, the need to administer two servers is unfortunate, especially for users who are new to the Web.

    • Stateless vs. stateful

      HTTP is, by design, a connectionless protocol. This has led most HTTP servers to be stateless. However, it is possible to implement a stateful server that is fully HTTP compliant. The advantages of statelessness are simplicity and robustness - if you don't remember anything from request to request, there's less that can go wrong. The disadvantage is functionality - certain features need server-side state to be implemented efficiently.

      We decided to explicitly maintain state on our server. We use it to keep track of user information such as what they're interested in, what they're allowed to see, and what they're in the middle of looking for. One particularly important use of state is in searches over large databases (i.e. on the order of a million documents or more). Our server can return the first segment of the results to users, allowing them to start using the information, while it continues to search the balance of the data for more information.
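
      The segmented-results idea can be sketched with a session table keyed by a per-user token. This is a minimal illustration of server-side state, under the assumption that each request carries its session token; the names and segment size are invented for the example:

```python
import itertools

# In-memory session table: one in-progress search per session token.
sessions = {}

def start_search(token, matching_documents, segment_size=2):
    """Begin a search; return the first segment of results at once
    while the remainder stays available under the session token."""
    results = iter(matching_documents)  # stands in for an ongoing search
    sessions[token] = results
    return list(itertools.islice(results, segment_size))

def next_segment(token, segment_size=2):
    """Return the next batch of results for an in-progress search."""
    results = sessions.get(token)
    if results is None:
        return []
    batch = list(itertools.islice(results, segment_size))
    if not batch:
        del sessions[token]             # search exhausted; drop the state
    return batch
```

      A stateless server would have to rerun the whole search on every request for the next page of results; the session token makes continuation cheap, at the cost of remembering something between requests.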

      While some of our users were initially skeptical about our use of state, most of them ended up liking the benefits. There were a number of requests for adding even more information (e.g. letting a user choose their favorite query form) to the state we keep on the server.

    We are continuing to gather feedback and refine the capabilities of our server. We expect to eventually offer a family of products to satisfy the varying needs of our users.

    Conclusion

    The popularity of the World-Wide Web provides a challenge and an opportunity for all researchers and developers working on information navigation systems. We were pleased to discover how valuable good navigation tools are on this new electronic frontier. However, there's a long way left to go. Currently available tools, including our navigation server, are only a good start. Many of the hard and interesting problems presented by the distributed, multimedia, multi-author, rapidly evolving world that is the Web have yet to be addressed.


    About the Author

    David Glazer is a founder and senior architect at Verity, Inc., one of the leading suppliers of text retrieval and information delivery technology and products. He has 17 years of industry experience, including authoring the Lotus Manuscript word processing program. Most recently, he has been working on redesigning Verity's concept retrieval technology and applying it to the world of the Internet. Mr. Glazer has a bachelor's degree in physics from the Massachusetts Institute of Technology.

    He can be reached by email at dglazer@verity.com, or by phone at 415-960-7600.