The problem of information navigation is discussed. Some available navigation technologies, both on and off the Web, are surveyed. The issues are examined from both a publisher's point of view ("How can I make my information useful to the world?") and a consumer's point of view ("How can I access information useful to me?").
A case study of a navigation-centric server is presented. Highlights of the development process are given from requirements definition, through architecture design and implementation, all the way to user reaction and feedback.
An example of a navigation system built on manual technology is the daily newspaper. Newspapers have access to many sources of data, including externally provided services and internally directed reporters. They create relevant information by employing skilled content experts - editors - who both filter and organize the raw data. They deliver information by laying it out in an accessible format, putting it on paper, and using fleets of trucks and bicycles to move that paper to the consumers' doorsteps.
As with all connection and communication technologies, the value of a navigation system depends on the number of possible connections it can provide. A system that can connect one person to one data source intelligently is valuable, just as a fax machine that could only connect one field office to one home office would be valuable. Every new fax machine installed would make the existing fax machines more valuable. Similarly, every new data source handled, and every new delivery method supported, makes an existing navigation system more valuable.
This self-reinforcing dynamic means that developers of navigation systems are always on the lookout for ways to extend their reach. The World-Wide Web does an excellent job of giving many people easy access to huge amounts of data. In terms of the three functions described above (gathering data, creating relevant information, and delivering it), the Web provides both an important new source of data and an important new means of information delivery. It therefore creates an opportunity to make existing navigation systems even more valuable.
More than simple physical access is required, however. Data comes in an ever-increasing number of formats. Tools are needed to translate from one format to another, to extract the meaningful fields from structured repositories, to maintain proper handshaking with providers of feeds such as newswires, and in general to act as interpreters on the Tower of Babel construction site that our computers have become.
Tools that provide both physical and logical access to data are required for a navigation system to work at all. For it to work well, the tools must be invisible. They should remove the artificial barriers between different types of data. Information-hungry users should be able to focus on the content being provided and not have to think about the mechanical details of transport or representation.
The relevance decision can also be made automatically. Software-based relevance calculation tools demonstrate the familiar strengths and weaknesses of computers. It's relatively easy to make them do a wonderful job of automating drudgery (e.g. "find all the files in this directory that contain the word 'aardvark'"), and much more difficult to get them to replace common sense (e.g. "let me know when something exciting gets posted").
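As a concrete illustration of the "drudgery" end of that spectrum, here is a minimal sketch (in Python, purely hypothetical and not part of any product described here) of the aardvark example: list every file in a directory that contains a given word.

```python
# Minimal sketch of automated drudgery: list the files in a directory
# that contain a given word. (Directory and keyword are illustrative.)
import os

def files_containing(directory, word):
    matches = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        try:
            with open(path, "r", errors="ignore") as f:
                if word in f.read():
                    matches.append(name)
        except OSError:
            pass  # unreadable files are simply skipped
    return matches

print(files_containing(".", "aardvark"))
```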
Different tools work well with different amounts of data. If you only have a few megabytes of data to filter, the Unix string-search utility 'grep' does an acceptable job. However, its usefulness breaks down as the data volume increases. The following table gives an idea of the kinds of tools available and the volumes at which they are useful.
Data Volume    Simplest Useful Tool
< 10 Meg       grep
< 100 Meg      Boolean search; hierarchical directory trees
< 1 Gig        results ranking; manual keywording
< 10 Gig       true relevance ranking; automatic categorization
< 100 Gig      semantic libraries
Terabytes      ???
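To make the middle rows of the table concrete, the following hypothetical sketch shows the simplest form of results ranking: score each document by how often the query terms occur and return the best matches first. Real relevance ranking uses much richer evidence, but the shape of the computation is the same.

```python
# Minimal sketch of results ranking: score each document by the number of
# query-term occurrences and return the documents best-first.
# (The document collection and query are illustrative only.)
from collections import Counter

def rank(documents, query):
    terms = query.lower().split()
    scored = []
    for name, text in documents.items():
        counts = Counter(text.lower().split())
        score = sum(counts[t] for t in terms)
        if score > 0:
            scored.append((score, name))
    return [name for score, name in sorted(scored, reverse=True)]

docs = {
    "a.txt": "aardvark habits and aardvark diet",
    "b.txt": "notes on the diet of mammals",
}
print(rank(docs, "aardvark diet"))  # ['a.txt', 'b.txt']
```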
To be most useful, these tools need to go to wherever the users are. Since many users use email, there are tools to deliver relevant information by email. Since many users use fax machines, there are tools to deliver information by fax. Since many users use large software environments like Lotus Notes and the World-Wide Web, there are tools to deliver information into those environments.
This section discusses some of the design decisions we were faced with. It describes the choices we made and the reactions our Alpha and Beta users had to those choices.
Some of our users have large existing Web sites, so our server comes with tools (such as an indexing spider) and features (such as automatically ignoring HTML tags during indexing) to make it easy to search those sites. Other users are new to the Web, but have large amounts of ASCII and word-processing data that they want to distribute to remote locations. Our server supports them with features such as the ability to translate text from many standard document formats into HTML.
For the most part, our customers like what we've done. The biggest request in this area is for access to more of the data in the existing Web documents. People want to be able to get at document meta-information (such as <TITLE> tags) in addition to the text contents of the documents.
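A hypothetical sketch of what this kind of indexing-time HTML handling can look like: markup is ignored when collecting body text, while the <TITLE> contents are kept as document meta-information. This is illustrative only, not the server's actual indexing code.

```python
# Sketch of indexing-time handling of HTML: ignore markup when collecting
# body text, but keep the <TITLE> contents as document meta-information.
# (Hypothetical illustration, not the product's indexing code.)
from html.parser import HTMLParser

class IndexableText(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.body_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        else:
            self.body_text.append(data)

parser = IndexableText()
parser.feed("<html><head><title>Q3 Report</title></head>"
            "<body><h1>Results</h1><p>Revenue grew.</p></body></html>")
print(parser.title)                         # the meta-information: "Q3 Report"
print(" ".join(parser.body_text).split())   # the words to index, tags ignored
```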
However, the vast size of the installed base of clients, coupled with the advantages of conforming to existing standards, led us to a more limited but more deployable choice. We decided to only write server-side code, and to only use features that are supported by existing Web browsers. (We do require forms support on the browser.)
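A minimal sketch of this server-side-only style, assuming a hypothetical /search endpoint: the server emits an ordinary HTML form and does all query handling itself when the form is submitted, so any forms-capable browser works unchanged.

```python
# Sketch of the server-side-only approach: the navigation server sends an
# ordinary HTML form, and all query handling happens on the server when the
# form is submitted. The browser only needs standard forms support.
# (Form fields, endpoint, and page text are illustrative only.)
from urllib.parse import parse_qs

def query_form_page():
    return ('<html><body>'
            '<form method="GET" action="/search">'
            'Search for: <input name="q">'
            '<input type="submit" value="Go">'
            '</form></body></html>')

def results_page(query_string):
    # query_string is the form-encoded text after '?', e.g. "q=aardvark"
    fields = parse_qs(query_string)
    terms = fields.get("q", [""])[0]
    return "<html><body>Results for: %s</body></html>" % terms

print(query_form_page())
print(results_page("q=aardvark"))
```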
The feedback is mostly positive - our customers like being able to use their existing browsers. However, we do get requests for features that can only be implemented with client-side code. We are currently investigating ways to add extensions to existing browsers for users who want these features.
Our next thought was to implement our own HTTP server, probably by modifying one of the publicly available ones. The problem with this approach is that it needlessly bundles capabilities. Users should be able to choose the server that provides normal browsing services independently from the server that provides search and navigation.
Our decision was to implement a lightweight navigation-only HTTP server from scratch. It would typically run on the same machine as an 'ordinary' server such as httpd, but advertise on a different port. Links to the navigation server would be provided from the pages served by the ordinary server.
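The following sketch (hypothetical port, paths, and responses) shows the shape of that arrangement: a small HTTP server that answers navigation requests on its own port, while the ordinary server keeps serving pages that simply link to it.

```python
# Minimal sketch of a navigation-only HTTP server that runs alongside an
# 'ordinary' Web server on the same machine but advertises on its own port
# (8080 here; the port and paths are illustrative only). Pages served by
# the ordinary server would link to it, e.g.:
#   <a href="http://www.example.com:8080/search">Search this site</a>
from http.server import BaseHTTPRequestHandler, HTTPServer

class NavigationHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # A real navigation server would dispatch on self.path here
        # (query forms, result pages, and so on).
        body = b"<html><body>Navigation server is running.</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # The ordinary server (e.g. httpd) keeps port 80; we take 8080.
    HTTPServer(("", 8080), NavigationHandler).serve_forever()
```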
The early feedback is mixed. The independent server performs well, and users who are already running a Web server are happy to not have to reconfigure it. However, the need to administer two servers is unfortunate, especially for users who are new to the Web.
We decided to explicitly maintain state on our server. We use it to keep track of user information such as what they're interested in, what they're allowed to see, and what they're in the middle of looking for. One particularly important use of state is in searches over large databases (i.e. on the order of a million documents or more). Our server can return the first segment of the results to users, allowing them to start using the information, while it continues to search the balance of the data for more information.
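A hedged sketch of how such per-user state might be kept: each search gets a session entry on the server, the first segment of results is returned immediately, and later requests pick up where the previous one left off. In this toy version the full result list stands in for the background search that a real server would continue running.

```python
# Hedged sketch of server-side state for large searches: hand back the first
# segment of results right away and remember where the search left off.
# (Session handling and the result source are illustrative only.)
import uuid

SESSIONS = {}       # session id -> saved search state
SEGMENT_SIZE = 10

def start_search(matching_docs):
    session_id = uuid.uuid4().hex
    SESSIONS[session_id] = {"results": matching_docs, "offset": 0}
    return session_id, next_segment(session_id)

def next_segment(session_id):
    state = SESSIONS[session_id]
    start = state["offset"]
    segment = state["results"][start:start + SEGMENT_SIZE]
    state["offset"] = start + len(segment)
    return segment

# Usage: the first request creates the session; later requests continue it.
sid, first = start_search(["doc%d" % i for i in range(25)])
print(first)              # doc0 .. doc9
print(next_segment(sid))  # doc10 .. doc19
```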
While some of our users were initially skeptical about our use of state, most of them ended up liking the benefits. There were a number of requests for adding even more information (e.g. letting a user choose their favorite query form) to the state we keep on the server.
He can be reached by email at dglazer@verity.com, or by phone at 415-960-7600.