Information Discovery and Distillation in Government:

An Experience Report

Second International WWW Conference '94
Chicago, IL
October 1994


Paul D. Boyer












Abstract

By now, everyone knows about some of the kinds of services publicly available on the World Wide Web, the web of interconnected computers open to everyone on the Internet. This paper raises the curtain on what goes on behind the scenes on Government computer networks that are not connected to the Internet and shows some Webs that are not publicly available.

SAIC provides innovative technical solutions to information challenges within Government. During the last 10 months, SAIC's Information Discovery and Distillation Systems Division has been tasked by various governmental agencies to:

This report describes how we used the web to leverage our small team to achieve significant productivity enhancement during development and how we delivered information solutions to the Government that surpassed their expectations. Also illustrated are some unique solutions that were incorporated into WWW technologies along the way.

1. Introduction

Science Applications International Corporation (SAIC) has, over the course of the last 10 months, been asked by several governmental agencies to assist them in setting up various information discovery and distillation systems. In each case, the customer required some means to distribute information among employees using their existing computer networks. As you will see, that is about where the similarity ended. Yet, in all cases the World Wide Web provided a major part of the solution.

2. Message Handling and Retrieval System

SAIC is under contract with a defense organization that consists of over 500 personnel working in a closed, heterogeneous workstation environment (PC, Macintosh, and Sun) which has physical electronic connectivity but no appropriate means to receive, find, and share information electronically.

To provide for the efficient flow of information in this environment, we integrated numerous public domain and COTS software products so that users can have information distilled based upon their needs and can also have a means of discovering information which can supplement their current sources.

Figure 1: Message Handling Home Page

WWW serves as the conduit for integrating Topic, News, and a Web server for information distillation and WAIS, Sybase, and the same Web server for information discovery. In the distributed, client/server environment which has been created for this customer, these technologies eliminate the costly loss of information and slow information flow within this area.

In order to integrate the aforementioned technologies in an environment with homogeneous servers but heterogeneous clients, custom gateways and flow-control scripts were created to allow for the management of real-time data flow and the presentation of information in a fashion which was consistent across all clients. Flow-control scripts have been written in C and UNIX shell and gateways have been created using Perl and HTML Forms (with associated C routines to manage form inputs). Through this integration, multiple technologies can be applied via WWW/Mosaic to service the broad needs in the Department of Defense.

Figure 2: Message Handling and Retrieval System Architecture

How the MHRS Works

The organization's messages are received electronically in a central location. Each message must be routed to the appropriate office based upon matching a keyword profile previously created using Verity's Topic software. Each message may match the profile of zero, one, or more organizations. Messages can have various priorities. High priority messages are immediately sent to the user through electronic mail. Routine and normal messages are categorized by office and are sent to corresponding newsgroups using the Internet News Daemon. Messages are then available immediately for browsing and are also stored into an archive for later retrieval.
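The routing rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the keyword sets, office names, and priority labels below are hypothetical, and simple word matching stands in for Verity's Topic profiles, which are far more capable.

```python
from dataclasses import dataclass

# Hypothetical keyword profiles; the real profiles were built with Verity's Topic.
PROFILES = {
    "logistics": {"fuel", "convoy", "supply"},
    "operations": {"exercise", "deployment"},
}


@dataclass
class Message:
    subject: str
    body: str
    priority: str  # "high", "normal", or "routine" (assumed labels)


def matching_offices(msg, profiles=PROFILES):
    """Return every office whose profile shares a keyword with the message."""
    words = set((msg.subject + " " + msg.body).lower().split())
    return sorted(office for office, keys in profiles.items() if keys & words)


def route(msg, profiles=PROFILES):
    """Return (delivery, office) pairs: e-mail for high priority, news otherwise."""
    delivery = "email" if msg.priority == "high" else "newsgroup"
    return [(delivery, office) for office in matching_offices(msg, profiles)]
```

A message that matches no profile yields an empty list, which corresponds to the "zero, one, or more organizations" case in the text.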

The first time a user starts up Mosaic, they are presented with a form that asks them what organization they wish to be their default. They simply select one of the organizations and press the submit button. This immediately takes them to the home page, which is dynamically generated to have hypertext links to the proper newsgroups and archives for their organization. From then on when the user starts up Mosaic, the home page appears with their default choice based upon their IP address. A hypertext link is provided which allows the user to change their default organization if they should move to another computer.
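A minimal sketch of this behavior follows, assuming the gateway keeps a table of defaults keyed by client IP address. The in-memory dictionary and the /cgi-bin/ script names are hypothetical placeholders, not the names used in the delivered system.

```python
import html
from urllib.parse import quote

# Hypothetical in-memory store; a real gateway would persist this between requests.
defaults = {}


def set_default(ip, organization):
    """Remember the organization chosen from the given client address."""
    defaults[ip] = organization


def home_page(ip, organizations):
    """Return the selection form on first contact, else the organization's home page."""
    org = defaults.get(ip)
    if org is None:
        options = "".join(f"<option>{html.escape(o)}</option>" for o in organizations)
        # The script names below are placeholders for illustration.
        return ('<form method="POST" action="/cgi-bin/set-default">'
                f'<select name="org">{options}</select>'
                '<input type="submit" value="Submit"></form>')
    return (f"<h1>{html.escape(org)}</h1>"
            f'<a href="/cgi-bin/news?group={quote(org)}">Current messages</a> '
            f'<a href="/cgi-bin/archive?org={quote(org)}">Archive</a> '
            '<a href="/cgi-bin/change-default">Change default organization</a>')
```

Keying on the IP address means no login step is needed, at the cost of requiring the user to choose again after moving to another computer, which is why the home page carries a link to change the default.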

The user is given the ability to browse current messages by using the Mosaic software in conjunction with the NCSA httpd 1.3 web server and a custom news gateway that we have developed. In the beginning, we used the inherent news reading capabilities built into Mosaic. However, the users complained about having to step back and forth between the subject headers and the text in order to read the messages. What they wanted was the ability to have "Next" and "Previous" buttons embedded within each message. We felt it was very important to use the same interface for all message reading activity, so we rejected the idea of installing a dedicated news reader such as "xvnews" or "xrn." We wanted Mosaic to be the sole interface for all message browsing.

Our custom news gateway reads the .overview file for the particular newsgroup to create a dynamic set of HTML files that include "Next," "Previous," and "Back" buttons to allow more rapid browsing through all messages. The user is first presented with a subject list of messages that were posted since the previous evening. When the user selects the first message, a gateway program is used to generate the HTML that includes the next and previous links and a link back to the subject list. We have not developed a threaded news reading capability since our users do not get threaded messages.
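The idea behind the gateway can be illustrated with a short Python sketch. It assumes each overview line is tab-separated with the article number first and the subject second; the URL layout (one HTML file per article under a directory named for the group) is a hypothetical choice made for the example.

```python
import html


def parse_overview(text):
    """Return (article_number, subject) pairs from tab-separated overview lines."""
    entries = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2 and fields[0].isdigit():
            entries.append((int(fields[0]), fields[1]))
    return entries


def message_page(entries, index, body, group):
    """Build the HTML for one message with Previous, Next, and Back links."""
    number, subject = entries[index]
    links = []
    if index > 0:
        links.append(f'<a href="{group}/{entries[index - 1][0]}.html">Previous</a>')
    if index < len(entries) - 1:
        links.append(f'<a href="{group}/{entries[index + 1][0]}.html">Next</a>')
    # "Back" always returns to the subject list for the newsgroup.
    links.append(f'<a href="{group}/index.html">Back</a>')
    return (f"<title>{html.escape(subject)}</title>\n"
            f"<pre>{html.escape(body)}</pre>\n" + " | ".join(links))
```

Because the links are computed from the position in the overview list, the first message has no Previous link and the last has no Next link.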

Once messages are processed by the news daemon, they are sent to a custom archive program we call "dispatcher" that performs three tasks. First, it creates the archival symbolic links from the message file located in the office directory to an archive directory. These symbolic links allow us to store the message only once, but for many purposes. Then the dispatcher calls "sybase_send," which parses the fields from the message and inserts the message header fields into a Sybase database with one field being the path and filename of the archived message. Finally, dispatcher calls "append2day" which calls waisindex to add the message to the daily index so that current messages can be found by keyword.
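The three dispatcher tasks can be sketched as follows. This is not the delivered program: SQLite stands in for Sybase, a plain list file stands in for the call to waisindex, and the table layout and function names are assumptions made for the example.

```python
import os
import sqlite3
from email.parser import HeaderParser


def archive_link(message_path, archive_dir):
    """Task 1: link the stored message into the archive directory."""
    os.makedirs(archive_dir, exist_ok=True)
    link = os.path.join(archive_dir, os.path.basename(message_path))
    if not os.path.lexists(link):
        os.symlink(os.path.abspath(message_path), link)
    return link


def store_headers(db, link):
    """Task 2: stand-in for sybase_send, using SQLite instead of Sybase."""
    with open(link, encoding="utf-8") as handle:
        headers = HeaderParser().parse(handle)
    db.execute("CREATE TABLE IF NOT EXISTS messages "
               "(subject TEXT, sender TEXT, date TEXT, path TEXT)")
    db.execute("INSERT INTO messages VALUES (?, ?, ?, ?)",
               (headers["Subject"], headers["From"], headers["Date"], link))
    db.commit()


def queue_for_daily_index(link, daily_list):
    """Task 3: stand-in for append2day; records the file for the daily indexer."""
    with open(daily_list, "a", encoding="utf-8") as handle:
        handle.write(link + "\n")


def dispatch(message_path, archive_dir, db, daily_list):
    """Run the three tasks in order and return the archive link."""
    link = archive_link(message_path, archive_dir)
    store_headers(db, link)
    queue_for_daily_index(link, daily_list)
    return link
```

Storing only the path in the database row keeps the message body in a single file on disk, which is the point of the symbolic links.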



Figure 3: Message Handling Query Form

The WAIS keyword indexes are built for each organization and are kept for 60 days. One challenge that had to be overcome was that the customer needed messages to be added to the index in real-time. Performance limitations and locked access did not allow us to simply add the new files to the 60 day archive. Instead, we created a separate, smaller index that just holds today's incoming messages. This smaller index is deleted nightly and the 60 day archive is rebuilt after removing the files older than 60 days. At that point a new daily index is created to hold the incoming messages.
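The two-index arrangement can be modeled in a few lines of Python. The sketch below keeps both indexes in memory as dictionaries and uses substring matching in place of WAIS; the class and method names are hypothetical, and only the rollover logic is meant to mirror the description above.

```python
from datetime import date, timedelta

RETENTION = timedelta(days=60)


class TwoTierIndex:
    """A small daily index in front of a larger archive that is rebuilt nightly."""

    def __init__(self):
        self.archive = {}  # path -> (received date, text), rebuilt nightly
        self.today = {}    # path -> (received date, text), cheap to update

    def add(self, path, received, text):
        """Index an incoming message immediately in the small daily index."""
        self.today[path] = (received, text)

    def nightly_rollover(self, now):
        """Fold today's messages into the archive, drop expired ones, start a new day."""
        merged = {**self.archive, **self.today}
        self.archive = {p: v for p, v in merged.items() if now - v[0] <= RETENTION}
        self.today = {}

    def search(self, keyword):
        """Search both indexes so current and archived messages are found together."""
        keyword = keyword.lower()
        hits = [p for tier in (self.today, self.archive)
                for p, (_, text) in tier.items() if keyword in text.lower()]
        return sorted(hits)
```

The design trades a second lookup at query time for the ability to add messages during the day without touching the large archive index.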

To provide the users the ability to find specific messages, either current or archived, we created a custom gateway which allows fielded and free-text searching combined. Our gateway hides from the user whether the query is being posted to Sybase or WAIS. The result is that the user can enter a fielded query, say a date range, at the same time entering a free text query, such as some key words, and the results from both Sybase and WAIS are correlated and presented to the user in a unified result.
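One plausible way to correlate the two result sets is sketched below. The text does not say how the gateway combines them, so the rule used here (keep only messages returned by both searches, in the free-text order) is an assumption, as is the use of the archived message path as the common key.

```python
def correlate(fielded_hits=None, text_hits=None):
    """Combine fielded and free-text results into one list of message paths.

    Each argument is a list of archived-message paths, or None when that
    part of the query form was left blank.
    """
    if fielded_hits is None and text_hits is None:
        return []
    if fielded_hits is None:
        return list(text_hits)
    if text_hits is None:
        return list(fielded_hits)
    # Assumed rule: a message must satisfy both parts of the query.
    allowed = set(fielded_hits)
    return [path for path in text_hits if path in allowed]
```

The path works as a join key in this sketch because, as described earlier, the database row for each message records the path and filename of the archived file.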

Considering that the customer's previous message handling system was paper, the MHRS has been a resounding success. Not only can users get current data faster, but they now have a way to search vast stores of archived messages, enabling them to prepare reports with more depth. And it doesn't matter what kind of computer they use.

3. Personal Data Server

Government analysts need a database management system that they can configure and use without requiring system administration support. A recurring comment is "I can do more on my home PC than I can on these SPARCstations." Now that the analysts have ample processing capacity, they need to have the productivity tools that make their jobs easier. Perhaps at home they use a program like Claris's FileMaker Pro to keep address lists or recipes. Why can't they have an easy-to-use database like this at work to get things done? Perhaps they want to keep a record of the activities of certain entities. Or maybe they just want a record of people in their organization and what skills each of them has. These types of databases could all be set up now by using an RDBMS with the help of the database and system administrators. But for these simple tasks, couldn't the users set up and administer their own databases like they do at home? We think so. That is why we would like to describe our concept of a "Personal Data Server," which will be to databases what PCs are to mainframes.



Figure 4: Personal Data Server Query and Result

Just like the commercial products available for the PC, the Personal Data Server (PDS) must be extremely simple to set up and administer, with a graphical user interface that makes data entry and data query as easy as possible. It should be nowhere near as complex as Sybase or Oracle to set up. There must be several interfaces to the database, ranging from simple commands typed in a UNIX shell to a custom X-Windows interface which displays tabular data to the use of existing tools which query the database and display maps, timelines, network diagrams, tables, graphs, and a whole host of other possibilities. If the PDS were an engine, it would be a lawnmower engine that could be used for a go-cart or a power generator or many other small tasks, but certainly not a locomotive engine or jet engine.

The PDS should also allow the data to be viewed by selected others on the network, allowing the user to share the information they have accumulated. Living up to the "S" in its name, the PDS will serve data not only to the user that maintains it but also to other designated members of the team or even the entire organization, whatever the user wants to do. Eventually, a network of Personal Data Servers could be set up so that the results of a whole team of analysts can be automatically merged together to create a daily "analysis newspaper" custom designed by each individual, complete with graphics, images, and text summaries relevant to the team's analytic area.

SAIC has spent several years developing a database browser tool, called Screenwork, that is simple to use and provides a great deal of functionality without requiring Sybase or other software licenses. Through our dealings with analysts we have continually been asked if there is a way that they can save the information they are browsing for later review or accumulate data over a long term for analysis or simply keep a list of relationships between entities of their choosing. Until now, there has not been.

The Personal Data Server had some guiding principles for its development. The PDS must:

We envision the PDS being used for a variety of things from address lists to zonal activity databases. An analyst might keep a personal record of the activity of an entity and at the end of each month query their PDS to show the trends in activity of the entity they are monitoring. In short, our vision is that analysts should spend time doing analysis, not manipulating data.

Work on the preliminary version of the PDS is complete. Currently, only UNIX platforms are supported directly. The undercarriage of the PDS utilizes a set of Perl scripts called RDB [1]. These scripts provide filters, like "row" and "column," which serve to extract data from tab-separated database files. Our interfaces to these scripts provide an easy way for the users to manipulate their databases without having to learn the syntax of these scripts.

How the PDS Works

The PDS is packaged as a tar file which can be easily installed in each user's home directory. The package includes the NCSA httpd_1.3 web server, the RDB utilities, and our custom cgi-bin scripts and HTML files. The users run an installation script that sets up the configurations. We assume that the users already have Mosaic installed. Next, they start their own httpd server (operating on an unprivileged port) by typing "StartPDS." This simply starts httpd and points to the configuration directory. Next, they are instructed to start Mosaic by typing Mosaic -home http://hostname/. Their default home page gives them access to the PDS.

The PDS home page provides the user with the options of creating a new database, querying a database, modifying an existing database, and deleting a database. When creating a new database, the user can enter the name of the database, the number of fields, and the width of each field. The PDS then creates a form which allows them to enter information into the database based upon their design. Once information has been entered, the query form gives the user a list of all fields available in the database and the option of deselecting certain ones. The form also allows the user to specify the sort order of the result and the search match criteria for the data. Once the user presses the submit button, the database query is issued. Actually, the query consists of a number of RDB scripts strung together properly. For example, the "row" command which specifies the search matches is piped to the "column" command which cuts out the user's selected columns. This is in turn piped to the "sorttbl" command and then the "ptbl" command which formats the result nicely with headers and tab separations adjusted for the field widths.
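The chain of filters can be illustrated in Python. The functions below only mimic the roles of the four stages named above (match rows, cut columns, sort, format); they are not the RDB scripts and do not reproduce their command syntax, and the sample table in the usage note is invented for the example.

```python
def select_rows(rows, field, value):
    """Keep the rows whose field equals the value (the role of "row")."""
    return [r for r in rows if r[field] == value]


def select_columns(rows, fields):
    """Keep only the chosen fields (the role of "column")."""
    return [{f: r[f] for f in fields} for r in rows]


def sort_table(rows, field):
    """Order the rows by one field (the role of "sorttbl")."""
    return sorted(rows, key=lambda r: r[field])


def format_table(rows, widths):
    """Lay out headers and rows padded to each field width (the role of "ptbl")."""
    fields = list(widths)
    lines = ["  ".join(f.ljust(widths[f]) for f in fields),
             "  ".join("-" * widths[f] for f in fields)]
    for r in rows:
        lines.append("  ".join(str(r[f]).ljust(widths[f]) for f in fields))
    return "\n".join(line.rstrip() for line in lines)


def query(rows, match_field, match_value, fields, sort_field, widths):
    """String the stages together in the same order as the pipeline in the text."""
    matched = select_rows(rows, match_field, match_value)
    trimmed = select_columns(matched, fields)
    ordered = sort_table(trimmed, sort_field)
    return format_table(ordered, {f: widths[f] for f in fields})
```

For example, given a list of dictionaries with "name", "office", and "skill" keys, query(people, "office", "ops", ["name", "skill"], "name", {"name": 6, "skill": 5}) returns a formatted table of the names and skills of everyone in the "ops" office, sorted by name. Because the columns are cut before sorting, the sort field must be one of the selected fields.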

The PDS is opening up new capabilities for the users. It promises to provide a simple way of setting up databases customized by the individual. By incorporating the PDS into the Web, users have a simple means of sharing their data with others in their team. Because it is used through a Web browser the users already know, the PDS will require very little training on new software.

4. Context-Sensitive, Hypertext Help

As mentioned previously, SAIC has developed a database retrieval browser called Screenwork. While the tool was designed to be extremely simple to use with an OSF/Motif graphical user interface, the software provides the user with a number of sophisticated options which need to be explained carefully.

In January 1994 we embarked on an effort to convert the entire user documentation and on-line help into HTML format suitable for use in Mosaic. The previous on-line help system was limited to plain hypertext, without the graphics or fonts available in HTML.

Figure 5: Context-Sensitive, Hypertext Help

How Context-Sensitive Help Works

When the user needs help at a particular point in using our software, they position the cursor over the area of interest and press the "Help" key. The software checks the user's environment to determine if "Mosaic" is in their path. If not, they get the old hypertext help. If so, the software starts Mosaic pointed to the correct page for help. In our case, each widget corresponds to a filename which is symbolically linked to the HTML file for the entire window.
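The lookup can be sketched in Python as follows. The help directory layout and function names are hypothetical, and the command line borrows the "-home" form shown in the PDS section; the exact invocation used by Screenwork is not stated in this report.

```python
import os
import shutil


def help_target(widget_name, help_dir):
    """Resolve a widget's link name to the HTML page for its whole window."""
    return os.path.realpath(os.path.join(help_dir, widget_name))


def help_command(widget_name, help_dir, browser="Mosaic"):
    """Return the command to launch, or None to fall back to the older help."""
    if shutil.which(browser) is None:
        return None
    # Assumed invocation: start the browser with the help page as its home page.
    return [browser, "-home", "file://" + help_target(widget_name, help_dir)]
```

Using one symbolic link per widget lets many widgets share a single window-level page without changing the existing help callbacks.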

The software already had a hypertext help system using a widget that provides the hypertext links. Every widget in the application had a help callback assigned to it which pointed to a page of hypertext. When we decided to move to using HTML and Mosaic for the hypertext help, we used the exact same callbacks, but instead of every widget calling a different page, we lumped most widgets together into a page for the entire window in the application. Thus, instead of getting help on the "Add" button, the user now gets help on the whole window, of which the "Add" button is a component. Also, if the user keeps Mosaic open and presses the help key again, the software signals Mosaic to point to the new page, making use of Mosaic's remote control feature.

Our users found that the new HTML-based hypertext help system provided them with a much better description of how to use the software. By using Mosaic, we made use of an interface that they were already accustomed to, eliminating the need for training.

5. Cross-Platform Image Retrieval System

Before the WWW technologies were mature, SAIC developed a client/server imagery database with a custom OSF/Motif front-end that allows the user to perform fielded queries to retrieve a set of matching images. This interface took several months to develop with several developers working on select pieces of the code. And it only ran on UNIX workstations supporting the X-Windows protocol.

Now that the WWW technologies have matured, the imagery server has been redesigned to allow any HTML-capable browser to query the same databases, browse the same matching images, and select the specific images of interest for further examination. This allows any user, running on PC/MS-Windows or Macintosh platforms in addition to UNIX workstations, to perform the same analytic tasks that previously required custom-built software. This saved the Government thousands of dollars in costs associated with porting the imagery client software to the other platforms and maintaining the three separate versions of software.

6. Technical Reports

On another contract, we were asked to perform technical research. Normally the results are provided in a hardcopy report at the end of the contract. This time, however, we delivered our reports to the Government in HTML format, allowing them to immediately post the information on their web server and share the results with their entire organization. Our report included an evaluation of several software products. Rather than simply describing each product, we were able to include screenshots of the products that enabled the customers to envision a demonstration of each product with their WWW browser software. Here, by creating our content in WWW format, we were able to provide a superior product and enable greater distribution.

7. Conclusion

The web provides more than a way to browse the Internet itself. For both government and commercial enterprises wishing to establish internal information discovery and distillation systems, the technologies available because of the World Wide Web provide a means of solving many problems rapidly. For customers that require heterogeneous platform solutions, the web provides a solution that eliminates the burden of having to develop software for each platform. Even some of the most routine technical reports can be transformed into something exciting to the customer by putting them on the web.

8. Acknowledgments

The author wishes to thank Jeffrey Scott for the initial thrust in writing this paper and each of the members of the SAIC Information Discovery and Distillation Systems Division for believing in the Web.

9. References

[1] RDB written by Walter V. Hobbs and available at ftp://rand.org/pub/RDB-hobbs/RDB-2.5k.tar.Z




10. Author Biography

Paul D. Boyer
Enterprise Software Consultant
Science Applications International Corporation
Mr. Boyer has over 10 years of experience in information systems development for the national defense community. He has acted in systems engineering, software design, software development, program management, and consulting roles in building systems which distill information from vast stores, and sources, of data.

Mr. Boyer is expert in the development of analytic software tools which assist users in the summarization, recognition, extraction, and exploitation of data. His responsibility is to help the national defense community in managing the discovery and distillation of information in electronic environments. He is expert in the use of the World Wide Web, WAIS, News, Tcl/Tk, and Perl and provides these solutions to his clients. He holds a BSEE from West Virginia Tech and an MSEE, specializing in Software Engineering, from The Johns Hopkins University.

pdboyer@c3i.saic.com