How to Present Lots of Volatile Information on the World Wide Web

Donald Jennings, Peter Damon, Maia Good, Ryszard Pisarski

Hughes STX Corporation and NASA Goddard Space Flight Center


Abstract

When most people think of the information available via Mosaic on the World Wide Web they envision static document pages containing a few pretty images and hyperlinks to other static document pages. Occasionally these pages link to more exotic objects such as gopher archives and databases with WAIS server front ends, but for the most part hypertext documents consist of information that requires manual updating on a weekly, monthly or even yearly basis.

Creating and managing static Mosaic hypertext documents is a task well within the capabilities of most Web-literate people. What do you do, however, when given the task of providing your community with thousands of hypertext documents whose content changes on a daily basis? Although it might seem a difficult task, it is actually easy if you make use of a few simple tools offered by HTTP and Unix.

This paper focuses on the problem of serving large sets of volatile information on the World Wide Web. Emphasis is placed on our organization's experience building and maintaining a high volume, constantly changing hypertext document set that allows users to access processing status information for the ASCA x-ray satellite. We also discuss the Unix and HTTP software tools needed to build volatile information Web servers and provide some specific examples of their use.


I. Introduction

The World Wide Web (Berners-Lee, 1994a) has become an extremely popular and powerful mechanism for serving data on the Internet. Although its original intent was for dissemination of information within the High Energy Physics community (Berners-Lee, 1994b), its use has spread far beyond the scientific arena. Today the Web is being used by commercial ventures to advertise their products and services, by government agencies to promote policies, by scientific centers to aid in research and by academic institutions to educate and provide public outreach.

Although the types of information on the Web are wide and varied, most organizations use one of two basic methods to actually make their data Web accessible. The first method, static Web pages, is by far the most common. These documents are written or translated into HyperText Markup Language (HTML) (Andreessen, 1993) and linked into other static document sets. This method is a viable, time-effective means of Web information generation provided the number of documents in each set remains small (fewer than 100) and the document content remains relatively constant over time. Many organizations now use static Web documents to provide information about their charter, services, and capabilities to the outside world. Examples of such usage are plentiful.

The second major method of Web data access, especially useful when information resides in structured databases or in file archives, is to build a front-end query server to the separate information repository. The front-end server talks to the repository and returns the results of queries to the Web client. Although Wide Area Information Server, or WAIS (Kahle et al., 1992), search engines are currently a popular form of front-end server, other query servers such as the NaviSoft natural language server (McKee, 1994) and NEXOR's ALIWEB (Koster, 1994) are also available. Some good examples of Web servers that make use of front-end query engines are the Cardiff UK movie database (Cardiff, 1994), the NASA/HEASARC StarTrax interface (StarTrax, 1994), and the Human Genome Database server (GDB, 1994).

While these two methods conform to the types of information that many "on-Web" organizations wish to present to the world, there are many other instances where static pages and front-end engines do not work well. Consider those cases where one or more of the following is true: (1) the HTML document set contains hundreds or thousands of pages, (2) the document content changes on hourly time scales and (3) the data does not reside within structured databases or data archives. As we will demonstrate, the Web is still an excellent mechanism for providing information to the outside world even when one or more of the above criteria apply. A data set need not be simple (static pages) or highly organized (front-end query engines) to be effectively served on the World Wide Web.

In this paper we present one solution to the problem of making lots of rapidly changing information available on the World Wide Web. The solution involves building an automated system, based on Unix shell scripts, that allows large numbers of HTML documents to be created, maintained, and updated efficiently and accurately. In section II we give a real working example of a Web information system that automatically creates and updates thousands of HTML documents daily. In section III we discuss the Unix and HTTP tools on which the system is based. Section IV summarizes our experience building an automated Web information system and explains why we feel such systems are an important model for future World Wide Web applications. Finally, in section V specific examples of programs that use the Unix and HTTP tools are provided.


II. ASCA Data Center WWW server

The ASCA Data Center, part of the Astrophysics Data Facility at the NASA Goddard Space Flight Center, processes ASCA x-ray satellite observatory (Tanaka et al., 1994) raw telemetry data into usable scientific products, archives the data products and distributes them to ASCA guest observers. The guest observers, instrument teams and operation teams associated with ASCA often want to quickly and accurately know the status of observations being processed at the Data Center. Since thousands of observations and hundreds of gigabytes of data flow through the ASCA processing pipeline each year, the job of reporting on the exact status of each observation within the system for each individual query could be a very resource-consuming chore.

To provide our constituency with continuous ASCA processing status information we chose to build a World Wide Web server that creates "snapshots" of the processing pipeline system. The snapshots are generated by gathering status information from all the various subsystems of the processing pipeline on a daily basis, distilling it into an information kernel and converting the kernel into an HTML document set.


Figure 1. ASCA Data Center WWW Server Architecture.

Since the information that describes the state of the ASCA processing pipeline resides in several different forms across the system, a set of information gatherer routines is employed. These routines go out and fetch system status information from its various locations and condense it into a small set of distilled data files known as the status kernel.

Once the status kernel has been created another set of routines, known as conversion routines, takes over. The conversion routines read the kernel and pre-build most of the HTML document set. Other dynamic conversion scripts handle the creation of the remaining document set members as they are requested. In theory the entire HTML document set could be created dynamically as Web clients request its members. However, we find that for most of our HTML documents pre-building is more efficient. Creating all of the requested documents on-the-fly would cause significant wait times for the Web clients, and the rather convoluted process of converting the status kernel into organized HTML documents need only be done once after each execution of the gathering routines. In certain cases, such as when an entire log file is requested, dynamic document creation is still the best method of serving the Web client's request.

One of the key features of this Web server is automation. Without automation it would not be possible to build, organize and update the 1100 pre-built and 800 dynamically built HTML documents that currently comprise our growing document set. Automation is also important because the information we serve on the Web must be updated at least once every 24 hours to accurately reflect the state of the ASCA processing pipeline. Thus, the actions of gathering, distilling and transforming the processing status information into HTML must also be performed daily. Manually updating the entire Web server document set would be a tedious job consuming many Full Time Equivalents (FTEs) in precious and expensive human resources.
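As a concrete illustration of the automation, consider a single top-level script run nightly by cron. The sketch below is not our actual driver; the directory layout, script names and crontab schedule are assumptions chosen only to show how the gather-then-convert cycle can be wired together with standard Unix tools.

    #!/bin/ksh
    # nightly_update.ksh -- hypothetical top-level driver for the daily snapshot.
    # A crontab entry such as the following would run it every morning at 2 a.m.:
    #   0 2 * * * /usr/local/adc/bin/nightly_update.ksh

    BIN=/usr/local/adc/bin          # location of gather and convert scripts (assumed)
    KERNEL=/usr/local/adc/kernel    # status kernel directory (assumed)
    HTMLDIR=/usr/local/adc/html     # pre-built HTML document tree (assumed)

    # Step 1: run every gather routine to refresh the status kernel.
    for gatherer in $BIN/gather_*.ksh
    do
        $gatherer $KERNEL || echo "gather failed: $gatherer" >&2
    done

    # Step 2: convert the kernel into the pre-built HTML document set.
    for converter in $BIN/convert_*.ksh
    do
        $converter $KERNEL $HTMLDIR || echo "convert failed: $converter" >&2
    done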

Another key feature of this Web server is its overall design. It provides a template for a new class of Web server that relies upon neither static HTML pages nor highly organized information accessible by a front-end query engine (see section I). This Web server design handles numerous volatile HTML documents -- a task that would be prohibitively difficult using the static document server model. And, unlike servers based upon front-end query engines, this Web server can gather information from any source, even sources that are not structured databases or indexed document archives. Consequently, this Web server design demonstrates that large sets of HTML documents, composed of information that is neither easily accessible nor stable, are serviceable on the World Wide Web.

A third, and perhaps most important, key feature of this Web server is that it was constructed from scratch, with relatively little effort, using utilities bundled with Unix and the HyperText Transfer Protocol (HTTP) daemon (McCool, 1994a). This fact provides evidence that complex Web-based information systems are within the capabilities of most organizations to build (i.e., this stuff is easy). The following section contains details on how to build such a server using Unix and HTTP tools.


III. The Hammer and the Butter Knife: A Basic Web Server Toolkit

Building the kind of Web server described above requires the execution of only three actions: gathering status information from wherever it resides, converting that information into HTML documents, and creating the remaining documents dynamically as Web clients request them. If the server is to reside on a Unix platform that has HTTPD installed then all the software tools necessary to perform these actions are present. The tools in the toolkit are sed, awk, the Korn shell, the Bourne shell, and the Common Gateway Interface.

Sed and awk are very powerful standard Unix utilities that manipulate ASCII files. Both tools read in data files, change the input data based upon a set of pattern matching instructions and write the result to output files. The sed tool is primarily used for string substitution, whereas awk's major strength lies in its data extraction and formatting capabilities. When used to their fullest extent these two utilities can form the basis of a structured database system (Comer, 1982); however, simple applications of both tools are all that one needs to build a Web server such as the one presented here. A good starting reference for programming with sed and awk, in addition to the Unix man pages, is a book appropriately entitled "sed & awk" (Dougherty, 1990).
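For example, the two command lines below show typical single-purpose uses of each tool; the log file name and its field layout are invented for illustration.

    # sed: replace every occurrence of the string "PENDING" with "QUEUED" in a log file.
    sed 's/PENDING/QUEUED/g' pipeline.log > pipeline.fixed

    # awk: print the observation id (field 1) and date (field 2) of every record
    # whose third field is "FAILED".
    awk '$3 == "FAILED" { print $1, $2 }' pipeline.log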

The Korn and Bourne shells are convenient, albeit inelegant, languages for writing programs that use system utilities such as sed and awk. What makes them convenient is that sets of Unix commands may be combined into programs, known as scripts, and executed many times over. The shells also provide built-in and user-definable variables, simple programming structures and several file-oriented operations. The Korn shell, being a superset of the older Bourne shell, actually provides a superior programming environment; however, not all versions of Unix come with the Korn shell. If portability of the Web server is important, use the Bourne shell; if portability does not matter, use the Korn shell. "The KornShell Command and Programming Language" (Bolsky and Korn, 1989) and "The Unix Programming Environment" (Kernighan and Pike, 1984) are the definitive references for the Korn and Bourne shell languages, respectively.
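The short Bourne shell script below shows how little is needed to glue such commands together; the kernel file format and the FAILED status marker are assumptions made for the sake of the example.

    #!/bin/sh
    # count_failed.sh -- report how many records in a (hypothetical) status
    # kernel file are marked FAILED. The file name is given as the first argument.
    kernel_file=$1
    if [ ! -r "$kernel_file" ]
    then
        echo "cannot read $kernel_file" >&2
        exit 1
    fi
    failed=`grep -c FAILED "$kernel_file"`
    echo "There are $failed failed observations listed in $kernel_file"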

The Common Gateway Interface, or CGI (McCool, 1994b), allows HTTP-class Web servers to execute external programs, referred to as gateways, and return the results to the requesting client. Thus, hyperlinks in HTML documents may point to CGI-compliant programs on the server as easily as they point to other HTML documents; the only significant difference is that the server returns the results of the CGI-compliant program to the client in the first case, and a pre-existing hypertext document to the client in the second. CGI programs are commonly used in conjunction with HTML Forms (Mosaic, 1994) to process the information entered into the form and return some result. They also allow servers to dynamically create HTML documents.

The key to using CGI is understanding how information gets passed to CGI-compliant programs and how the programs must encapsulate the information they return to the client. When a server receives a request to execute a CGI-compliant program it first sets a number of environment variables whose values contain information pertaining to the request. The program may then extract the information it needs by querying the environment variables. CGI programs always return information on stdout. This information may take the form of a server directive or direct client output. The two most used server directives are the Location directive and the Content-type directive. The Location directive feeds the server the URL of another HTML document that it should return to the client, and the Content-type directive informs the server that output following the directive should be appropriately interpreted (as determined by the directive's value) and sent back to the client. Information that bypasses the server and goes to the client directly must be correctly formatted by the CGI program in order for the client to understand it; this makes server directives the easier of the two methods to use.
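The following sketch shows how small a shell-based gateway using a server directive can be. It redirects the client to one of two pre-built status pages depending on the query string; the host name, document paths and script name are hypothetical.

    #!/bin/sh
    # today.cgi -- hypothetical gateway that uses the Location directive to send
    # the client either today's or yesterday's pre-built status page.
    # A blank line must follow the directive so the server knows the header is complete.
    if [ "$QUERY_STRING" = "yesterday" ]
    then
        echo "Location: http://host.domain.gov/status/yesterday.html"
    else
        echo "Location: http://host.domain.gov/status/today.html"
    fi
    echo ""

A hyperlink of the form <A HREF="/cgi-bin/today.cgi?yesterday"> would invoke the gateway with the string "yesterday" placed in QUERY_STRING.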

Appendix A of this paper contains three examples of shell scripts that demonstrate how to make use of the above-mentioned tools. Please note that these scripts are not fancy do-all programs, and are meant only to help the novice Web programmer get started.


IV. Summary

Many servers connected to the World Wide Web provide information that exists in one of two basic forms: (1) static HTML document sets and (2) structured data repositories accessed by some type of front-end query engine. While these two methods work well for much of the information organizations wish to make available on the Web, there are other classes of information for which neither method applies. It is still possible, however, to make such information Web-accessible by devising new server strategies to meet the specific needs of the system.

One example of a Web server that employs simple techniques to handle a new class of information is the ASCA Data Center's WWW server. This server provides the ASCA community with status information on a complex data processing system by taking daily "snapshots" of the state of the system. The data which makes up the snapshot must be gathered from many different places and assembled into a small set of distilled files known as the status kernel. The kernel's contents are then converted into HTML documents in a process known as pre-building. Since pre-building the entire document set is not efficient, some documents are created on-the-fly as Web clients request them. Because the HTML document set contains large numbers of files whose content changes daily, automation plays a key role in its creation and maintenance.

Using the World Wide Web to provide processing status information to the ASCA community has turned out to be a very good move. Our organization likes the Web server because it frees us from answering frequent questions from our constituency and requires negligible human intervention to maintain. Even members of our group use the Web server to monitor the overall progress of the data processing pipeline, since it provides a complete and easily accessible picture of the system. The ASCA community likes using the Web server because it supplies information 24 hours a day, seven days a week, and they always know where to find the information they desire.

Although we are pleased with the design and performance of our Web server, advertising its existence and discussing its features is not the main focus of this paper. Based upon our experience, we wish to convey the idea that devising and building new types of Web information servers is not only possible -- it's easy.


V. Appendix A: Example Programs

The first example program comes from the ASCA Data Center server's gather routines. Written in Korn shell, it makes use of sed, awk and grep (a utility not discussed above) to read the contents of a certain type of processing log file, extract the desired information and write it to one of the status kernel's data files.

Example 1. Gather Routine.
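The Korn shell sketch below illustrates such a gather routine; the log file message format, the field positions and the kernel file name are assumptions made for illustration.

    #!/bin/ksh
    # gather_proclog.ksh -- sketch of a gather routine. It pulls the observation
    # id, processing date and exit status out of a (hypothetical) pipeline log
    # file and appends them to a status kernel data file.
    LOGFILE=$1                     # processing log file to read
    KERNEL=$2/proc_status.dat      # kernel data file to write (assumed name)

    # Keep only the lines that record a completed processing run, strip the
    # decorative prefix written by the pipeline, then write "obs_id date status"
    # records into the kernel.
    grep 'PROCESSING COMPLETE' $LOGFILE |
        sed 's/^.*PROCESSING COMPLETE: *//' |
        awk '{ print $1, $2, $3 }' >> $KERNEL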

This next example is of a conversion routine that reads from one of the status kernel's data files and translates its content into HTML output.

Example 2. Conversion Routine.
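Again, the listing below is a sketch of such a conversion routine; the kernel record layout, the file names and the link to a showlog gateway are assumptions.

    #!/bin/ksh
    # convert_proclog.ksh -- sketch of a conversion routine. It turns the kernel's
    # "obs_id date status" records into a pre-built HTML status listing in which
    # each observation id is a hyperlink to its dynamically created log page.
    KERNEL=$1/proc_status.dat      # kernel data file to read (assumed name)
    HTMLFILE=$2/proc_status.html   # pre-built document to write

    {
        echo "<HTML><HEAD><TITLE>ASCA Processing Status</TITLE></HEAD><BODY>"
        echo "<H1>ASCA Processing Status</H1>"
        echo "<PRE>"
        awk '{ printf "<A HREF=\"/cgi-bin/showlog?%s\">%s</A>  %-12s %s\n", $1, $1, $2, $3 }' $KERNEL
        echo "</PRE></BODY></HTML>"
    } > $HTMLFILE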

This last example shows a CGI-compliant gateway program. It reads the value of the QUERY_STRING environment variable set by the host's HTTP server, derives the name of a processing pipeline log file based upon the value of QUERY_STRING and sends the entire log file back to the Web client in HTML form.

Example 3. Dynamic HTML Document Creation Routine.
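The sketch below illustrates the idea, assuming one log file per observation in a single directory and an observation id passed in QUERY_STRING; the directory and naming convention are invented.

    #!/bin/sh
    # showlog -- sketch of a CGI-compliant gateway that returns an entire
    # pipeline log file to the client as a preformatted HTML document.
    LOGDIR=/usr/local/adc/logs     # log file directory (assumed)

    # Derive the log file name from the query string, discarding anything that
    # is not alphanumeric so the request stays well-behaved.
    obsid=`echo "$QUERY_STRING" | sed 's/[^A-Za-z0-9_]//g'`
    logfile=$LOGDIR/$obsid.log

    # The Content-type directive, followed by a blank line, tells the server that
    # the output which follows is HTML to be passed straight back to the client.
    echo "Content-type: text/html"
    echo ""

    echo "<HTML><HEAD><TITLE>Processing Log $obsid</TITLE></HEAD><BODY>"
    if [ -r "$logfile" ]
    then
        echo "<H1>Processing Log $obsid</H1>"
        echo "<PRE>"
        cat "$logfile"
        echo "</PRE>"
    else
        echo "<H1>No log file found for observation $obsid</H1>"
    fi
    echo "</BODY></HTML>"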


VI. Bibliography

Berners-Lee, Tim, (1994a). "World Wide Web Initiative", CERN - European Particle Physics Laboratory. http://info.cern.ch/hypertext/WWW/TheProject.html

Berners-Lee, Tim, (1994b). "WWW Policy", CERN - European Particle Physics Laboratory. http://info.cern.ch/hypertext/WWW/Policy.html

Andreessen, Mark, (1993). "A Beginner's Guide to HTML", National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign. http://www.ncsa.uiuc.edu/General/Internet/WWW/HTMLPrimer.html

Kahle, B., Morris, H., Davis, F., Tiene, K. (1992). "Wide Area Information Servers: An Executive Information System for Unstructured Files". Electronic Networking: Research, Applications and Policy, 2(1), 59-68.

McKee, Douglas, (1994). "Towards Better Integration of Dynamic Search Technology and the World Wide Web", First International Conference on the World Wide Web, May 25-26-27 1994, CERN, Geneva Switzerland. http://www1.cern.ch/PapersWWW94/doug.ps

Koster, Martijn, (1994). "Aliweb -- Archie-Like Indexing in the Web", First International Conference on the World Wide Web, May 25-26-27 1994, CERN, Geneva Switzerland. http://www1.cern.ch/PapersWWW94/aliweb.ps

Cardiff, (1994). http://www.cm.cf.ac.uk/Movies/moviequery.html

StarTrax, (1994). http://heasarc.gsfc.nasa.gov/StarTrax.html

GDB, (1994). http://gdbwww.gdb.org/gdbhome.html

Tanaka, Y., Inoue, H. and Holt, S.S., (1994). "The X-ray Satellite ASCA", PASJ, 46, L37.

McCool, Robert, (1994a). "NCSA httpd Overview", National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign. http://hoohoo.ncsa.uiuc.edu/docs/Overview.html

Comer, Doug, (1982). "The flat file system FFG: a database system consisting of primitives". Software -- Practice and Experience, November 1982.

Dougherty, Dale, (1990). "sed & awk". Published by O'Reilly and Associates, Sebastopol, CA.

Bolsky, Morris and Korn, David, (1989). "The KornShell Command and Programming Language". Published by Prentice-Hall, NJ.

Kernighan, Brian and Pike, Rob, (1984). "The Unix Programming Environment". Published by Prentice-Hall, NJ.

McCool, Robert, (1994b). "Introduction to the Common Gateway Interface", National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign. http://hoohoo.ncsa.uiuc.edu/cgi/intro.html

Mosaic, (1994). "Mosaic for X version 2.0 Fill-Out Form Support", National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign. http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/Docs/fill-out-forms/overview.html


Biographic Summaries

Donald Jennings is currently a Senior Scientific Programmer/Analyst assigned to the Astrophysics Data Facility (ADF) at NASA Goddard Space Flight Center. He received his Master of Science degree in Physics from Iowa State University (1991), and his Bachelor of Science degrees in Computer Science (1989) and Physics (1988) from the University of Missouri. His master's thesis is in gamma-ray astronomy and his current interests include research into extra-galactic gamma-ray emission, scientific data format utilization, and developing tools to present and retrieve information on the World Wide Web. Mr. Jennings is employed by Hughes STX Corporation. Email address: jennings@tcumsh.gsfc.nasa.gov

Peter Damon is Section Manager for the ROSAT, ASCA and XTE groups in the Astrophysics Data Facility at the NASA Goddard Space Flight Center. He received his Bachelor of Science degrees in Microbiology (1981) and Computer Science (1990) from the University of Maryland. Mr. Damon is employed by Hughes STX Corporation.

Maia Good is currently a Senior Systems Programmer at the Astrophysics Data Facility at the NASA Goddard Space Flight Center. Her software engineering experience is in the areas of graphical user interfaces, database management systems, telecommunications networks and software configuration management. Her education includes a Master of Science in Educational Technology from Lehigh University (1982) and a Bachelor's degree in Social Sciences from the Federal University of Rio de Janeiro in Brazil (1964). Ms. Good is employed by Hughes STX Corporation.

Ryszard Pisarski is the U.S. ROSAT Science Data Center Project Manager. He is also involved in the ASCA and XTE x-ray satellite missions. His research interests are in x-ray observations of supernova remnants. He received his PhD in Physics from Columbia University in 1984.