Providing Data on the Web: From Examples to Programs

Carlos A. Varela, Caroline C. Hayes
Department of Computer Science
University of Illinois at Urbana-Champaign
cvarela@cs.uiuc.edu, hayes@cs.uiuc.edu

URL: http://fiaker.ncsa.uiuc.edu:8080/WWW94-2/paper.html

Abstract

The World-Wide Web provides access to a global information universe using available technology [Berners-Lee et al. 1992]. In order to fully realize the benefits of this information system, we are developing a system, Zelig, to provide on-the-fly access to databases and dynamic information through effective user interfaces [Varela and Hayes 1994].

In this paper, we extend Zelig to generate code for performing conversions from fixed data formats into hypertext. Consequently, information providers only need to give examples of their current database reports and the desired hypertext to be generated for those particular examples. Zelig produces the program to extract relevant data from the reports and the schemata to drive the hypertext generation process. As an example, we include an interface to ph/qi, the CCSO nameserver software providing data for academic institutions around the world.

1. Introduction

The World-Wide Web offers easy access to a universe of information by providing links to documents stored on a world wide network of machines in a very simple and understandable fashion. Much of its success is due to the simplicity with which it allows users to provide, use and refer to information distributed geographically around the globe. Another important feature is its compatibility with other existing protocols, such as gopher, ftp, netnews and telnet. Furthermore, it provides users with the ability to browse multimedia documents independently of the computer hardware being used.

The World-Wide Web is based on the HyperText Transfer Protocol, HTTP, and the HyperText Markup Language, HTML. HTTP is a generic object-oriented stateless protocol to transmit information between servers and clients [Berners-Lee 1992]. HTML is a simple, yet powerful platform-independent document language [Berners-Lee and Connolly 1993].

When the documents to be published are dynamic, like those resulting from queries to databases, the hypertext needs to be generated. For this purpose, there are scripts, which are programs that perform conversions from different data formats into HTML on-the-fly. Even though these scripts may be simple for fixed data formats, providers need them to be able to publish their data on the Web. Furthermore, even basic changes to the data formats or the generated HTML imply changes to these scripts.

To overcome these problems, Zelig generates scripts that base their HTML generation on schemata [Varela 1994]. In this research we extended Zelig to additionally produce code for performing conversions from fixed data formats into HTML. There are two main stages in this conversion process: extracting database, record and field information from a traditional database report; and instantiating that categorized information, along with the query information, into a particular schema.

Using Zelig to provide access to dynamic information in a fixed format, providers only need to give examples of their current text reports and the desired hypertext to be generated for those particular examples. Our system, Zelig, produces the program to extract relevant data from the reports and the schemata to drive the hypertext generation process. Thus, it becomes easier to provide effective user interfaces to dynamic information in the World-Wide Web.

In the next section, we elaborate on the server-client model used by the World-Wide Web, and the functionality of scripts. In section 3, we highlight the problems faced when providing WWW access to dynamic data. In section 4, we explain the architecture of Zelig, our system that performs schema-based HTML generation. In section 5, we demonstrate the ideas presented with a gateway to the CCSO ph/qi nameserver databases. Finally, in section 6, we give some conclusions and results.

2. Background

2.1. The World-Wide Web: A Server-Client Model

The World-Wide Web consists of a network of computers which can act in two roles: as servers, providing information; or as clients, requesting information.


Fig 2.1. Server-Client Architecture [Berners-Lee and Cailliau 1992].

This communication is performed under the stateless HTTP protocol. In a stateless protocol, connections are created, processed and closed without keeping state information. The server actions depend on predefined methods such as GET, POST, PUT, and DELETE.

The resulting information can be served in different format types, and it is the client's responsibility to present it in a consistent and clear manner. The most common format is HTML, which contains the information and its logical structure but leaves out details particular to specific browser implementations.

It is important to note that a server can provide static documents to the clients, but it could also provide transparent access to databases or other information sources. In other words, the clients can also request specific queries that should be processed by the server. Scripts or gateways take care of this processing. These are programs that communicate with the WWW server software through a predefined interface. The most commonly used interface at present is NCSA's Common Gateway Interface, CGI [McCool 1993].

2.2. Scripts: WWW Gateways to Databases

Scripts are CGI compliant programs that act as clients to the applications owning the data and produce the corresponding hypertext for the requested information. They communicate with the WWW servers through an interface, in this case CGI, which establishes how to pass the information from the WWW client to the script and from the script back to the WWW server and subsequently to the WWW client.
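The flow just described can be illustrated with a minimal sketch of a CGI script in Python. It is not part of Zelig; it only assumes the CGI convention that a GET query arrives in the QUERY_STRING environment variable and that the script answers on standard output with a header, a blank line, and the HTML body. The lookup function, its sample entry, and the form variable "name" are hypothetical stand-ins for the application owning the data.

```python
import html
import os
import sys
from urllib.parse import parse_qs


def lookup(name):
    """Hypothetical data source standing in for the application that owns the data."""
    directory = {"varela": ["Carlos A. Varela, cvarela@cs.uiuc.edu"]}
    return directory.get(name.lower(), [])


def handle_request(environ):
    """Build a complete CGI response: header, blank line, HTML body."""
    # For GET requests, the server passes the form data in QUERY_STRING.
    query = parse_qs(environ.get("QUERY_STRING", ""))
    name = query.get("name", ["(none given)"])[0]
    rows = lookup(name)
    # Escape everything that is copied into the generated hypertext.
    items = "".join(f"<li>{html.escape(row)}</li>\n" for row in rows)
    body = f"<h1>Results for {html.escape(name)}</h1>\n<ul>\n{items}</ul>\n"
    return "Content-type: text/html\n\n" + body


if __name__ == "__main__":
    # The WWW server runs the script and relays its standard output to the client.
    sys.stdout.write(handle_request(os.environ))
```

Keeping the response construction in a function that takes the environment as an argument makes the script testable without a running WWW server.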


Fig 2.2. Purpose of a script and a WWW server [Berners-Lee and Cailliau 1992].

These scripts are written in any programming language (like C, C++ or PERL) and their main functions together with the WWW server are:

3. Providing Dynamic Data on the Web

After the short introduction in the previous section to the mechanisms underlying the Web, let us see why we want to automate the script creation process:

4. Zelig: From Examples to Programs

Figure 4.1 shows a general framework for Zelig. Scripts generate HTML reports based on instantiating schemata to the query info and the categorized database output. The schemata can be taken from a library, or generated from HTML report examples. The query info is created by the HTML Query Form, which is provided by the application designer. This information is given to the traditional database manager system which returns a report in a fixed format. This report is parsed and relevant information is extracted and categorized. The resulting HTML Report can contain links to more information on particular records, or even additional HTML Query Forms for more database processing.


Fig 4.1. A General Framework for Zelig.

In section 4.1, we see how the HTML instantiation process takes place. In section 4.2, we see how to extract Query Info from the HTML Query Form, how to extract Categorized Output from the traditional DB Report, and how to abstract a schema from user-given HTML Examples.

4.1. Schema-Based HTML Generation

In traditional HTML generation, user interfaces are created by scripts directly. This implies that changes to interfaces have to be performed at the level of the source code of the script. We present a methodology based on schemata, to allow designers to debug and maintain the user interfaces without directly changing the scripts.

In this section, we will explain the information that is provided by the schemata to the scripts for document generation. Then, we will give a description of ZHTML, the language to write these schemata, which is an enhancement to HTML incorporating directives for database interface generation.

4.1.1. Instantiating schemata

The scripts base their hypertext generation not only on the current parsed database query results, but also on existing ZHTML schemata. We can further categorize this information as:

4.1.2. ZHTML Language Description

A ZHTML schema is an HTML document which has been annotated with comments, which are used as directives for the script. These comments are parsed and executed by the script, and the resulting text replaces the original comment. This is performed at run-time, using the current database query results. Future work includes writing script code generators that start from these schemata.

The ZHTML comments are similar to HTML constructs. They are generally of the form:

There are several Zelig directives with different functionality, including: print a variable value, run an external function printing its output, conditionally include the ZHTML body, traverse all the current database records invoking Zelig recursively on the ZHTML body and traverse all the fields in a specific object.
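To make the idea of comment directives concrete, the following Python sketch expands two of the directive kinds listed above: printing a variable value and traversing the current database records. The comment syntax used here (`<!--print name-->`, `<!--foreach-->`, `<!--/foreach-->`) is invented for this illustration and should not be read as the actual ZHTML syntax; the function name expand is likewise an assumption.

```python
import html
import re

# Hypothetical directive syntax, invented for this sketch only.
FOREACH = re.compile(r"<!--\s*foreach\s*-->(.*?)<!--\s*/foreach\s*-->", re.S)
PRINT = re.compile(r"<!--\s*print\s+([\w-]+)\s*-->")


def expand(schema, variables, records=()):
    """Replace each directive comment with the text it produces."""

    def expand_loop(match):
        body = match.group(1)
        # Expand the loop body once per record; record fields shadow variables.
        return "".join(expand(body, {**variables, **record}) for record in records)

    text = FOREACH.sub(expand_loop, schema)
    # Unknown variables expand to the empty string; values are HTML-escaped.
    return PRINT.sub(lambda m: html.escape(str(variables.get(m.group(1), ""))), text)


schema = (
    "<h1><!--print query--></h1>\n"
    "<ul><!--foreach--><li><!--print name-->: <!--print phone--></li><!--/foreach--></ul>"
)
records = [
    {"name": "A Smith", "phone": "555-0100"},
    {"name": "B Smith", "phone": "555-0101"},
]
page = expand(schema, {"query": "smith"}, records)
```

Because the directives are ordinary HTML comments, the schema remains a valid HTML document that can be viewed and edited with standard tools; this sketch does not handle nested loops, conditionals, or external functions.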

Following are the main constructs of this Document Type, even though a formal Document Type Definition (like the DTD shown for HTML in [Berners-Lee and Connolly 1993]) is still in progress:

4.2. Automating the Database Report Extraction

4.2.1. Database Output Categorization

We concentrate on database management systems that produce reports in a fixed format. These reports usually contain tabular information, with application-level data (for example, the directory being listed, or the university being accessed for phone information) at the beginning and end of the report. In the middle, we often find repetitive information in a fixed structure: there is one entry for each record matching the original query. These entries are usually separated by a record separator, which allows us to differentiate among records. Finally, there is also a field separator, which allows us to divide record information into more specific detail.

In a file listing example, the first line has application-level information: the total space occupied by the directory. Then we see records (files) that in turn can be divided into fields (name, size, owner, date, ...). We guess where these separators lie and confirm them with the user, prompting her for any unknown information. Then we generate the data structure necessary to instantiate the ZHTML file when new queries are requested.
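A minimal Python sketch of this categorization step follows. It is an illustration under stated assumptions rather than Zelig's own algorithm: the guessing heuristic (pick the first candidate character that occurs equally often on every record line), the candidate list, the function names, and the sample report are all hypothetical, and the interactive confirmation with the user is reduced to returning None when no separator is found.

```python
def guess_field_separator(record_lines, candidates=("\t", "|", ",", ":")):
    """Return the first candidate that occurs equally often (at least once) on every line."""
    for sep in candidates:
        counts = {line.count(sep) for line in record_lines}
        if len(counts) == 1 and counts != {0}:
            return sep
    return None  # unknown: an interactive tool would ask the user here


def categorize(report, field_names, header_lines=1, record_sep="\n"):
    """Split a report into application-level lines and a list of field dictionaries."""
    lines = [line for line in report.split(record_sep) if line.strip()]
    header, record_lines = lines[:header_lines], lines[header_lines:]
    field_sep = guess_field_separator(record_lines)
    records = []
    for line in record_lines:
        # With field_sep None, str.split falls back to runs of whitespace.
        values = [value.strip() for value in line.split(field_sep)]
        records.append(dict(zip(field_names, values)))
    return {"application": header, "field_separator": field_sep, "records": records}


# Hypothetical file listing: one application-level line, then one record per file.
report = "total 4216\nreadme.txt|120|carlos\npaper.html|4096|hayes\n"
result = categorize(report, field_names=("name", "size", "owner"))
```

The resulting dictionary separates application-level data from per-record fields, which is the shape a schema needs in order to print a heading once and then traverse the records.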

4.2.2. Generating the Query Info from the HTML Form

HTML forms contain variables expressed as name and value pairs. For example, a form may contain three variables (directory, mask, and sort-by) which have default values and are bound to the user-given values when the form is submitted.

Note that the query info described above can contain information that is not processed by the DBMS but instead drives functionality provided additionally by Zelig, such as sorting by a specific field.
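As a sketch of how such bindings could be obtained, the Python fragment below decodes a submitted form with the standard urllib.parse.parse_qs function and separates the variables meant for the DBMS from those handled outside it. The default values and the choice of sort-by as the only variable handled outside the DBMS are assumptions made for this example.

```python
from urllib.parse import parse_qs

# Hypothetical defaults for the three form variables named in the text.
DEFAULTS = {"directory": ".", "mask": "*", "sort-by": "name"}
# Variables handled by the gateway itself rather than passed to the DBMS.
GATEWAY_ONLY = {"sort-by"}


def query_info(query_string):
    """Return (dbms_bindings, gateway_bindings) for a submitted form."""
    # parse_qs maps each name to a list of values; keep the first one.
    submitted = {name: values[0] for name, values in parse_qs(query_string).items()}
    bindings = {**DEFAULTS, **submitted}
    dbms = {k: v for k, v in bindings.items() if k not in GATEWAY_ONLY}
    gateway = {k: v for k, v in bindings.items() if k in GATEWAY_ONLY}
    return dbms, gateway


dbms, gateway = query_info("directory=%2Ftmp&sort-by=size")
```

Variables the user leaves untouched keep their defaults, so the database query can always be built from a complete set of bindings.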

In this subsection, we work on generating the database query from the form variable bindings and the given query example.

In our examples, we mainly have to create:

4.2.3. Generating a Schema from a Given Hypertext Example

Once we know how we want our hypertext to look for a particular example, i.e. we have HTML files for specific queries, we can abstract from them ZHTML schemata that can be instantiated for other queries as well.

We do this by querying the user when we are not sure whether the information parsed is relevant (needs to be categorized to subsequently be used by the schema instantiation algorithm) or is just a separator.
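One simple way to picture this abstraction step is sketched below in Python: every value known from the example query and report is replaced in the example HTML by a placeholder naming its variable. This is an illustrative simplification, not the procedure used by Zelig; the placeholder syntax `<!--print name-->` is invented for the sketch, and the user dialogue for uncertain cases is omitted.

```python
def abstract_schema(example_html, bindings):
    """Replace known example values in the HTML by named placeholders."""
    schema = example_html
    # Replace longer values first so a short value cannot split a longer one.
    for name, value in sorted(bindings.items(), key=lambda item: -len(item[1])):
        if value:  # an empty value would match everywhere
            schema = schema.replace(value, f"<!--print {name}-->")
    return schema


example = "<h1>Results for smith</h1><p>Phone: 555-0100</p>"
bindings = {"query": "smith", "phone": "555-0100"}
schema = abstract_schema(example, bindings)
```

Plain substring replacement is fragile: a value that also occurs in fixed text, or inside an already inserted placeholder, would be replaced as well. Such ambiguous cases are exactly where asking the user is needed.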

In the following section, we show an example illustrating a schema, and a couple of its possible instantiations depending on the database query.

5. A Running Example: The CCSO Phone Nameserver Database

The CCSO nameserver software provides a server-client model for accessing phone directory information from academic institutions [Dorner 1992]. The figures in this section have been created browsing HTML files in NCSA Mosaic for X [NCSA 1993].

The following is an HTML Query Form to access those databases:

The following link contains a schema for instantiating the categorized database information, once a query has been made. We will show two different instantiations depending on two different user queries for this same schema:


6. Conclusions

The success of a distributed information system depends heavily on the simplicity of generating, providing, using and referring to information. The World-Wide Web is composed of excellent protocols, tools and languages to perform these actions for static information. We have designed an extension to this technology to easily provide access to dynamic information, such as the result of queries to existing databases.

The functionality of our system, Zelig, was described in this paper. Its main improvements over previous technology include:

We provided a hyperlinked example giving WWW access to the CCSO ph/qi nameserver software. This gateway, running at NCSA, provides (as of September 1994) phone directory information for about 250 academic institutions around the world, and receives more than a thousand queries per day.

Ultimately, Zelig offers the user an effective way to generate fully customized interfaces to dynamic data, further closing the gap between information generation, provision and use.

Acknowledgements

Thanks to the NCSA Software Development Group for their helpful comments on this paper and their excellent research and working environment. Additional thanks to Professor Dershowitz, for his comments and motivating research [Dershowitz 1983].



References

[Berners-Lee 1992]
T. Berners-Lee. Hypertext Transfer Protocol Requirements. Internet Working Draft. CERN. Work in progress.
http://info.cern.ch/hypertext/WWW/Protocols/HTTP.html

[Berners-Lee et al. 1992]
T. Berners-Lee, R. Cailliau, J. Groff, B. Pollermann. World-Wide Web: The Information Universe. Electronic Networking: Research, Applications and Policy, 2(1), pp. 52-58, Meckler Publications, Westport CT, Spring 1992.
ftp://info.cern.ch/pub/www/doc/ENRAP_9202.ps

[Berners-Lee and Cailliau 1992]
T. Berners-Lee, R. Cailliau. World-Wide Web. Submitted to Computing in High Energy Physics 1992.
ftp://info.cern.ch/pub/www/doc/chep92www.ps

[Berners-Lee and Connolly 1993]
T. Berners-Lee, D. Connolly. Hypertext Markup Language: A Representation of Textual Information and Metainformation for Retrieval and Interchange. Internet Working Draft. CERN, Atrium Technology Inc. Work in progress.
http://info.cern.ch/hypertext/WWW/MarkUp/HTML.html

[Dershowitz 1983]
N. Dershowitz. The Evolution of Programs. Birkhauser, Boston, 1983.

[Dorner 1992]
S. Dorner. The CCSO Nameserver, Server-Client Protocol. Computing and Communications Services Office. University of Illinois at Urbana-Champaign. July 1992.

[McCool 1993]
R. McCool. Common Gateway Interface Overview. National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign. Work in progress.
http://hoohoo.ncsa.uiuc.edu/cgi/overview.html

[NCSA 1993]
National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign. NCSA Mosaic. A WWW Browser. Work in progress.
http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/Docs/help-about.html

[Varela 1994]
C. Varela. Zelig: Automating Database Provision for the World-Wide Web. Ninth International Symposium on Information Systems, Kobe, Japan, Oct 11-13, 1994. Invited talk.
http://fiaker.ncsa.uiuc.edu:8080/IT94.html

[Varela and Hayes 1994]
C. Varela, C. Hayes. Zelig: Schema-Based Generation of Soft WWW Database Applications. First International Conference on the World Wide Web, Geneva, Switzerland, May 25-29, 1994.
http://fiaker.ncsa.uiuc.edu:8080/WWW94.html


Carlos A. Varela (cvarela@cs.uiuc.edu)

Received his B.S. in Computer Science (CS) at the University of Illinois at Urbana-Champaign, where he is currently an M.S./Ph.D. student. His research interests include integrating formal methods of artificial intelligence in software engineering, especially information systems.

Carlos has also been a research assistant at the National Center for Supercomputing Applications (NCSA) since 1991. At NCSA he has worked on different projects including an alpha shapes visualizer (NCSA Walvis), a World-Wide Web browser (NCSA Mosaic for X/Windows), and a World-Wide Web server (NCSA httpd for Unix).

In the past, Carlos has been a Math and Computer Science teaching assistant for classes up to differential equations and information systems at the University of Los Andes, Bogota, Colombia. He has also been a consultant for Arthur Andersen & Co., and an Artificial Intelligence fellow at the Beckman Institute for the Advancement of Science and Technology.

Caroline C. Hayes (hayes@cs.uiuc.edu)

Received her B.S. in Math, M.S. in Knowledge-Based Systems, and Ph.D. in Robotics, all from Carnegie Mellon University.

Currently she is an assistant professor at the Department of Computer Science and at the Beckman Institute of the University of Illinois at Urbana-Champaign.

Her research interests include artificial intelligence, especially planning, design, abstraction, and knowledge-based systems, as well as computer-aided manufacturing and design. Professor Hayes is particularly interested in tools for evaluating, criticizing and optimizing designs in areas ranging from machined parts, intersection design and roofing design to software design.