Techniques for Server-Side Dynamic Document Generation

Thomas Boutell
Gerald Latter
Quest Protein Database Center
Cold Spring Harbor Labs

Abstract

Techniques for dynamic HTML and GIF document generation will be discussed, with particular emphasis on the REF52 2D gel database system and the gd dynamic GIF creation library. Strategies for fast user response without race conditions, such as the caching of documents in a virtual space and aggressive pre-caching of likely requests, will be presented. Problems of scalability to heavy load environments will be examined.

The usefulness of server-side dynamic document generation as a front-end to existing databases will be discussed. As an example, the REF52 2D gel database will be explored from a WWW technology standpoint. The usefulness of both visual and textual means of document navigation will be considered.

Introduction

The World Wide Web (WWW) is a promising mechanism for the mass presentation of data to a wide audience [1]. Formerly, it was necessary to develop software for each particular platform in order to deliver most forms of information. The World Wide Web instead provides a means of communication with any user who has a browser, and in practical terms such browsers are available on all major platforms.

On the other hand, the WWW presents its own difficulties. Since the protocol is stateless, users do not "log into" and "log off" the system of the information provider (IP). This complicates security and makes it difficult to create a seamlessly integrated presentation in which the user can make fine adjustments to a database query or other activity without repeating effort.

The limitation, then, becomes not how many platforms the prospective IP can support directly, but how many applications can be successfully delivered given the real and apparent limitations of the WWW.

The REF52 Database

The REF52 database (accessible at the URL http://siva.cshl.org:8002/cgi-bin/REF52/optimage/spotset/Identified/q) is a molecular biology database maintained at the Quest Center at Cold Spring Harbor Labs [2]. In particular, REF52 focuses on the technique of two-dimensional gel electrophoresis, a highly visual technique in which a distribution of the proteins in a cell is made by mechanical means, with pH (acidity) emphasized on one axis and Mw (molecular weight) on the other. The theory is that each protein will have a unique position when these two factors are both known. The resulting "gel" is then scanned into a computer for further study.

Computers become essential to the study of gels very quickly, because the number of proteins present in a given cell is very high. In order to study them effectively, it is useful to be able to categorize, describe and display the proteins in a manner which permits many scientists at once to view them. The remainder of this paper will discuss the techniques used to achieve this goal through the World Wide Web.

The delivery of 2D gels through the WWW was pioneered by the SWISS-2DPAGE system [3], which employs several similar techniques, including the use of server-side scripts to visually navigate through gels.

Tricks of the Trade: WWW Presentation of Visual Data

The first problem in the presentation of any database on the World Wide Web is that of statelessness. The WWW protocol itself is stateless; clients do not log in and log out, they connect to retrieve single documents, and connect again to retrieve additional documents as a completely separate transaction. Even inline images on the same page constitute completely separate conversations with the server. As a result, it is difficult to create the appearance of a seamless "conversation" with the server in which searches are gradually refined.

However, this problem can be attacked by storing state in any of several ways. The newest popular technique is to store state in hidden fields in an HTML form, such as the following:

<FORM>
<INPUT TYPE="hidden" NAME="colorstate" VALUE="blackandwhite">
... Other items, including a SUBMIT button to request the next page ...
</FORM>

When the form is submitted, the hidden fields will also be included, although they are not visible to the user; this makes them convenient repositories for information about the preferences of the user.
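
As a rough illustration of how a hidden field travels back to the server, the sketch below decodes a URL-encoded form submission in Python. The field name colorstate and its value come from the form above; the second field (quadrant) and the function name read_form_state are hypothetical and serve only to show that hidden and visible fields arrive in the same way.

```python
from urllib.parse import parse_qs

def read_form_state(encoded_body):
    """Decode a URL-encoded form submission into a dict of single values."""
    fields = parse_qs(encoded_body, keep_blank_values=True)
    # parse_qs returns a list of values per field; keep the last one of each.
    return {name: values[-1] for name, values in fields.items()}

# "quadrant" is a hypothetical second field added for illustration.
state = read_form_state("colorstate=blackandwhite&quadrant=q2")
print(state["colorstate"])
```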

In the REF52 gel database, a similar technique is used, although forms are not employed. (This was due to the lack of forms-capable browsers on certain platforms when the database was first put on the web.) Every document on the World Wide Web has a unique URL, which very often refers to the location of a file in the file system of the server. But this need not be the case. In the REF52 gel database, the beginning of the URL indicates to the server that the REF52 delivery software should be executed; the remainder is parsed by the delivery software itself in order to locate the desired information about the gel in a virtual document space.

The following example presents the URL of a particular view within the database, in this case the top-level view:

http://siva.cshl.org:8002/cgi-bin/REF52/optimage/spotset/Identified/q

The beginning of the URL, http://siva.cshl.org:8002/cgi-bin/REF52, specifies the location of our server and the location of a script which executes the delivery program for this particular database. The remainder identifies the information the user wants to see. This remaining portion is subsequently referred to as the locator.

The locator shown above, then, is /optimage/spotset/Identified/q. There are two basic kinds of information contained within: option settings, which can be applied to any view, and naming information, which specifies what objects in the database are to be viewed. All elements in the locator are separated by slashes, in keeping with the filesystem-like structure of HTTP (hypertext transfer protocol) URLs.

Any number of options can be present. In locators employed by the Global Gel Navigator (the REF52 delivery software), options always appear first in the locator and begin with the letters "opt"; they can have a numeric portion or be simple booleans like "optimage", used above to indicate that the user does want images transmitted as part of the response.

The remainder of the locator is more firmly structured. The first element following the options specifies what type of view is being requested, in this case an overview of a spotset (collection of proteins; the entire gel in this case). The second typically specifies the particular instance of interest, in this case the spotset of all identified proteins.
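
The structure described so far can be sketched as a small parser. This is an illustration under stated assumptions, not the Global Gel Navigator's actual code: the function name parse_locator is hypothetical, and the encoding of numeric options (trailing digits, as in the made-up option "optsize2") is an assumption, since the text says only that options may have a numeric portion.

```python
import re

def parse_locator(locator):
    """Split a locator into (options, view type, instance, remaining elements)."""
    elements = [e for e in locator.split("/") if e]
    options = {}
    index = 0
    # Options always come first and begin with the letters "opt".
    while index < len(elements) and elements[index].startswith("opt"):
        match = re.fullmatch(r"opt([A-Za-z]+)(\d*)", elements[index])
        if match is None:
            raise ValueError("malformed option: " + elements[index])
        name, digits = match.groups()
        # Assumed encoding: trailing digits are a number; no digits means a boolean.
        options[name] = int(digits) if digits else True
        index += 1
    view = elements[index] if index < len(elements) else None
    instance = elements[index + 1] if index + 1 < len(elements) else None
    rest = elements[index + 2:]
    return options, view, instance, rest

print(parse_locator("/optimage/spotset/Identified/q"))
# prints ({'image': True}, 'spotset', 'Identified', ['q'])
```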

The third element is used here to specify the quadrant which the user is presently displaying. The lone "q" indicates an overview; "q2" would indicate the second quadrant; "q23" would indicate the lower-left corner of the second quadrant, and so on. This mechanism can be used to descend the document hierarchically by quarters. It strikes a balance between flexibility and the probability that a view will be requested by more than one user, which allows for the possibility of caching.
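
To make the descent by quarters concrete, the following sketch converts a quadrant element into a rectangle within the full gel image. The numbering is an assumption: the text establishes only that 3 is the lower-left quarter, so the sketch assumes reading order (1 upper-left, 2 upper-right, 3 lower-left, 4 lower-right). The function name quadrant_rect and the 1024 by 1024 image size are hypothetical.

```python
def quadrant_rect(quadrant, width, height):
    """Return (left, top, width, height) of the region named by e.g. "q23"."""
    if not quadrant.startswith("q"):
        raise ValueError("quadrant element must start with 'q'")
    left, top = 0.0, 0.0
    w, h = float(width), float(height)
    for digit in quadrant[1:]:
        if digit not in "1234":
            raise ValueError("quadrant digits must be 1 through 4")
        # Each digit halves the region in both directions.
        w /= 2
        h /= 2
        n = int(digit) - 1
        left += (n % 2) * w    # assumed: even digits are the right-hand column
        top += (n // 2) * h    # assumed: digits 3 and 4 are the lower row
    return left, top, w, h

print(quadrant_rect("q", 1024, 1024))    # the overview covers the whole image
print(quadrant_rect("q23", 1024, 1024))  # lower-left quarter of quadrant 2
```

Each added digit quarters the area, so the number of distinct views at a given depth stays small (4, 16, 64, and so on), which is what keeps repeated requests for the same view likely.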

Links as Controls

When the user selects a link in the top-level REF52 document, most of the options and locator information present in the parent document are present in the URL, with variations appropriate to generate the new view. In this manner, the impression is created of a seamless connection in which links act as "controls" modifying the previous view.
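
One way to picture links as controls is a helper that derives each child locator from the parent's. The sketch below is illustrative only: the function name derive_locator and its parameters are hypothetical, and the idea that dropping "optimage" requests a text-only view is an assumption based on the description of that option.

```python
def derive_locator(parent, set_options=(), clear_options=(), last=None):
    """Build a child locator by varying the parent's options or final element."""
    elements = [e for e in parent.split("/") if e]
    split = 0
    # Leading elements that begin with "opt" are options; the rest is naming.
    while split < len(elements) and elements[split].startswith("opt"):
        split += 1
    options, naming = elements[:split], elements[split:]
    options = [o for o in options if o not in clear_options]
    options += [o for o in set_options if o not in options]
    if last is not None and naming:
        naming[-1] = last
    return "/" + "/".join(options + naming)

parent = "/optimage/spotset/Identified/q"
zoom_link = derive_locator(parent, last="q2")
text_only_link = derive_locator(parent, clear_options=("optimage",))
print(zoom_link)
print(text_only_link)
```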

The REF52 database provides numerous textual links, which are used to select particular sets of proteins, to select proteins themselves, and to select among options such as image size and color versus black-and-white display.

In addition, however, the image itself acts as a control, taking advantage of the capability of WWW browsers to return a new locator to the server requesting a further magnification of a particular quadrant of the gel image. This capability is provided in addition to, not instead of, textual controls; all WWW developers should remember that not all users have graphics capabilities.
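
The step from a click on the image to a new locator might look like the following sketch. It assumes that the server receives the pixel coordinates of the click and that quadrants are numbered in reading order (1 upper-left through 4 lower-right); only the position of quadrant 3 is confirmed by the text. The function name quadrant_for_click is hypothetical.

```python
def quadrant_for_click(current, x, y, width, height):
    """Append the digit of the clicked quarter to the current quadrant element."""
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError("click lies outside the image")
    column = 1 if x >= width / 2 else 0
    row = 1 if y >= height / 2 else 0
    # Assumed numbering: 1 upper-left, 2 upper-right, 3 lower-left, 4 lower-right.
    return current + str(row * 2 + column + 1)

first = quadrant_for_click("q", 300, 100, 400, 400)    # upper-right of the overview
second = quadrant_for_click(first, 50, 350, 400, 400)  # lower-left of that quadrant
print(first, second)
```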

Server-Side Caching: Attacking the Performance Problem

While the above scheme satisfies the user's desire for a seamless environment, it appears to have a performance problem: every document seemingly must be created anew each time it is requested, with all the work being repeated for each new request.

However, locator URLs such as those shown above lend themselves readily to a server-side caching scheme. By storing the complete response to previously requested locator URLs and throwing out old responses on a least-recently-used basis, it is possible to quickly evaluate whether the desired view has already been generated and stored in the cache, and deliver it to the user immediately without the need to calculate its actual contents.

The advantage of such a scheme is that, if there are in fact views which are commonly requested, they will be served quickly from the cache, while if there do not turn out to be such views, nothing has been lost except for the disk space used to store a fixed number of views. This compares favorably to a system in which all possible views are stored; the combinatorial explosion of possible views, especially when various options are applied to adjust the resulting display, will typically render such an exhaustive approach impossible or impractical.
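
The caching scheme can be sketched in a few lines. This is a minimal in-memory illustration, not the Global Gel Navigator's implementation, which stores responses on disk; the class name ViewCache, the render function, and the short locators are all hypothetical.

```python
from collections import OrderedDict

class ViewCache:
    """Keep a fixed number of complete responses, keyed by locator."""

    def __init__(self, capacity, generate):
        self.capacity = capacity
        self.generate = generate      # called only when a view is not cached
        self.views = OrderedDict()    # ordered from least to most recently used

    def fetch(self, locator):
        if locator in self.views:
            self.views.move_to_end(locator)   # mark as most recently used
            return self.views[locator]
        response = self.generate(locator)
        self.views[locator] = response
        if len(self.views) > self.capacity:
            self.views.popitem(last=False)    # discard the least recently used
        return response

calls = []

def render(locator):
    calls.append(locator)             # record every real generation
    return "view of " + locator

cache = ViewCache(2, render)
for locator in ["/a", "/b", "/a", "/c", "/a"]:
    cache.fetch(locator)
print(calls)
```

In this run the view "/a" is generated once and then served from the cache twice, while "/b", the least recently used entry, is discarded when "/c" arrives.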

Attacking the Inline Image Problem

Caching solves the problem of fast delivery of popular views. However, there is still the question of reasonably fast inline image creation and delivery when the view is not already cached.

A very large number of images potentially exist within the REF52 database. It is possible to magnify the gel by selecting three or more levels of quadrants; it is also possible to display magnified views of numerous specific proteins. When the highest level of quadrant magnification is reached, specific proteins can be clicked upon.

Figure 1: An interior portion of REF52, magnified twice

In addition, the protein data itself is occasionally changed as observations are made by the maintainers of the database. Thus, exhaustive storage of the many possible images is not a practical option.

Instead, Global Gel Navigator creates inline images on the fly. In the case of the REF52 database, a large existing image is used as the basis, and an appropriate portion is cut out and magnified, after which information about various known proteins is drawn on the image. In the case of other databases under test, there is no basis image, and the entire image is created on the fly based on theoretical data.
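
The cut-out-and-magnify step can be illustrated on a small grid of color indices. The sketch below is plain Python and does not use or reproduce gd's API; it assumes magnification by an integer factor through simple pixel replication, and the function name crop_and_magnify is hypothetical.

```python
def crop_and_magnify(pixels, left, top, width, height, factor):
    """Cut a rectangle out of a 2D grid of color indices and enlarge it."""
    result = []
    for y in range(top, top + height):
        row = []
        for x in range(left, left + width):
            # Repeat each source pixel horizontally.
            row.extend([pixels[y][x]] * factor)
        # Repeat each finished row vertically.
        for _ in range(factor):
            result.append(list(row))
    return result

basis = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
# Upper-right quarter of the basis image, doubled in size.
view = crop_and_magnify(basis, 2, 0, 2, 2, 2)
for row in view:
    print(row)
```

Annotations such as protein markers would then be drawn onto the enlarged grid before it is compressed and sent.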

Since most images on the World Wide Web are delivered from stored files, users expect a fast response to image requests. But existing WWW systems that create images on the fly typically do so slowly, often by creating a 24-bit image in memory, writing it to a file and using the pbmplus image manipulation utilities as a post-processing system. Indeed, REF52 originally did this, using a custom addition to the pbmplus utilities called "ppmfig" to draw additional information on gel images.

However, in order to greatly enhance performance, the gd dynamic GIF creation library was created. gd is a C library which creates GIF images quickly and efficiently. It achieves this by taking advantage of the essentially simple 8-bit nature of GIF images in its design, and by avoiding the use of numerous distinct filter programs. gd is freely available for use in other projects; see the URL http://siva.cshl.org/gd/gd.html for details.

Since our applications have called for the delivery of cropped, scaled portions of existing images with additional information drawn on them, the features of gd are stronger in these areas than in others. In particular, polygon filling functions are missing, although they are easily simulated through the use of an additional color and the existing flood fill functions. gd is expected to grow in these areas, however, as well as in the area of text drawing.

The final compression operation which produces a GIF image unavoidably consumes processor time. Uncompressed image formats exist, but they are not supported by most browsers, which would defeat the purpose of using the WWW for portability; and the delay of waiting for them to be delivered across the Internet would be even worse than the delay associated with compression.

However, there is an operation which can be accelerated: the basis image which is to be cropped and magnified can be stored in an uncompressed, otherwise GIF-like format known as the gd file format. Support for this format was introduced in gd as a means of avoiding the time lag associated with uncompressing the basis image; the cost is that the basis image must be stored at its full byte-per-pixel size, but since there are only a few basis images this is acceptable in exchange for the performance gain.

Future Directions

Security

The REF52 database contains data which is meant to be viewed by as many people as possible. As such, it does not currently face security problems. However, we may eventually wish to permit annotations through a forms interface, and such a system would require a solution to the security question.

It is possible to require authorization of every transaction with the server, and for systems involving the exchange of funds this may be the best solution to the security problems surrounding annotations. But as a solution involving only existing browsers, it is possible to include a password as part of the URL. Since URLs can be read over the shoulders of users, or packet-sniffed from networks, this is not a sufficient solution for cash transactions, but it is viable for applications in which security is intended to prevent casual abuse of the system.

Remote Content Creation

At present, graphical editing and other activities that create new content in the database require the use of the Quest II software [4], a package which is available only for the Sun SPARC platform. Remote display via the X Window System protocol, while theoretically an option, is far too slow to be practical, due to the inefficiency of using a low-level protocol in which every action of the mouse requires a transaction. If the capabilities of the World Wide Web were to broaden to include a sufficient range of user interface activities, while continuing to operate on a very high level in which the user can submit a large amount of information in one response to a form (such as a group of regions selected from a displayed image), it may become practical to actually create graphical databases like REF52 over the WWW. At present, the WWW is only sufficient as a mechanism for exploring the database, and potentially for annotating objects with text.

Scalability to High-Load Environments

As mentioned earlier, the Global Gel Navigator uses a server-side caching scheme. Under the current algorithm, maintaining the consistency of the cache requires that, while a view is generated, the database be "locked" through the creation of a lock file. This prevents other users from generating views until the first user's view has been completely calculated.

This does not mean that the other users must wait for the first to download the view completely. Before the actual delivery of the view begins, the completed view is copied to a temporary file and the lock is removed. But it does limit the degree to which the Global Gel Navigator can take advantage of multiprocessor systems to calculate several views at once, and it would be desirable to employ a more sophisticated caching algorithm to avoid the need for such locks.
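
The locking sequence described above can be sketched as follows. This is an illustration under assumptions rather than the actual algorithm: the function name generate_locked, the lock file name, and the sample view are hypothetical, and a real server would wait and retry when the lock is already held instead of failing.

```python
import os
import tempfile

def generate_locked(lock_path, generate):
    """Generate a view under an exclusive lock file; return a temp file path."""
    # O_CREAT | O_EXCL fails if the file exists, so creating it takes the lock.
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        os.close(fd)
        view = generate()
        # Copy the finished view to a temporary file before unlocking, so that
        # delivery to the user does not hold up other requests.
        with tempfile.NamedTemporaryFile("w", delete=False, suffix=".html") as out:
            out.write(view)
            temp_path = out.name
    finally:
        os.remove(lock_path)          # release the lock
    return temp_path

workdir = tempfile.mkdtemp()
lock = os.path.join(workdir, "ref52.lock")    # hypothetical lock file name
path = generate_locked(lock, lambda: "<HTML>view</HTML>")
with open(path) as handle:
    delivered = handle.read()
print(delivered)
```

Because only one process can hold the lock file, view generation is serialized even on a multiprocessor machine, which is the scalability limit noted above.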

Conclusion

In order to deliver the REF52 database in a seamless fashion, the Global Gel Navigator overcomes several challenges. The most fundamental relates to design; the WWW is stateless, yet the database is most profitably explored when small adjustments can be made to queries. This is accommodated via the delivery of state information in URLs. Other challenges involve performance, and are solved through the use of server-side caching and a fully linked-in, efficient GIF creation library. In the future, the Global Gel Navigator will likely improve to take advantage of forms and face the challenge of providing security.

References

[1] Berners-Lee, T.J., Cailliau, R., Groff, J.F., and Pollermann, B., Electronic Networking: Research, Applications and Policy, 1992, 2, 52-58. See also the URL http://info.cern.ch/.

[2] Garrels, J.I., and Franza, B.R. J. Biol. Chem. 1989, 264, 5283-5298.

[3] Appel, R.D., Sanchez, J., Bairoch, A., Golaz, O., Miu, M., Vargas, J.R., and Hochstrasser, D.F., Electrophoresis 1993, 1232-1238. See also the URL http://expasy.hcuge.ch/ch2d/ch2d-top.html.

[4] Monardo, P.J., Boutell, T., Garrels, J.I., and Latter, G.I., CABIOS 1994, 10, 137-143. See also the URL http://siva.cshl.org/software.html.