Thomas Boutell
Gerald Latter
Quest
Protein Database Center
Cold Spring Harbor Labs
Techniques for dynamic HTML and GIF document generation will be discussed, with particular emphasis on the REF52 2D gel database system and the gd dynamic GIF creation library. Strategies for fast user response without race conditions, such as the caching of documents in a virtual space and aggressive pre-caching of likely requests, will be presented. Problems of scalability to heavy load environments will be examined.
The usefulness of server-side dynamic document generation as a front-end to existing databases will be discussed. As an example, the REF52 2D gel database will be explored from a WWW technology standpoint. The usefulness of both visual and textual means of document navigation will be considered.
On the other hand, WWW presents its own difficulties. Since the protocol is stateless, users do not "log into" and "log off" the system of the information provider (IP). This complicates security and makes it difficult to create a seamlessly integrated presentation in which the user can make fine adjustments to a database query or other activity without repeating effort.
The limitation, then, becomes not how many platforms the prospective IP can support directly, but how many applications can be successfully delivered given the real and apparent limitations of the WWW.
Computers become essential to the study of gels very quickly, because the number of proteins present in a given cell is very high. In order to study them effectively, it is useful to be able to categorize, describe and display the proteins in a manner which permits many scientists at once to view them. The remainder of this paper will discuss the techniques used to achieve this goal through the World Wide Web.
The delivery of 2D gels through the WWW was pioneered by the SWISS-2DPAGE system [3], which employs several similar techniques, including the use of server-side scripts to visually navigate through gels.
However, this problem can be attacked by storing state in any of several ways. The newest popular technique is to store state in hidden fields in an HTML form, such as the following:
<FORM>
<INPUT TYPE="hidden" NAME="colorstate" VALUE="blackandwhite">
... Other items, including a SUBMIT button to request the next page ...
</FORM>
When the form is submitted, the hidden fields will also be included, although they are not visible to the user; this makes them convenient repositories for information about the preferences of the user.
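The round trip of state through hidden fields can be sketched in a few lines of Python. This is only an illustration, not part of the REF52 software: the function names hidden_fields and read_state are hypothetical, and the form is assumed to be submitted as an ordinary URL-encoded query string.

```python
from html import escape
from urllib.parse import parse_qs


def hidden_fields(state):
    """Render each entry of the state dictionary as a hidden INPUT element."""
    return "\n".join(
        '<INPUT TYPE="hidden" NAME="%s" VALUE="%s">'
        % (escape(name, quote=True), escape(value, quote=True))
        for name, value in sorted(state.items())
    )


def read_state(query_string):
    """Recover the state dictionary from a submitted form's query string."""
    # parse_qs returns a list of values per name; keep the first of each.
    return {name: values[0] for name, values in parse_qs(query_string).items()}


# The server reads the state sent with one request and writes it back
# into the form of the next page, so the preference survives.
state = read_state("colorstate=blackandwhite")
print(hidden_fields(state))
```

Because the state travels with every request, the server itself does not need to remember anything about the user between requests.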
In the REF52 gel database, a similar technique is used, although forms are not employed. (This was due to the lack of forms-capable browsers on certain platforms when the database was first put on the web.) Every document on the World Wide Web has a unique URL, which very often refers to the location of a file in the file system of the server. But this need not be the case. In the REF52 gel database, the beginning of the URL indicates to the server that the REF52 delivery software should be executed; but the remainder is parsed by the delivery software itself, in order to locate the desired information about the gel in a virtual document space.
The following example presents the URL of a particular view within the database, in this case the top-level view:
http://siva.cshl.org:8002/cgi-bin/REF52/optimage/spotset/Identified/q
The beginning of the URL, http://siva.cshl.org:8002/cgi-bin/REF52, specifies the location of our server and the location of a script which executes the delivery program for this particular database. The remainder identifies the information the user wants to see. This remaining portion is subsequently referred to as the locator.
The locator shown above, then, is /optimage/spotset/Identified/q.
There are two basic kinds of information contained within:
option settings, which can be applied to any view, and naming information,
which specifies what objects in the database are to be viewed. All
elements in the locator are separated by slashes, in keeping with
the filesystem-like structure of HTTP (hypertext transfer protocol) URLs.
Any number of options can be present. In locators employed by the Global Gel Navigator (the REF52 delivery software), options always appear first in the locator and begin with the letters "opt"; they can have a numeric portion or be simple booleans like "optimage", used above to indicate that the user does want images transmitted as part of the response.
The remainder of the locator is more firmly structured. The first element following the options specifies what type of view is being requested, here an overview of a spotset (a collection of proteins; in this example, the entire gel). The second typically specifies the particular instance of interest, here the spotset of all identified proteins.
The third element specifies the quadrant which the user is presently displaying. Here the lone "q" indicates an overview; "q2" would indicate the second quadrant; "q23" would indicate the lower-left corner of the second quadrant, and so on. This mechanism can be used to descend the document hierarchically by quarters. It strikes a balance between flexibility and the probability that a view will be requested by more than one user, which allows for the possibility of caching.
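The locator structure described above can be sketched in Python as follows. This is an illustrative reading of the text, not the Global Gel Navigator's actual parser. The function names are hypothetical, the option "optsize2" mentioned in the comments is an invented example of a numeric option, and the quadrant numbering (1 upper-left, 2 upper-right, 3 lower-left, 4 lower-right) is an assumption chosen to agree with "q23" meaning the lower-left corner of the second quadrant.

```python
def parse_locator(locator):
    """Split a locator into (options, view type, instance, quadrant path)."""
    elements = [e for e in locator.split("/") if e]
    options = {}
    # Options always come first and begin with the letters "opt".
    while elements and elements[0].startswith("opt"):
        name = elements.pop(0)[3:]
        # Peel off an optional trailing numeric portion, as in the
        # hypothetical option "optsize2".
        digits = ""
        while name and name[-1].isdigit():
            digits = name[-1] + digits
            name = name[:-1]
        options[name] = int(digits) if digits else True
    # Pad with None so that short locators still unpack cleanly.
    view_type, instance, quadrant = (elements + [None] * 3)[:3]
    # "q" alone is the overview; each further digit descends one quarter.
    path = [int(d) for d in quadrant[1:]] if quadrant else []
    return options, view_type, instance, path


def quadrant_rect(path, width, height):
    """Return (x, y, w, h) of the region reached by descending the path."""
    x, y, w, h = 0, 0, width, height
    for q in path:
        w, h = w // 2, h // 2
        if q in (2, 4):  # right-hand quarters (assumed numbering)
            x += w
        if q in (3, 4):  # lower quarters (assumed numbering)
            y += h
    return x, y, w, h


print(parse_locator("/optimage/spotset/Identified/q"))
print(quadrant_rect([2, 3], 1024, 1024))
```

Because each step of the path can take only four values, many users exploring the same gel will tend to request identical locators.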
The REF52 database provides numerous textual links, which are used to select particular sets of proteins, to select proteins themselves, and to select among options such as image size and color versus black-and-white display.
In addition, however, the image itself acts as a control,
taking advantage of the
However, locator URLs such as those shown above lend
themselves readily to a server-side caching scheme. By storing
the complete response to previously requested locator URLs
and throwing out old responses on a least-recently-used
basis, it is possible to quickly evaluate whether the
desired view has already been generated and stored in
the cache, and deliver it to the user immediately without
the need to calculate its actual contents.
The advantage of such a scheme is that, if there are in fact
views which are commonly requested, they will be served
quickly from the cache, while if there do not turn out to
be such views, nothing has been lost except for the
disk space used to store a fixed number of views.
This compares favorably to a system in which all
possible views are stored; the combinatorial
explosion of possible views, especially when various
options are applied to adjust the resulting display, will typically
render such an exhaustive approach impossible or
impractical.
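A least-recently-used cache keyed by locator can be sketched as below. This is a minimal in-memory illustration in Python, whereas the scheme described above stores responses on disk. The class name ViewCache and the generate callback are hypothetical.

```python
from collections import OrderedDict


class ViewCache:
    """Keep at most `capacity` responses, discarding the least recently used."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._views = OrderedDict()  # locator -> complete response

    def fetch(self, locator, generate):
        """Return the cached response, generating and storing it if absent."""
        if locator in self._views:
            # Mark the view as most recently used and serve it directly.
            self._views.move_to_end(locator)
            return self._views[locator]
        response = generate(locator)
        self._views[locator] = response
        if len(self._views) > self.capacity:
            # Throw out the view that has gone unused the longest.
            self._views.popitem(last=False)
        return response


cache = ViewCache(capacity=2)
page = cache.fetch("/optimage/spotset/Identified/q",
                   lambda loc: "<HTML>view for %s</HTML>" % loc)
print(page)
```

The fixed capacity bounds the storage cost, which is the property that makes this preferable to storing every possible view.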
A very large number of images potentially exist within the
REF52 database. It is possible to magnify the gel by selecting
three or more levels of quadrants; it is also possible to
display magnified views of numerous specific proteins.
When the highest level of quadrant magnification is reached,
specific proteins can be clicked upon.
Figure 1: An interior portion of REF52, magnified twice
In addition, the protein data itself is occasionally
changed as observations are made by the maintainers
of the database. Thus, exhaustive storage of the many possible
images is not a practical option.
Instead, Global Gel Navigator creates inline images on the fly.
In the case of the REF52 database, a large existing image
is used as the basis, and an appropriate portion is
cut out and magnified, after which information about
various known proteins is drawn on the image. In the case
of other databases under test, there is no basis image,
and the entire image is created on the fly based on
theoretical data.
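The cut-out-and-magnify step can be illustrated with a small Python sketch. It does not use gd and is not the Global Gel Navigator's code. It assumes the image is held as a list of rows of 8-bit color indexes and that magnification is by a whole-number factor using simple pixel replication. The function names are hypothetical.

```python
def crop_and_magnify(pixels, x, y, w, h, factor):
    """Cut the w-by-h region at (x, y) out of pixels and enlarge it.

    Each source pixel becomes a factor-by-factor block in the result.
    """
    out = []
    for row in pixels[y:y + h]:
        # Repeat each pixel horizontally...
        scaled = [p for p in row[x:x + w] for _ in range(factor)]
        # ...and repeat the widened row vertically.
        out.extend(list(scaled) for _ in range(factor))
    return out


def mark_spot(pixels, x, y, color):
    """Draw a one-pixel marker, standing in for protein annotations."""
    pixels[y][x] = color


basis = [[0, 1],
         [2, 3]]
view = crop_and_magnify(basis, 1, 0, 1, 2, 2)
mark_spot(view, 0, 0, 9)
print(view)
```

The basis image is left untouched; each request produces a fresh, annotated copy of only the region it needs.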
Since most images on the World Wide Web are delivered from
stored files, users expect a fast response to image requests.
But existing WWW systems that create images on the fly
typically do so slowly, often by creating a 24-bit image
in memory, writing it to a file and using the pbmplus
image manipulation utilities as a post-processing system.
Indeed, REF52 originally did this, using a custom addition
to the pbmplus utilities called "ppmfig" to draw additional
information on gel images.
However, in order to greatly enhance performance, the
gd dynamic GIF creation
library was created. gd is a C library which
creates GIF images quickly and efficiently. It achieves
this by taking advantage of the essentially simple 8-bit nature
of GIF images in its design, and by avoiding the use
of numerous distinct filter programs. gd is freely
available for use in other projects; see the URL
http://siva.cshl.org/gd/gd.html for details.
Since our applications have called for the delivery
of cropped, scaled portions of existing images with
additional information drawn on them, the features
of gd are stronger in these areas than in others.
In particular, polygon filling functions are missing,
although they are easily simulated through the
use of an additional color and the existing
flood fill functions. gd is expected to grow
in these areas, however, as well as in the area
of text drawing.
The final compression operation which produces a GIF image
unavoidably consumes processor time. Uncompressed image
formats exist, but they are not supported by most
browsers, which would defeat the purpose of using
the WWW for portability; and the delay of waiting for
them to be delivered across the Internet would be
even worse than the delay associated with compression.
However, there is an operation which can be accelerated:
the basis image which is to be cropped and magnified
can be stored in an uncompressed, otherwise GIF-like
format known as the gd file format. Support for this
format was introduced in gd as a means of avoiding
the time lag associated with uncompressing the basis
image; the cost is that the basis image must be
stored at its full byte-per-pixel size, but
since there are only a few basis images this is
acceptable in exchange for the performance gain.
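The trade-off can be made concrete with a sketch of an uncompressed, byte-per-pixel store. The layout below (a four-byte header holding width and height, followed by the raw pixels) is a hypothetical stand-in written in Python; it is not the actual gd file format, and the function names are invented for illustration.

```python
import struct


def dump_raw(pixels):
    """Serialize rows of 8-bit color indexes with no compression."""
    height = len(pixels)
    width = len(pixels[0]) if pixels else 0
    # Hypothetical header: width and height as big-endian 16-bit integers.
    header = struct.pack(">HH", width, height)
    return header + b"".join(bytes(row) for row in pixels)


def load_raw(data):
    """Rebuild the rows; no decompression step is needed."""
    width, height = struct.unpack(">HH", data[:4])
    body = data[4:]
    return [list(body[r * width:(r + 1) * width]) for r in range(height)]


image = [[0, 1, 2], [3, 4, 5]]
stored = dump_raw(image)
# Storage cost is one byte per pixel plus the small header.
print(len(stored))
```

Loading is a matter of slicing bytes, which is why reading the basis image in this form avoids the decompression delay at the price of disk space.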
It is possible to require authorization of every transaction
with the server, and for systems involving the exchange
of funds this may be the best solution to the security
problems surrounding annotations. But as a solution
involving only existing browsers, it is possible to include
a password as part of the URL. Since URLs can
be read over the shoulders of users, or packet-sniffed
from networks, this is not a sufficient solution for
cash transactions, but it is viable for applications in which
security is intended to prevent casual abuse of the system.
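A password carried in the URL might be checked along the following lines. This Python sketch is an assumption throughout: the path layout, with the password as the first element, and the function names are hypothetical, since the text does not specify how the password would be placed in the URL.

```python
import hmac


def split_password(path):
    """Split '/<password>/<rest>' into (password, '/<rest>')."""
    password, _, rest = path.lstrip("/").partition("/")
    return password, "/" + rest


def authorized(path, expected):
    """Check the password element against the expected value."""
    password, _ = split_password(path)
    # compare_digest takes time independent of where the strings differ.
    return hmac.compare_digest(password.encode(), expected.encode())


print(authorized("/s3cret/annotate/spot/12", "s3cret"))
```

As noted above, anyone who can see the URL can see the password, so a check of this kind only deters casual abuse.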
This does not mean that the other users must
wait for the first to download the view completely.
Before the actual delivery of the view begins,
the completed view is copied to a temporary file
and the lock is removed. But it does limit the
degree to which the Global Gel Navigator can take
advantage of multiprocessor systems to calculate
several views at once, and it would be desirable to
employ a more sophisticated caching algorithm to
avoid the need for such locks.
[2] Garrels, J.I., and Franza, B.R. J. Biol. Chem. 1989, 264, 5283-5298.
[3] Appel, R.D., Sanchez, J., Bairoch, A., Golaz, O., Miu, M.,
Vargas, J.R., and Hochstrasser, D.F., Electrophoresis 1993, 14, 1232-1238.
See also the URL
http://expasy.hcuge.ch/ch2d/ch2d-top.html.
[4] Monardo, P.J., Boutell, T., Garrels, J.I., and Latter, G.I., CABIOS 1994,
10, 137-143. See also the URL
http://siva.cshl.org/software.html.
Server-Side Caching: Attacking the Performance Problem
While the above scheme satisfies the user's desire for a seamless
environment, it appears to have a performance problem: every document
seemingly must be created anew each time it is requested, with
all the work being repeated for each new request.
Attacking the Inline Image Problem
Caching solves the problem of fast delivery of popular views.
However, there is still the question of reasonably fast inline image
creation and delivery when the view is not already cached.
Future Directions
Security
The REF52 database contains data which is meant to be viewed
by as many people as possible. As such, it does not currently
face security problems. However, we may eventually wish to
permit annotations through a forms interface, and such a
system would require a solution to the security question.
Remote Content Creation
At present, graphical editing and other activities that create
new content in the database require the use of the
Quest II software [4], a package which
is available only for the Sun Sparc platform; remote display
via the X Window System Protocol, while theoretically
an option, is far too slow to be practical due to the
inefficiency of using a low-level protocol in which
every action of the mouse requires a transaction. If the
capabilities of the World Wide Web were to broaden
to include a sufficient range of user interface
activities, while continuing to operate on a very
high level in which the user can submit a large
amount of information in one response to a form
(such as a group of selected regions from a displayed
image), it may become practical to actually create
graphical databases like REF52 over the WWW. At present,
the WWW is only sufficient as a mechanism for exploring
the database, and potentially for annotating objects
with text.
Scalability to High-Load Environments
As mentioned earlier, the Global Gel Navigator uses
a server-side caching scheme. Under the current
algorithm, maintaining the consistency of the cache
requires that, while a view is generated, the database
be "locked" through the creation of a lock file. This
prevents other users from generating views until the
first user's view has been completely calculated.
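The lock-file discipline described above can be sketched in Python as follows. This is a minimal illustration under assumptions, not the Global Gel Navigator's code: the function names, the polling interval and the generate callback are hypothetical. It relies on exclusive file creation being atomic on the local file system.

```python
import os
import time


def acquire_lock(path, poll=0.05):
    """Create the lock file, waiting while another process holds it."""
    while True:
        try:
            # O_EXCL makes creation fail if the file already exists.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            time.sleep(poll)
        else:
            os.close(fd)
            return


def release_lock(path):
    """Remove the lock file so that the next request may proceed."""
    os.remove(path)


def generate_locked(lock_path, generate, locator):
    """Calculate one view while holding the database lock."""
    acquire_lock(lock_path)
    try:
        return generate(locator)
    finally:
        # Release the lock even if generation fails.
        release_lock(lock_path)
```

Since only one process can hold the lock, views are calculated strictly one at a time, which is the limitation on multiprocessor systems that this section describes.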
Conclusion
In order to deliver the REF52 database in a seamless
fashion, the Global Gel Navigator overcomes several
challenges. The most fundamental relates
to design; the WWW is stateless, yet the database
is most profitably explored when small adjustments
can be made to queries. This is accommodated
via the delivery of state information
in URLs. Other challenges involve performance,
and are solved through the use of server-side
caching and a fully linked-in, efficient GIF
creation library. In the future, the Global Gel
Navigator will likely improve to take advantage
of forms and face the challenge of providing
security.
References
[1] Berners-Lee, T.J., Cailliau, R., Groff, J.F., and Pollermann, B.,
Electronic Networking: Research, Applications and Policy, 1992, 2, 52-58.
See also the URL http://info.cern.ch/.