The Web as a Computational Engine for Chemistry and Molecular Biology.

Peter C. FitzGerald and Robert A. Pearlstein

Computational Molecular Biology Section, Division of Computer Research and Technology, National Institutes of Health, Bethesda, Maryland 20892

Abstract
The functionality of the World Wide Web and Mosaic as an information distribution system has been amply demonstrated during the past year. With the introduction of FORMS, however, Web clients such as Mosaic have acquired the ability to act as "universal front-ends", offering user-friendly interfaces to an almost limitless variety of remote computational applications. The main constraint of this approach is that a given computational task must be capable of being initiated from a defined set of input parameters and data, and must require no further user interaction. While this limitation is significant, many computational tasks can be addressed within it. To determine the feasibility of Web/Mosaic acting as a "universal front-end", we have developed a number of prototypes which address specific tasks in Computational Chemistry and Computational Molecular Biology.

Introduction
The Intramural Research Program of the National Institutes of Health (NIH) is one of the world's leading biomedical research establishments. The NIH consists of 24 separate organizational units housed in over 70 buildings, and has over 4,000 doctoral level scientists involved in more than 2,000 research projects. The computing environment serving this sizable community is a hybrid of centralized and distributed hardware and software resources. In this type of scientific research environment, minimal time is available for developing computer skills. Thus, among the biggest barriers to the use of scientific software are the lack of appropriate computer training and the difficulty of locating the appropriate application for a particular problem. Additionally, much leading-edge academic scientific software is underutilized because such programs typically have poor user interfaces (e.g. command-line or script driven) and little or no documentation, which results in a steep learning curve for the average user. However, coupling such programs to Web/Mosaic technology via a FORM interface offers the potential to greatly improve the accessibility of centrally maintained computer resources to users of all backgrounds.

Our ultimate goal is to develop a package of useful utilities which specifically addresses the needs of the NIH intramural scientific community and, secondarily, those of the scientific research community at large. Described below are the operational and prototypical Web/Mosaic resources which we have developed to provide this audience with simple, fast, user-friendly access to a wide variety of scientific computer applications.

Methodology
We have used NCSA's httpd server (v 1.3) and custom-built CGI programs, written in C, to provide the link between the server and the processing program. The programs we have interfaced to the Web are highly varied, ranging from commercial software to public domain code written in C or Fortran, as well as C programs and Perl scripts developed in-house.
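To make the gateway step concrete, the following is a minimal sketch of what such a CGI program does: it decodes the fields submitted by a FORM and returns a reply to the server. It is written in Python for brevity, whereas the programs described here were written in C, and the field name `sequence` is hypothetical. It illustrates the general CGI convention rather than the code behind these resources.

```python
import os
import sys
from urllib.parse import parse_qs


def decode_form(encoded: str) -> dict:
    """Decode an application/x-www-form-urlencoded string into {field: value}."""
    fields = parse_qs(encoded, keep_blank_values=True)
    # parse_qs returns a list per field; this sketch keeps the first value only.
    return {name: values[0] for name, values in fields.items()}


def read_form(environ, stdin) -> dict:
    """Fetch the form data the way a CGI program receives it from the server."""
    if environ.get("REQUEST_METHOD", "GET").upper() == "POST":
        # POST: the server passes the encoded form on standard input.
        # (Reading characters rather than bytes is adequate for ASCII form data.)
        length = int(environ.get("CONTENT_LENGTH") or 0)
        return decode_form(stdin.read(length))
    # GET: the encoded form arrives in the QUERY_STRING environment variable.
    return decode_form(environ.get("QUERY_STRING", ""))


def cgi_response(body: str, content_type: str = "text/plain") -> str:
    """A CGI reply is a header block, a blank line, then the document itself."""
    return f"Content-Type: {content_type}\n\n{body}"


if __name__ == "__main__":
    form = read_form(os.environ, sys.stdin)
    # "sequence" is a hypothetical field name used only for this sketch.
    sequence = form.get("sequence", "")
    sys.stdout.write(cgi_response(f"Received {len(sequence)} characters.\n"))
```

In a real gateway the decoded values would be validated and handed to the processing program instead of merely being counted.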

Approach
In developing a package of useful Web/Mosaic-based utilities, our initial approach has been to select specific computational tasks which by their nature:

require the input of a "relatively" small amount of text-based data (data that can be directly entered or pasted into a Mosaic text box)
accept defined values for operational variables (via Mosaic buttons or input boxes)
return as output either text, an image, or binary data in a form which can be viewed/manipulated by a user-specified local "viewer"
require "relatively" little CPU time on a central server. We arbitrarily selected 5 minutes as the upper bound on acceptable CPU time (clearly, the power of the server dictates the acceptability of a task).

We were also heavily prejudiced towards tasks currently addressed by software which could run, or could easily be modified to run, in "batch mode", and which could therefore be spawned by a CGI script/program as a non-interactive process.
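The two operational constraints above, non-interactive execution and a bounded CPU budget, can be sketched as follows. This is an illustration only, written in Python rather than the C used for the actual gateways, and it assumes a Unix server where the `resource` module is available. The function name `run_batch` is hypothetical, and the child process here is simply another Python interpreter standing in for a real application.

```python
import resource
import subprocess
import sys

CPU_LIMIT_SECONDS = 5 * 60  # the arbitrary 5-minute ceiling mentioned above


def run_batch(argv, cpu_seconds=CPU_LIMIT_SECONDS):
    """Run a program non-interactively under a CPU-time cap (Unix only).

    Returns (exit_status, output_text). A negative exit status means the
    child was terminated by a signal, for example after exceeding the cap.
    """
    def apply_limit():
        # Runs in the child just before the program starts. Only the soft
        # limit is lowered, and never above the existing hard limit.
        _, hard = resource.getrlimit(resource.RLIMIT_CPU)
        limit = cpu_seconds if hard == resource.RLIM_INFINITY else min(cpu_seconds, hard)
        resource.setrlimit(resource.RLIMIT_CPU, (limit, hard))

    done = subprocess.run(
        argv,
        stdin=subprocess.DEVNULL,   # batch mode: no prompts can be answered
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        preexec_fn=apply_limit,
    )
    return done.returncode, done.stdout


if __name__ == "__main__":
    status, output = run_batch([sys.executable, "-c", "print('finished')"])
    print(status, output, end="")
```

Note that the cap applies to CPU time, not elapsed time, so a job that merely waits (for example on a slow disk) is not cut off by it.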

Resource Development
In developing tools for the Web/Mosaic environment, we have concentrated on the related areas of Computational Chemistry and Computational Molecular Biology. While our primary target audience is the NIH intramural scientific community, our goal is to make these resources as widely available as possible within the limits imposed by licensing agreements and institutional policy. Specifically, the applications we have developed are aimed at providing:

Access to molecular structure data for DNA, proteins and small organic molecules.
- A search engine for the Brookhaven Protein Data Bank (PDB) with the capability to graphically display three-dimensional structures.
- Cambridge Structural Database searching and three-dimensional structure viewing.
Analysis of primary protein sequence data.
- Searching for sequence homologies against protein sequence databases.
- Protein Secondary Structure Prediction.
- Protein Analysis: the calculation and prediction of various physical properties.
- Identification of potential protein signal sequences.
Analysis of primary DNA sequence data.
- PCR Primer prediction.
- Mapping of restriction enzyme sites.

Included among these resources are tools which range from those closely related to more traditional database searching problems (involving little computation but large data storage capacity) to those which are purely computational.
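As an illustration of the purely computational end of this range, the sketch below maps restriction enzyme sites in a pasted DNA sequence. It is not the program behind the resource listed above; it is a simplified stand-in, written in Python, that searches for the well-known recognition sequences of three common enzymes. The function name `map_sites` is hypothetical. Because all three sites are palindromic, scanning one strand is sufficient here.

```python
# Recognition sequences (5'->3') for three common restriction enzymes.
RECOGNITION_SITES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
}


def map_sites(dna, sites=RECOGNITION_SITES):
    """Return {enzyme: [1-based start positions]} for each recognition site."""
    # Pasted sequences often contain line breaks, spaces and mixed case.
    seq = "".join(dna.split()).upper()
    found = {}
    for enzyme, site in sites.items():
        positions = []
        start = seq.find(site)
        while start != -1:
            positions.append(start + 1)          # report 1-based coordinates
            start = seq.find(site, start + 1)    # allow overlapping matches
        found[enzyme] = positions
    return found


if __name__ == "__main__":
    for enzyme, positions in map_sites("ttGAATTCggatcc\nAAGCTTgaattc").items():
        print(enzyme, positions)
```

A task of this kind fits the selection criteria well: the input is a small block of text, the parameters are a fixed list of enzymes, and the output is plain text.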

These resources are normally reached through the NIH Molecular Modeling Home Page and the NIH Molecular Biology Resource Page on the NIH WWW Server.

[Note: Not all application links will work from all sites; access is restricted due to licensing agreements and/or institutional policy.]

Discussion
While our efforts in this area are still ongoing, we have found that Mosaic clients and Web-server-linked application programs provide a very powerful and user-friendly computing environment. The advantages of using Web/Mosaic technology to provide access to central computing facilities include:

Traditional Client-Server functionality
- A client-side interface which makes use of the functionality of the client computer and has access to local resources (printers and storage).
- Access to remote high performance computing resources.
- Central maintenance of server-side resources (hardware, software and data).
Added Benefits
- Universality of Interface (one client can interface to many different types of application).
- Development time of interface is "minimal".
- Easy to interface to many existing programs.
- Machine architecture independence.
- Interface is easy to modify and distribute, since the application interface is defined on the server, not the client.

On the development side, we have found that in many ways the selection process for incorporating application programs into the Web/Mosaic environment runs contrary to normal doctrine. The more primitive a program's user interface, the easier it is to incorporate the program into this type of environment. Programs which use cryptic command-line arguments or command files are more easily incorporated than those which prompt the user to select run-time parameters. Programs which take input from "standard in" and stream their output to "standard out" also simplify the procedure.
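The reason such "primitive" programs are easy to incorporate can be seen in a short sketch: form fields map directly onto command-line arguments, the pasted data goes to standard input, and the result is read back from standard output. The example is in Python rather than the C used for the actual gateways. The names `build_argv`, `run_filter` and `DEMO_FILTER`, the `-w` flag and the `width` field are all hypothetical, and the filter is a small stand-in that breaks a sequence into blocks.

```python
import subprocess
import sys

# Stand-in for a command-line filter: reads text on standard input and
# writes it to standard output in blocks of the width given by "-w".
_WRAP_SOURCE = (
    "import sys\n"
    "width = int(sys.argv[sys.argv.index('-w') + 1])\n"
    "text = sys.stdin.read().strip()\n"
    "print(' '.join(text[i:i + width] for i in range(0, len(text), width)))\n"
)
DEMO_FILTER = [sys.executable, "-c", _WRAP_SOURCE]


def build_argv(program, form, option_flags):
    """Translate known form fields into command-line arguments.

    Only fields listed in option_flags are passed on, so unexpected
    fields submitted by a client are ignored.
    """
    argv = list(program)
    for field, flag in option_flags.items():
        if field in form:
            argv += [flag, form[field]]
    return argv


def run_filter(argv, data):
    """Feed data on standard input and collect standard output."""
    # Passing a list (no shell) keeps form values from being interpreted
    # as shell syntax; check=True raises if the program exits with an error.
    done = subprocess.run(argv, input=data, capture_output=True, text=True, check=True)
    return done.stdout


if __name__ == "__main__":
    argv = build_argv(DEMO_FILTER, {"width": "3"}, {"width": "-w"})
    print(run_filter(argv, "ACGTAC\n"), end="")
```

A program that instead prompts for each parameter would need its dialogue scripted or its source modified before it could be driven this way.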

Future Outlook
Based on the success we have had in implementing a number of prototypical examples, we are working to develop a complete package of utilities which addresses a broad spectrum of Computational Chemistry and Molecular Biology tasks. Our goals include integrating this technology into a multi-computer environment in which tasks submitted via Web/Mosaic are passed off to the most appropriate platform.

At present, Web/Mosaic does not fulfill the role of a true "universal interface", since there are many situations in which its limitations preclude its use. However, in its present implementation it can solve many problems, and ongoing development promises even greater functionality in the future.

Biographies:

Peter C. FitzGerald

Academic Training:

B.A.(mod) Biochemistry, Trinity College, Dublin, Ireland - 1978

Ph. D. Biological Chemistry, University of Cincinnati, Medical College - 1983

Present Position:

Chief of the Computational Molecular Biology Section (CMBS).

Division of Computer Research and Technology (DCRT)

National Institutes of Health (NIH)

E-mail: Peter_FitzGerald@nih.gov

Robert A. Pearlstein

Academic Training:

B.A. Chemistry, Case Western Reserve University - 1976

M.S. Macromolecular Science, Case Western Reserve University - 1980

Ph. D. Macromolecular Science, Case Western Reserve University - 1983

Present Position:

Computational Chemist

Computational Molecular Biology Section (CMBS).

Division of Computer Research and Technology (DCRT)

National Institutes of Health (NIH)

E-mail: Robert_Pearlstein@nih.gov

Corresponding Author:

Peter FitzGerald, Bldg. 12A, Room 2008
9000 Rockville Pike
Bethesda, Maryland 20892-5620, USA