MosaicForms Database Access ?: A Palaeobotanic case study

David. A. Gee and Professor Mike Boulter
International Organisation of Palaeobotany,
Palaeobotany Research Unit,
University of East London.
UK.
D.A.Gee@UEL.AC.UK

Abstract

The International Organisation of Palaeobotany's (IOP) Palaeobotanic Research Unit has taken a unique step in the field of Environmental Science by making its Plant Fossil Record Information System (PFRIS) available to the world via the Internet. This multi-disciplinary project not only promises major cultural changes in the field of Palaeobotany but also provides a proven implementation for further biodiversity research.

The World Wide Web, via NCSA's Httpd and Mosaic, has provided a suitable interface infrastructure for the PFRIS database, (PFR2) which contains over 700,000 occurrences and has over 12,000 taxon descriptions. Mosaic has enabled us to overcome the limitations of our terminal based systems and inspired the expansion of the original database record retrieval concept into a multi-media system with dynamic mapping tools, discussion lists, newsletters and publications.

Mosaic's HTML forms provides a very flexible database Interface tool box. With a few minor alterations this hypertext system will be able to compete with purpose built database form systems, with the added advantages of platform independence and inherent Internetability.

Since the first PFRIS release we have been offered many other non-Palaeobotanic data sets to release to the Internet community. Maybe, it is time for the Internet to expand again and absorb this information to provide usable facilities for a diverse audience. This paper will discuss the generic technical and organisational issues that surrounded the construction of the PFRIS, with the aim of assisting the breaking down of barriers and the stimulation of similar projects.

Introduction

The Plant Fossil Record first became available on floppy disks in 1991. It provided details of 10,477 genera with information on their stratigraphic and morphological descriptions, in a computer readable form, the fields being agreed in the "Frankfurt Declaration". Text retrieval software was enclosed with the first release to increase the speed of knowledge acquisition. The major deficiency of this release was the lack of any individual fossil occurrences and their bibliographic information. A major operational issue was the ever increasing amount of data, which multiplied the number of disks required and fuelled the requirements for update disks.

The Plant Fossil Record 2 (PFR2)

Due to the limitations of the PFR1, version 2 of the Plant Fossil Record (PFR2) database was designed and constructed using different computing techniques. The original 43 fields of the "Frankfurt Declaration" were extended and the structure of the database was re-modelled to reflect the heterogeneity of the data sources.[Lhotack Boulter] The original PFR1 method of providing the database on floppy disks was replaced by a single centralised Oracle database. The PFR2 currently includes descriptions and occurrences of approximately 700,000 occurrences of 15,000 genera and there is an on going process of data collection.

Accessing the PFR2

Initially the PFR2 was made available on the Internet via Telnet. This provided suitable connection but, the terminal based system limited the design of the user interface. This limitation resulted in the system being used by only a few computer literate experts. An e-mail interface based on Gary Perlmans HCI database was added this enhanced the connection to users with e-mail only connections but a better solution still had to be found..

The Palaeobotanic community called for an interface "just like my MAC" or " just like Windows". An X-Windows system had been constructed for local use but this was considered to be unsuitable for Internet access due to the extra load place on the system and the limited availability of X-Windows in the palaeocommunity. With the implementation of Mosaic Forms and the Common Gateway Interface (CGI)[Berners-Lee] reduced the load problem and the multiple implementations of Mosaic reduced the requirement for an X server. This also facilitated many opportunities:

A usable interface for the database.

The use of pictures and film.

The production of dynamic maps and statistics.

Interlinking with other WWW database systems to form a single distributed virtual database

An information system for the palaeobotanic community.

These opportunities have enabled us to re-think our original database concept and to produce an International system for information exchange.

DBMS via WWW

Database access can be categorised into three main operations querying, insertion and updating. Querying only requires the system to send information and generally does not affect the state of the database. HTML Forms provides a suitable querying user interface and the CGI combined with DBMS programming interfaces such as Oracles Pro C enables this interface to be constructed.

Insertion and updating affect the state of the database. Inserting is a non destructive action and can be implemented in a similar manner to querying with the CGI interface replying with the confirmation of the insertion. Due to the WWW being a user evoked connectionless system there are problems of concurency control during transaction processing during updating. Most DBMS avoid these problems by using locking to avoid conflicts during the updating.[see Coulouris et al for an overview]. As WWW connections are evoked by the user it is highly probable that the user may not complete the transaction and unlock the database lock thus inconveniencing other users.and reducing the database performance. Three option can be considered:

1. Re write Mosaic to include an automatic call to the database on exit. This is unsuitable due to the problems of Mosaic being unable to send a free locks command due to hardware or software failure. Another major problem is the reduction of the audience due to requirements of a special version of Mosaic.

2. Use time-out locking This would require a system to calculate the optimum time for timing-out a lock. It would also require the WWW server to control the locking of the database and thus increase the load and the number of concurrent processes.

3. Do not allow REAL updates. By producing a temporal database the requirements for updating are avoided and every update becomes a time-stamped insert. The PFR is highly suitable for this method due to palaeontology being based on historical information and the database being based on an underlying static set of information.

Another major problem of insertion and updating transactions is the problem of security. Currently it is only possible to find out the IP number of the machine which has changed the database so WWW database password have to be used. Due to these security issues the PFR only allows a limited number of users to update/insert information into the system Using a temporal system has also provided an adequate level of data integrity. With further research the secure updating of WWW databases could be implemented and allow other database users WWW access.

The User's Database Interface

A major problem with commercial database systems is that the developer becomes tied into using the vendors' interfacing tools. If these tools do not provide certain required features the developer has to produce their own. Most vendors solve this problem by providing the user with a DBMS programming interface. This usually is a database management language, such as SQL which can be embedded in a standard 3rd generation language such as ADA,C or Cobol. Further work is required in building a graphical user interface either the user produced or a standard set of library routines such as X-Windows or the MS-Windows API.

HTML Forms provides a suitable widget toolset for building database interfaces which could replace these standard library routines. It is not as extensive as some of the X-Windows libraries but it has the advantages that it requires less coding. e.g. A simple selection box in Mosaic Forms may take up to 5 or 6 lines of code while a typical X-Windows version may run to over 300 lines. Mosaic Forms also have the advantage that they are interpreted, this gives HTML Forms the advantages of being able to dynamically change at runtime and thus requires a different style of coding..

The other major disadvantages are that some of the libraries require extra networking libraries for full Internet access and there are very few libraries which are totally portable across all of the major sytems i.e. UNIX, MS-DOS/Windows, MAC. The overall requirement is for a complex program which involves databases, networking and GUI libraries. In using HTML forms the networking is inherent in the WWW and the complex GUI code is reduced to a few codes which are interpreted by a client such as Mosaic.

The WWW GUI Database Front End

The traditional terminal based form system , the simplest form of database interface can be easily implemented at the interface level by using the HTML forms fields.e.g. The interconnection between the HTML form and the database requires a specific implementation of a Common Gateway Interface script CGI [Mc COOL].

A URL designated in the HTML form indicates which CGI script should be executed. Most modern database management systems (DBMS) support some form of embedded database access language. By using this embedded language in the CGI script the DBMS can be made accessible by the WWW. This has been successfully implemented on our test systems with MS-Access and Dbase4 using NCSA's HTTP for MS-Windows and Ingres and Oracle databases using NCSA's UNIX HTTP.

The composition of the CGI script is DBMS dependent and currently heterogeneous scripts are not possible. This is confirmed by GSQL [Ng] which requires the developer to write a specific backend for their particular database. This however may change with the advent of a standard data exchange protocol.

The PFR Oracle Implementation

At the database level it is possible to construct a generic system to query tables or views. The initial PFR2 system was built to query a specific table or view as specified in the URL:

HTTP:\\.....\cgi-bin\oratable?table\viewname

The program oratable examined the Oracle DBMS catalogue to find the details of the table\view i.e. field names size and types, and then dynamically constructed a suitable HTML form. The user was then able to enter a query and on pressing the select button the above URL plus the appended the field values were sent back to the oratable CGI script (via http server) . On the receipt of a suitable query oratable connected to and queried the database. The data was then converted into an HTML page and sent back to the user. This generic system was very useful for local database management but proved to be lacking in usability for general remote use and it was also criticised for compromising security.

The oratable system provided a basic set of routines for querying an Oracle database which were implemented into a set of secure application specific CGI scripts. These had the advantage over the generic system as they could be tailored to each specific requirement thus not compromising the usability of each interface.

The PFR2 CGI scripts use two strategies for database connection:

1. A simple HTML form which calls a CGI program that interrogates the database and then presents the output to the user as another HTML form or a simple HTML page .

2. A CGI program which dynamically constructs an initial form. On the return of a suitably filled form it queries the database and presents the user with the output of the database as another form or a HTML page.

The first method reduces the system load as a remote client can have a local copy of the initial HTML form. The second has the major advantage of providing a more flexible, structured and controlled form of interaction. The PFR2 implements both strategies.

An example of the second strategy can be seen by calling the following URLs This URL calls the fossil description CGI-database script with a wildcard. i.e. give me every record starting with aa%. As the descriptions are quite long it is only practical to display one at a time so the script will produces an HTML Forms listbox with all of the genera starting with aa%, thus allowing the user to narrow their query.

As soon as the query reduces the probable output to a single description e.g. Aachenipollis the same program will return the description of the genus.

Dynamic Maps

The occurrence screen uses the first strategy. The user is provided with a simple HTML fill in form and has the options to query and sort on various fields. This produces a SQL statement which is executed. The output depends on the users preference in the selection field:

A Report produces a standard table.

The Map uses the latitudes and longitudes contained in the database to plot dynamically the fossil locations onto a world map. This uses the pbm routines which enable a programmer to add bitmaps to bitmaps and then to convert them into a Mosaic readable GIF.

The returning GIF is marked as an imagemap (IS MAP). If the user clicks on the map, Mosaic sends back the co-ordinates. With simple mathematics these co-ordinates are converted into longitudes and latitudes. This is then used to requery the database and the records at these co-ordinates can be displayed.

Distributed WWW biodiversity database

The PFR2 contains two independent databases , the first being the original PFR database which contains Fossil description in a textbased format and the second a Fossil occurrence database in a normalised relational format. They both contain a genus field yet due to performance reasons it is not profitable to combine the datasets to produce a single database. Each of these databases has its own separate CGI script. When called each of the CGI scripts generates a URL to call the other if it is required later. At the end of the Aachenipollis description query there are two buttons which were generated by the script at runtime. These buttons are contextualised hyperlinks to the occurrence database. (Note the buttons will not be created if there are 0 occurrences.) On pressing the the buttons the appropriate CGI-Database program and fields i.e. genus= Aachenipollis is called. The only requirement to make this system work is the knowledge of the location and format of the URL and its respective fields.

The The system would have to understand some level of context: Veronica has some problems with context, as many biodiversity scientist have found when querying for viruses. They become quite surprised at the response until they realise that Veronica has returned computer viruses as well as biological viruses. Another issue is fields which contain sets or subsets of fields. In the PFR2 the Smithsonian gopher link is a single field which uses a text retrieval system so there are no specific fields. e.g. the field can be used for genus, species, author, location. As genera are quite unique it is highly likely that the correct information is retrieved.

It is intended that the PFR2 system will contain all of the links to fossil plant databases and catalogues on the WWW and gophers and some interlinks to modern plants where possible within the next year. (Dependant on the owners consent and the systems being found.)

Conclusion

The WWW and Mosaic has oppened new opportunites to the PFR2. HTML Forms provides a suitable interface for databases and the CGI has provided an infrastructure to connect databases to the WWW. The WWW version of thw PFR2 is still in its experimental stage, even at this young stage it has shown how the WWW can provide an infrastructure for supporting and querying distributed databases.

I would like to acknowledge the great pioneers of the WWW and Mosaic and I hope that with these tools and good will we can keep the Internet as a place of peace and freedom.

References

Berners-Lee, T. Hypertext Transfer Protocol Requirements. CERN

Coulouris G. and Dollimore, J. Distributed Systems, Addison Weseley 1988

Lhotack, M and Boutler M, Towards the creation of an International database of palaeontology, Geological society special publication geological data and its management, In press.

McCool, R. Common Gateway Overview NCSA, Work in progress.