Interactive Database Communication Across the World Wide Web: Accessing Statistics Canada's CANSIM Database

Project Manager - Brian Kroeker
Project Design/Budget - Hilde Colenbrander
Lead Programmer - Henry Fan
Technical Assistance on the Web - George Phillips

Abstract

For many years now, the UBC Data Library has been subsetting data for library patrons from large, mainly numeric databases, using a variety of home-grown programs run on a mainframe-style, time-sharing system. Users have also been able to run their own extractions using the same programs and databases if they are knowledgeable enough about the system and its use. The mainframe system is highly reliable, but as the years go by it is being superseded on campus by large numbers of smaller, faster, and much cheaper-to-run systems. Patrons are becoming accustomed to the newer, mainly unix-based systems, and as a result our older mainframe system is becoming dated and time-consuming to support; in fact, our central computing unit is actively attempting to discourage its use. To maintain service to our patrons in this changing situation, we have been investigating both client-server software for database extraction and uses of the World-Wide Web (W3) and Mosaic software to disseminate information and services from the newer, smaller systems, as these technologies appeared to offer us the most cost-effective solution to our problem of disappearing, centralized mainframe service. This paper will discuss a project we undertook at the UBC Data Library to address some of these issues.

Reasons for going with the World Wide Web and Mosaic

The world of mainframe computing has fragmented into hundreds of different types of processing platforms, the fragmentation being driven mostly by end-user desires to directly control and customize their personal computing environments.
Centralized computer systems that only communicate with dumb terminals, are inflexible, or are too expensive to maintain are being dropped at UBC and elsewhere. Mosaic and the W3 tempt our former mainframe users (and some who would like to be former mainframe users) with the seeming ability to unify many of their older mainframe computing tasks under a powerful common interface, an interface which is somewhat under their control. Many of our users have had to abandon old, reliable applications entirely and learn brand new ones, as the old applications could not be made to work on newer, vastly cheaper computing platforms. Some units within UBC have opted for complete re-writes of older applications, a very expensive and time-consuming exercise, but a necessary one when a new computing platform did not support an existing, acceptable software alternative. A very few users were lucky to have been working with software products, like Oracle, that could be ported from one machine to the next.

Our users today understand better what it is they want, and are more vocal about getting it. They expect that the computing resources on their PCs and workstations will be fully utilized by newer, 'smarter' software that may run on their platform, and perhaps somewhere else as well. Data and information are expected to flow from remote machines to theirs, and to be presented using the native capabilities of the local machine. To this end, many programmers in many departments at UBC have been engaged in writing client-server applications that will satisfy these desires. Seeing the level of systems integration possible with Graphical User Interfaces (GUIs) on their local machines, our users naturally expect this level of integration from Internet-accessible applications as well. We must have an alternative to the central mainframe and time-sharing services, as our patrons will soon no longer have access to them.
Computing at UBC has become an individual departmental responsibility, and the Computing Center is being converted into a cost-recovered operation. As a cost-recovered unit, the Computing Center will not be able to provide universal access to centralized, general time-sharing computer systems; these types of services will be expensive. The Library, which depends on universal access - though not on general time-shared computing per se - therefore cannot depend on the Computing Center to provide this service in the future, and must seek out ways to provide the kinds of centralized services it requires on its own. Our work on mounting services on the W3 is part of that search.

Discoveries we've made - Problems we've found

Popularity of the W3
Lack of user understanding
Reliability is key
Two sides to W3
Need for interface design
Lack of programmer experience
W3 is not conversationally interactive

Popularity of the W3

The W3 software, or at least what most users see of it via various browsers such as Lynx and Mosaic, is incredibly popular. Literally tens of thousands of copies of these browser products are being downloaded from various Internet sites every month. This is good from our point of view, as it means any services we offer on our W3 server will meet with a potentially large base of patrons who will be able to access us. They will be familiar with hypertext, and with how to navigate the W3.

Lack of user understanding

As with any software product, there are various levels of expertise among the user community. One seeming drawback of W3 software is that most administrative, decision-making people who see a Mosaic browser in action often fail to understand both the difference between the browser and the W3 server and the amount of effort required to mount a W3 service.
Many administrative people see only the browser, and a desire is created to see older, more well-established software applications under their jurisdiction front-ended by Mosaic. On the surface, the thought of "If only our program X was accessible from the Web!" is an enchanting one for information administrators, holding out the promise of integrating information from many diverse systems and giving them a common "look-and-feel" from the user's point of view. The process of converting this vision into reality, however, requires a high investment in resources, both people and time, aimed at careful planning and trial-and-error experimentation, with an emphasis on understanding just what to expect from Mosaic and W3 software. Programming existing computer applications to run in concert with W3 and Mosaic software is not as easy as simply plugging in new batteries.

Reliability is key

Mounting a computer service on the Internet is not a trivial exercise. At the UBC Data Library, as at many other institutions, reliability is a key factor in everything we do. Our patrons trust us to a large degree to have the information they need, when they need it. We decided therefore to buy a workstation-class machine and provide it with its own connection to the campus backbone, so that it would not have to compete with other machines for bandwidth on the already overloaded Library LAN. We rejected the idea of using a PC for this application, as we felt that a PC would not have the reliability or power to serve data to the campus, and potentially the world.

Two sides to W3

It was only vaguely understood at the beginning that by installing a W3 server, we could not only disseminate hypertext documents and files, but actually run programs accessible via browser interfaces. The hypertext side can be maintained by a qualified clerical person, but the programming side needs a top-quality programmer.
Need for interface design

Adding Mosaic-W3 accessibility to application X presupposes more than just making it accessible via a W3 server. To make the new system look and act its best, an entirely new interface design must be generated. Integration with the Web, after all, brings on many new challenges and possibilities in information gathering and dissemination. Most important is the ability to add enormous amounts of metainformation, or information about information, to our services. With proper interface design of W3 applications, we can make information about our data files and services available right alongside the data extraction programs themselves, so that the user can "travel" between documentation and extraction easily and simply.

Lack of programmer experience

As the W3 and Mosaic are relatively new (and highly complex) products, many programmers outside academic institutions or big, leading-edge firms with large research arms may be unsure how to develop W3-Mosaic interfaces to their systems, as these products are probably outside their formal education and training. At UBC, although we have investigated and worked with other client-server programs, we learned many new concepts that were not covered in our formal education when dealing with the W3 and Mosaic. The time investment here, in terms of education and training as well as of discovering what is and is not possible, is high. A two- or four-hour course on the W3 will help, but it will not make anyone a programming expert. The documentation available on the W3, about the W3, is fantastic, but this presupposes that programming staff can acquire this information and know how to integrate it with their existing knowledge. It presupposes that they have the time to sit down and experiment with this new knowledge, to find out about the W3's inner workings, and to learn how to manipulate this environment to suit their needs.
W3 is not conversationally interactive

A further complication arises when programmers who are used to existing applications that may already be highly interactive in nature notice that typical Mosaic-W3 transactions are all one-shot, with no memory of previous transactions maintained by either side. For all its appearance of interactivity, the Mosaic-W3 interface was not designed to be truly interactive in a conversational sense, something many sophisticated programs today rely on. True interaction with a library database often means that the user will ask a few questions, then proceed based on those questions, taking two steps forward, one back, and another off in another direction entirely. It is often necessary in such scenarios to remember multiple past events, in case the user wishes to backtrack or use information from a previous step.

Deciding upon a W3 project

Mosaic and the W3 have excited everyone at the UBC Library with their seeming ability to bridge the gap between our different operating systems and become a one-stop shopping point for the delivery of all computing services, or at least most of the ones where we need to communicate with other programs and people over a network. The Library wanted to see some useful applications of this technology, and was looking for opportunities to fund. To this end, we decided to investigate whether or not we could actually build a Mosaic-W3 interface to an existing application. We chose a client-server application that extracts data from a set of databases.

Assisting us in the push to get onto a unix platform owned and operated by the Library is the decision at UBC to effectively cancel general time-sharing services by charging commercial rates to the campus community. Without a central time-sharing system available to every student, departments have had to purchase their own computing power to provide their own students and faculty with computing resources.
Linked together by the campus Ethernet, database server machines are becoming the way to go for those wishing to provide cost-effective services that don't charge the end user directly. The Library will thus increasingly own and operate the computers that provide centralized database access to the campus community.

The database we used for our project is called CANSIM, which stands for the CANadian Socio-economic InforMation database, and is produced in Canada by Statistics Canada. Kobus Barnard and Walter Piovesan of Simon Fraser University, also in British Columbia, created the client-server program that they dubbed RDL (for Research Data Library) to extract data from a number of online data files, including CANSIM. This program will run on a number of different computing platforms, and will provide the user with either a dumb-terminal-style interactive interface or a command-line interface executable from a shell script. This ability is typical of many mainframe-style programs. Unlike many mainframe-style programs, however, RDL is actually a client-server application. Like many full-word-search-style programs, RDL benefits from a continual conversation between the user and the database extraction program. To have used this application without its conversational features would still have been possible, but we decided that this would be perceived as a backwards step by the seasoned RDL user, who would have noticed a distinct lack of functionality.
Goals of the project

Our goals were to:

Provide extensive meta-information about CANSIM and how to use it
Provide for password protection (the database is a licensed product)
Keep a single RDL process running per Mosaic connection
Be able to continuously feed the running RDL process user commands
Get data back to the Mosaic user from the RDL process
Create a system-level job that would execute every so often to kill off unwanted processes

To accomplish these goals, we downloaded a copy of the W3 server from CERN and installed it on our machine, a SparcStation 10 from Sun Microsystems running the Solaris 2.3 operating system. With the splendid documentation made available from CERN, plus some code fragments made available by David Barber of the University of Michigan, we were able to piece together the design of what we wanted our system to look like.

System design criteria

Since we had chosen to use a unix workstation, it made sense to us at the time to use ANSI C for our program coding. C had the programming tools we needed, and there was a lot of familiarity with C programming at UBC. We decided not to contact the RDL server directly from our Mosaic-W3 interface, although we could have, but instead opted to execute the RDL client program to process RDL database extraction commands for us. We discovered that if our Mosaic-W3 software contacted the RDL server directly, we would then have to duplicate much of what the RDL client does already. Driven through the RDL client, the RDL program behaves much like many older mainframe-style programs such as SAS or SPSSx, where commands are fed to the program from an input device (a file or a terminal), and data are delivered to an output device (another file or terminal). Our process does not have a controlling terminal; neither the Mosaic client nor the W3 server constitutes a controlling terminal.
It is important to note at this point that this technique can be used with any program that can be fed commands in a command-style mode from a script or batch stream. This applies to many commercial database packages and library cataloguing systems, so our project has a very wide potential audience. A few systems may have been written so that they must have an attached terminal as input and output; these types of systems cannot use our approach without modification.

Initially, we were going to accept the one-shot nature of Mosaic-W3 transactions and limit our design to one that would support a no-memory RDL. We discarded this idea in favour of a conversational-mode approach after we installed the W3 server and had a chance to test it out and read some CERN documentation. It soon became apparent that RDL without a previous transaction history would be too cumbersome and clumsy, and that the user would notice a distinct lack of functionality when compared to the dumb terminal version currently available on the central unix time-sharing computing service.

We wanted the user to be able to proceed from reading documentation about RDL-CANSIM to a hyperlink that started a CANSIM extraction. At this point, a connection would be made to a password server, as CANSIM is a licensed product and non-UBC patrons are not allowed access. The user would have to identify themselves and be authorized to continue. This process would not rely on the password facilities built in to the current level of the CERN server, because we in the Data Library do not control the password file for the Library as a whole. Password administration must be centralized to the campus, as this is a huge job requiring many staff members, so it is looked after by the central Library authority. It was felt that every bona fide member of the campus would have a student/staff/faculty library card, or could apply for one if they wished access to CANSIM.
At the time of writing, our querying of the Library's patron file has not been done, although the database and its password server do exist. Once into the RDL-CANSIM extraction portion of the system, the user would be presented with a series of pre-designed forms that would urge him/her to supply the information necessary to execute a search or an extraction. The user can get multiple results from multiple queries, and refine their searches for information based on this previous information. Any files that they create from successful extractions can be downloaded using the built-in features of Mosaic. Patrons would not have to "sign off" such a system when they are done, as the nature of the W3 allows them to go somewhere "else" at any time anyway; a sign-off procedure would be pointless, as it would be missed more often than used. In order to stop all of the processes involved (described below) and free up unused file space, a cron job will execute every few minutes to kill unused processes and free up file space. Anyone "coming back" to their RDL-CANSIM session after the allowed time limit had expired would find all of their previous work destroyed, and they would have to start again.

The nuts and bolts

The system design is now like this: for each Mosaic client session, after bringing up the RDL-MOSAIC page, the user can choose to extract data from the selected database (CANSIM is one of many databases that RDL can serve up). This action will send our W3 server a command to begin the RDL startup cgi-script. This is the 'parent' script. Normally, such a parent script would call one or more programs to perform some work, write their results to stdout, then exit. Before the W3 server can send data back to the Mosaic client, though, it must make sure that two conditions are met: the cgi-script the W3 server started must have exited, and stdout and stderr for this cgi-script must be closed.
The stdout and stderr are normally closed automatically by the operating system when the cgi-script exits, which is invisible to the programmer. This distinction is important, as we note in the next paragraph. After transmission from the W3 server to the Mosaic client is complete, the Mosaic-W3 connection is severed. The reliance of the W3 server on the closing of stdout and stderr, as well as on the exiting of the cgi-script, is what normally forces Mosaic-W3 transactions to be one-shot.

In our case, the parent process forks a child process. This child is passed the same environment variables that the parent inherited from the W3 server, which contain the initial RDL startup commands written into the RDL-CANSIM page. As well, the two processes share the same file descriptors for stdout, stdin, and stderr. Any time after this, the parent can exit, because the child will continue to execute. The parent process terminates so that the W3 server is convinced that the cgi-script it started has exited. Despite the fact that the cgi-script the W3 server started has terminated, the child process we started owns the same file descriptors for stdin, stdout, and stderr as the terminated parent process, so these files are not actually closed by the unix kernel. Therefore both conditions have not yet been met for the W3 server to be able to pass any data on stdout back to the Mosaic client; the W3 server must wait until this event occurs. When the child closes stdout and stderr, the W3 server will be satisfied, and any data on stdout will be written back to the Mosaic client.

Alone in the world now, the child process executes a program that defines a socket for incoming commands, and stores the address of the socket in a file in the operating system's filesystem, in a known, fixed location. The name of the file containing the socket address is the process ID (pid) of the child process. Pids are unique, so we are guaranteed a unique file name.
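The descriptor-inheritance trick above can be demonstrated in miniature. In this sketch a pipe stands in for the cgi-script's stdout and the reader plays the W3 server; the function name is ours, not the project's. The key point is that the reader sees no end-of-file when the "parent" gives up its copy of the descriptor, only when the forked child closes its inherited copy:

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/* The parent "cgi-script" forks and immediately drops its copy of the
 * write end, but the child still holds an inherited copy, so the reader
 * (our stand-in for the W3 server) blocks until the child closes it. */
ssize_t run_detached_child(char *buf, size_t len)
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == 0) {                  /* child: inherits fds[1] */
        close(fds[0]);
        write(fds[1], "FORM", 4);    /* later: the HTML form */
        close(fds[1]);               /* only now does the reader see EOF */
        _exit(0);
    }

    /* parent closes its copy and exits its cgi role at once */
    close(fds[1]);

    ssize_t total = 0, n;
    while ((n = read(fds[0], buf + total, len - total)) > 0)
        total += n;                  /* blocks until the child closes */
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return total;
}
```

In the real system the inherited descriptor is the W3 server's stdout rather than a local pipe, but the kernel behaviour being exploited is the same.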
It might be cleaner, and lead to faster program execution, to store the address and any associated information in a C data structure instead, but this was thought to be a little more work, so we decided to implement it at another time.

Another child, a grandchild to the first process, is now created to handle the execution of the RDL client software. This RDL grandchild process closes its stdout, stderr, and stdin after a pipe (two files that act as buffers) is set up for interprocess communication (ipc). With this pipe properly set up, the grandchild can accept commands continuously from its parent (the child), and send the child results when it has any to pass on. The initial RDL commands, which begin the RDL program, are piped through to the RDL grandchild from the child. After receiving confirmation of RDL's startup from the grandchild, the child writes a form on its stdout, which contains the child's own pid embedded in the URL of the form. To pass the form on, the child closes its stdout and stderr, and the W3 server reacts at last by sending the Mosaic client the form. The child process' program then enters an endless loop in which it listens on its socket for incoming calls from the external world.

When the Mosaic user submits the returned form with its embedded pid, plus any additional RDL commands made available by this form, the W3 server starts up the cgi-script named in this form. This script is different from the first one, and it in turn executes a different program. This program will handle the processing of all RDL commands other than the initial program startup. It strips the embedded pid from the URL it receives and tries to open a file with that name. If it is successful, it will open the file and read the socket address. It will connect, if it can, to that socket address and, if successful, will pass the new parsed RDL commands to the waiting child. The child sends them on to the grandchild via the ipc link, which in turn returns the results to the child when it is done.
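The pid-file rendezvous described above can be sketched as a pair of functions: the child records its socket address in a file named after its own pid, and the later cgi process looks the address up by the pid it stripped from the URL. The directory, file-name prefix, function names, and address format here are all illustrative assumptions, not the project's actual layout:

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>

/* Child side: publish this session's socket address under a
 * pid-named file in a known, fixed location. */
int publish_address(pid_t pid, const char *addr)
{
    char path[64];
    snprintf(path, sizeof path, "/tmp/rdl-%ld", (long)pid);
    FILE *f = fopen(path, "w");
    if (!f) return -1;
    fprintf(f, "%s\n", addr);
    return fclose(f);
}

/* Follow-up cgi side: given the pid embedded in the submitted form's
 * URL, recover the socket address to connect to.  A missing file means
 * the session has expired (or the pid is bogus). */
int lookup_address(pid_t pid, char *addr, size_t len)
{
    char path[64];
    snprintf(path, sizeof path, "/tmp/rdl-%ld", (long)pid);
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    if (!fgets(addr, (int)len, f)) { fclose(f); return -1; }
    addr[strcspn(addr, "\n")] = '\0';
    return fclose(f);
}
```

The cron cleanup job works against the same files, using each file's last-access time to decide which sessions to destroy.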
The child then sends the results back to the waiting connected process, which writes them inside another form to its stdout. It then exits. The W3 server is satisfied that the cgi-script has exited, and stdout and stderr have been closed for this process, so it sends the waiting form back to the Mosaic client. The Mosaic client displays the resulting form. The form that the calling process sends back contains the same pid (of the child), so that the Mosaic client can pass it back again on the next request. The result of this is that the child and grandchild are kept alive for as long as we wish, able to accept commands from other processes started by the same W3 server. Uniqueness is assured using the pid.

We kill off the child and the grandchild after a predetermined time, when we decide that the Mosaic user hasn't processed a transaction within a specified amount of time. On unix, this is provided for by the cron facility. The job we will create here periodically reads each of the pid-named files that remain in the specified filesystem area, and compares the time stamp of when each file was last 'visited' to the current system time. If a file has not been accessed within the allotted maximum time frame, then that file is read to determine the pids of the child and grandchild for this RDL-CANSIM-MOSAIC session. They are then killed, and the pid file is deleted. The user is effectively 'signed off' the RDL-CANSIM-MOSAIC system. As we have not built this part of the system yet, we are at this moment unsure precisely what information we will need to store in the pid-named files.

Conclusion - or - Do we STILL have MORE work to do?

Since we needed to send forms back to the user from within some of our programs, we chose to hard-code the forms. It should be possible to generalize this solution so that generic forms, derived by program logic and pre-stored in the machine's filesystem, can be sent back to the user.
This feature could make our code solution much more straightforward for others to implement at their sites, and much more flexible besides, making distribution and installation of the code easier. We may want to re-implement the portion of the system that deals with the filesystem for storing pids; we may simply wish to store pid aging information in a C data structure to improve performance and simplify the cron job for killing old processes. We still need to tackle the problem of user authentication. The work on the password server side has been done (for another project), but we haven't hooked into it yet for our purposes. Lastly, we have to make available the cron job that will kill off all of the expired processes and free up the file space used.

With our solution, it may no longer be necessary to totally discard many of our older programs. If additional software can be built on top of the older applications that will allow them to run on the same platform that also runs the W3 server, then many useful applications can be saved at least partially, and the development costs of new systems may be contained. Being accessible from the W3 frees the Data Library from any need for central time-sharing services; the onus is placed on departments to network machines and to provide local support for staff, faculty, and students. This course of action, using the W3 to interface to a program, makes high demands in education and experimentation time. The education required to build our software was costly in terms of time: many new concepts were learned by us for this project, and not a few blind alleys were followed until they proved unprofitable. What we have gained, though, is, we think, very well worth the effort.
We have gained an understanding of a system that will allow for the eventual integration of most of the data files that we wish to offer on-line to our users, plus an unprecedented opportunity for the integration of many other sources of information, such as the Library's online public access catalogue.

Authors

Brian Kroeker - Obtained his B.Sc. in 1983 from the Department of Computer Science at SFU. He is currently employed as a Programmer/Analyst for the UBC Data Library. Mr. Kroeker provided initial and ongoing program design support, as well as project implementation details and project management. Mr. Kroeker may be reached at: kroeker@datalib.ubc.ca

His publications include: The World Wide Web: Data Library strategies for dealing with the new information frontier. Presented at the IASSIST 94 Conference in San Francisco, USA, May-June 1994. To be published in the IASSIST Quarterly, a newsletter of the International Association for Social Science Information Service and Technology. ISSN - United States: 0739-1137

Hilde Colenbrander - Obtained her B.A. in 1973 from the University of Natal, South Africa, and her B.L.S. (Hon.) from the University of Pretoria (1984), South Africa. She is currently engaged as Head, UBC Data Library, and Electronic Information Services Coordinator for the UBC Library. Ms. Colenbrander provided initial project design and sought out funding for both a student programmer and a machine to run our programs on. She is an ardent supporter of increasing the public's access to computerized information via campus, national, and international networks.

Henry Fan - Currently enrolled in the Honours Computing Science program in the Department of Computing Science at UBC (4th year). Mr. Fan joined the UBC Data Library under the summer student program, and worked on most of the coding and program development for our W3 project.

George Phillips - Obtained his B.Sc. in 1987 and his M.Sc. in 1992 from the Department of Computer Science at UBC.
He is currently employed as a Systems Manager for Imager/GraFic, IAM and CICSR. Mr. Phillips is a leading proponent of the development of W3 servers on the UBC campus. He has written his own W3 server, and has consulted widely on cgi and the W3. Mr. Phillips assisted the Data Library with his expertise on the W3, and provided some technical programming solutions when the going got tough.

His publications include:

"The Ship Model - A New Computational Model for Distributed Systems". Presented at TENCON '90 on Computer and Communication Systems, Hong Kong, Sept. 1990, pp. 134-140. (IEEE)

"The Ship Model for Communication and Mobility in Distributed Systems". To appear in the First International Workshop on Services in Distributed and Networked Environments (SDNE '94), Prague, June 1994. (IEEE)