Interactive Database Communication Across the World Wide Web: Accessing Statistics Canada's CANSIM Database

Project Manager - Brian Kroeker
Project Design/Budget - Hilde Colenbrander
Lead Programmer - Henry Fan
Technical Assistance on the Web - George Phillips

Abstract

For many years now, the UBC Data Library has been subsetting data for library patrons from large, mainly numeric databases, using a variety of home-grown programs run on a mainframe-style, time-sharing system. Users have also been able to run their own extractions using the same programs and databases if they are knowledgeable enough about the system and its use. The mainframe system is highly reliable, but as the years go by it is being superseded on campus by large numbers of smaller, faster, and much cheaper-to-run systems. Patrons are becoming accustomed to the newer, mainly unix-based systems, and as a result our older mainframe system is becoming dated and time-consuming to support; in fact, our central computing unit is actively attempting to discourage its use. To maintain service to our patrons in this changing situation, we have been investigating both client-server software for database extraction and uses of the World-Wide Web (W3) and Mosaic software to disseminate information and services from the newer, smaller systems, as these technologies appeared to offer us the most cost-effective solution to our problem of disappearing, centralized mainframe service. This paper will discuss a project we undertook at the UBC Data Library to address some of these issues.

Reasons for going with the World Wide Web and Mosaic

The world of mainframe computing has fragmented into hundreds of different types of processing platforms, the fragmentation being driven mostly by end-user desires to directly control and customize their personal computing environments.
Centralized computer systems that only communicate with dumb terminals, are inflexible, or are too expensive to maintain are being dropped at UBC and elsewhere. Mosaic and the W3 tempt our former mainframe users (and some who would like to be former mainframe users) with the seeming ability to unify many of their older mainframe computing tasks under a powerful common interface, an interface which is somewhat under their control. Many of our users have had to abandon old, reliable applications entirely and learn brand new ones, as the old applications could not be made to work on newer, vastly cheaper computing platforms. Some units within UBC have opted for complete re-writes of older applications, a very expensive and time-consuming exercise, but a necessary one when a new computing platform did not support an existing, acceptable software alternative. A very few users were lucky to have been working with software products, like Oracle, that could be ported from one machine to the next.

Our users today understand better what it is they want, and are more vocal about getting it. They expect that the computing resources on their PCs and workstations will be fully utilized by newer, 'smarter' software that may run on their platform, and perhaps somewhere else as well. Data and information are expected to flow from remote machines to theirs, and to be presented using the native capabilities of the local machine. To this end, many programmers in many departments at UBC have been engaged in writing client-server applications that will satisfy these desires. Seeing the level of systems integration possible with Graphical User Interfaces (GUIs) on their local machines, our users naturally expect this level of integration from Internet-accessible applications as well. We must have an alternative to the central mainframe and time-sharing services, as our patrons will soon no longer have access to them.
Computing at UBC has become an individual departmental responsibility, and the Computing Center is being converted into a cost-recovered operation. As a cost-recovered unit, the Computing Center will not be able to provide universal access to centralized, general time-sharing computer systems; these types of services will be expensive. The Library, which depends on universal access - though not on general time-shared computing per se - therefore cannot depend on the Computing Center to provide this service in the future, and must seek out ways to provide the kinds of centralized services it requires on its own. Our work on mounting services on the W3 is part of that search.

Discoveries we've made - Problems we've found

Popularity of the W3
Lack of user understanding
Reliability is key
Two sides to W3
Need for interface design
Lack of programmer experience
W3 is not conversationally interactive

Popularity of the W3

The W3 software, or at least what most users see of it via various browsers such as Lynx and Mosaic, is incredibly popular. Literally tens of thousands of copies of these browser products are being downloaded from various Internet sites every month. This is good from our point of view, as it means any services we offer on our W3 server will meet with a potentially large base of patrons who will be able to access us. They will be familiar with hypertext, and with how to navigate the W3.

Lack of user understanding

As with any software product, there are various levels of expertise among the user community. One seeming drawback of W3 software is that most administrative, decision-making people who see a Mosaic browser in action often fail to understand both the difference between the browser and the W3 server and the amount of effort required to mount a W3 service.
Many administrative people see only the browser, and a desire is created to see older, more well-established software applications under their jurisdiction front-ended by Mosaic. On the surface, the thought of "If only our program X was accessible from the Web!" is an enchanting one for information administrators, holding out the promise of integrating information from many diverse systems and giving them a common "look-and-feel" from the user's point of view. The process of converting this vision into reality, however, requires a high investment in resources, both people and time, aimed at careful planning and trial-and-error experimentation, with an emphasis on understanding just what to expect from Mosaic and W3 software. Programming existing computer applications to run in concert with W3 and Mosaic software is not as easy as simply plugging in new batteries.

Reliability is key

Mounting a computer service on the Internet is not a trivial exercise. At the UBC Data Library, as at many other institutions, reliability is a key factor in everything we do. Our patrons trust us to a large degree to have the information they need, when they need it. We decided therefore to buy a workstation-class machine and provide it with its own connection to the campus backbone, so that it would not have to compete with other machines for bandwidth on the already overloaded Library LAN. We rejected the idea of using a PC for this application, as we felt that a PC would not have the reliability or power to serve data to the campus, and potentially the world.

Two sides to W3

It was only vaguely understood at the beginning that by installing a W3 server, we could not only disseminate hypertext documents and files, but actually run programs accessible via browser interfaces. The hypertext side can be maintained by a qualified clerical person, but the programming side needs a top-quality programmer.
Need for interface design

Adding Mosaic-W3 accessibility to application X presupposes more than just making it accessible via a W3 server. To make the new system look and act its best, an entirely new interface design must be generated. Integration with the Web, after all, brings on many new challenges and possibilities in information gathering and dissemination. Most important is the ability to add enormous amounts of metainformation, or information about information, to our services. With proper interface design of W3 applications, we can make information about our data files and services available right alongside the data extraction programs themselves, so that the user can "travel" between documentation and extraction easily and simply.

Lack of programmer experience

As the W3 and Mosaic are relatively new (and highly complex) products, many programmers outside academic institutions or big, leading-edge firms with large research arms may be unsure how to develop W3-Mosaic interfaces to their systems, as these products are probably outside their formal education and training. At UBC, although we have investigated and worked with other client-server programs, we learned many new concepts that were not covered in our formal education when dealing with the W3 and Mosaic. The time investment here, in terms of education and training as well as of discovering what is and is not possible, is high. A two- or four-hour course on the W3 will help, but it will not make anyone a programming expert. The documentation available on the W3, about the W3, is fantastic, but this presupposes that programming staff can acquire this information and know how to integrate it with their existing knowledge. It presupposes that they have the time to sit down and experiment with this new knowledge, to find out about the W3's inner workings, and to learn how to manipulate this environment to suit their needs.
W3 is not conversationally interactive

A further complication arises when programmers who are used to existing applications that may already be highly interactive in nature notice that typical Mosaic-W3 transactions are all one-shot, with no memory of previous transactions maintained by either side. For all its appearance of interactivity, the Mosaic-W3 interface was not designed to be truly interactive in a conversational sense, something many sophisticated programs today rely on. True interaction with a library database often means that the user will ask a few questions, then proceed based on those questions, taking two steps forward, one back, and another off in another direction entirely. It is often necessary in such scenarios to remember multiple past events, in case the user wishes to backtrack or use information from a previous step.

Deciding upon a W3 project

Mosaic and the W3 have excited everyone at the UBC Library with their seeming ability to bridge the gap between our different operating systems and become a one-stop shopping point for the delivery of all computing services, or at least most of the ones where we need to communicate with other programs and people over a network. The Library wanted to see some useful applications of this technology, and was looking for opportunities to fund. To this end, we decided to investigate whether or not we could actually build a Mosaic-W3 interface to an existing application. We chose a client-server application that extracts data from a set of databases.

Assisting us in the push to get onto a unix platform owned and operated by the Library is the decision at UBC to effectively cancel general time-sharing services by charging commercial rates to the campus community. Without a central time-sharing system available to every student, departments have had to purchase their own computing power to provide their own students and faculty with computing resources.
Linked together by the campus Ethernet, database server machines are becoming the way to go for those wishing to provide cost-effective services that don't charge the end user directly. The Library will thus increasingly own and operate the computers that provide centralized database access to the campus community.

The database we used for our project is called CANSIM, which stands for the CANadian Socio-economic InforMation database, and is produced in Canada by Statistics Canada. Kobus Barnard and Walter Piovesan of Simon Fraser University, also in British Columbia, created the client-server program that they dubbed RDL (for Research Data Library) to extract data from a number of online data files, including CANSIM. This program will run on a number of different computing platforms, and will provide the user with either a dumb-terminal-style interactive interface or a command-line interface executable from a shell script. This ability is typical of many mainframe-style programs. Unlike many mainframe-style programs, however, RDL is actually a client-server application. Like many full-word-search-style programs, RDL benefits from a continual conversation between the user and the database extraction program. To have used this application without its conversational features would still have been possible, but we decided that this would be perceived as a backwards step by the seasoned RDL user, who would have noticed a distinct lack of functionality.
Goals of the project

Our goals were to:

Provide extensive meta-information about CANSIM and how to use it
Provide for password protection (the database is a licensed product)
Keep a single RDL process running per Mosaic connection
Be able to continuously feed the running RDL process user commands
Get data back to the Mosaic user from the RDL process
Create a system-level job that would execute every so often to kill off unwanted processes

To accomplish these goals, we downloaded a copy of the W3 server from CERN and installed it on our machine, a SparcStation 10 from Sun Microsystems running the Solaris 2.3 operating system. With the splendid documentation made available from CERN, plus some code fragments made available by David Barber of the University of Michigan, we were able to piece together the design of what we wanted our system to look like.

System design criteria

Since we had chosen to use a unix workstation, it made sense to us at the time to use ANSI C for our program coding. C had the programming tools we needed, and there was a lot of familiarity with C programming at UBC. We decided not to contact the RDL server directly from our Mosaic-W3 interface, although we could have, but instead opted to execute the RDL client program to process RDL database extraction commands for us. We discovered that if our Mosaic-W3 software contacted the RDL server directly, we would then have to duplicate much of what the RDL client does already. Driven through the RDL client, the RDL program behaves much like many older mainframe-style programs such as SAS or SPSSx, where commands are fed to the program from an input device (a file or a terminal), and data are delivered to an output device (another file or terminal). Our process does not have a controlling terminal; neither the Mosaic client nor the W3 server constitutes a controlling terminal.
It is important to note at this point that this technique can be used with any program that can be fed commands in a command-style mode from a script or batch stream. This applies to many commercial database packages and library cataloguing systems, so our project has a very wide potential audience. A few systems may have been written so that they must have an attached terminal as input and output; these types of systems cannot use our approach without modification.

Initially, we were going to accept the one-shot nature of Mosaic-W3 transactions and limit our design to one that would support a no-memory RDL. We discarded this idea in favour of a conversational-mode approach after we installed the W3 server and had a chance to test it out and read some CERN documentation. It soon became apparent that RDL without a previous transaction history would be too cumbersome and clumsy, and that the user would notice a distinct lack of functionality when compared to the dumb terminal version currently available on the central unix time-sharing computing service.

We wanted the user to be able to proceed from reading documentation about RDL-CANSIM to a hyperlink that started a CANSIM extraction. At this point, a connection would be made to a password server, as CANSIM is a licensed product and non-UBC patrons are not allowed access. The user would have to identify themselves and be authorized to continue. This process would not rely on the password facilities built in to the current level of the CERN server, because we in the Data Library do not control the password file for the Library as a whole. Password administration must be centralized to the campus, as this is a huge job requiring many staff members, so it is looked after by the central Library authority. It was felt that every bona fide member of the campus would have a student/staff/faculty library card, or could apply for one if they wished access to CANSIM.
At the time of writing, our querying of the Library's patron file has not been done, although the database and its password server do exist. Once into the RDL-CANSIM extraction portion of the system, the user would be presented with a series of pre-designed forms that would urge him/her to supply the information necessary to execute a search or an extraction. The user can get multiple results from multiple queries, and refine their searches for information based on this previous information. Any files that they create from successful extractions can be downloaded using the built-in features of Mosaic. Patrons would not have to "sign off" such a system when they are done, as the nature of the W3 allows them to go somewhere "else" at any time anyway; a sign-off procedure would be pointless, as it would be missed more often than used. In order to stop all of the processes involved (described below) and free up unused file space, a cron job will execute every few minutes to kill unused processes and free up file space. Anyone "coming back" to their RDL-CANSIM session after the allowed time limit had expired would find all of their previous work destroyed, and they would have to start again.

The nuts and bolts

The system design is now like this: for each Mosaic client session, after bringing up the RDL-MOSAIC page, the user can choose to extract data from the selected database (CANSIM is one of many databases that RDL can serve up). This action will send our W3 server a command to begin the RDL startup cgi-script. This is the 'parent' script. Normally, such a parent script would call one or more programs to perform some work, write their results to stdout, then exit. Before the W3 server can send data back to the Mosaic client, though, it must make sure that two conditions are met: the cgi-script the W3 server started must have exited, and stdout and stderr for this cgi-script must be closed.
The stdout and stderr are normally closed automatically by the operating system when the cgi-script exits, which is invisible to the programmer. This distinction is important, as we note in the next paragraph. After transmission from the W3 server to the Mosaic client is complete, the Mosaic-W3 connection is severed. The reliance of the W3 server on the closing of stdout and stderr, as well as on the exiting of the cgi-script, is what normally forces Mosaic-W3 transactions to be one-shot.

In our case, the parent process forks a child process. This child is passed the same environment variables that the parent inherited from the W3 server, which contain the initial RDL startup commands written into the RDL-CANSIM page. As well, the two processes share the same file descriptors for stdout, stdin, and stderr. Any time after this, the parent can exit, because the child will continue to execute. The parent process terminates so that the W3 server is convinced that the cgi-script it started has exited. Despite the fact that the cgi-script the W3 server started has terminated, the child process we started owns the same file descriptors for stdin, stdout, and stderr as the terminated parent process, so these files are not actually closed by the unix kernel. Therefore both conditions have not yet been met for the W3 server to be able to pass any data on stdout back to the Mosaic client; the W3 server must wait until this event occurs. When the child closes stdout and stderr, the W3 server will be satisfied, and any data on stdout will be written back to the Mosaic client.

Alone in the world now, the child process executes a program that defines a socket for incoming commands, and stores the address of the socket in a file in the operating system's filesystem, in a known, fixed location. The name of the file containing the socket address is the process ID (pid) of the child process. Pids are unique, so we are guaranteed a unique file name.
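The descriptor-inheritance trick above can be demonstrated in miniature. In this sketch a pipe stands in for the cgi-script's stdout and the reader plays the W3 server; the function name is ours, not the project's. The key point is that the reader sees no end-of-file when the "parent" gives up its copy of the descriptor, only when the forked child closes its inherited copy:

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/* The parent "cgi-script" forks and immediately drops its copy of the
 * write end, but the child still holds an inherited copy, so the reader
 * (our stand-in for the W3 server) blocks until the child closes it. */
ssize_t run_detached_child(char *buf, size_t len)
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == 0) {                  /* child: inherits fds[1] */
        close(fds[0]);
        write(fds[1], "FORM", 4);    /* later: the HTML form */
        close(fds[1]);               /* only now does the reader see EOF */
        _exit(0);
    }

    /* parent closes its copy and exits its cgi role at once */
    close(fds[1]);

    ssize_t total = 0, n;
    while ((n = read(fds[0], buf + total, len - total)) > 0)
        total += n;                  /* blocks until the child closes */
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return total;
}
```

In the real system the inherited descriptor is the W3 server's stdout rather than a local pipe, but the kernel behaviour being exploited is the same.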
It might be cleaner, and lead to faster program execution, to store the address and any associated information in a C data structure instead, but this was thought to be a little more work, so we decided to implement it at another time.

Another child, a grandchild to the first process, is now created to handle the execution of the RDL client software. This RDL grandchild process closes its stdout, stderr, and stdin after a pipe (two files that act as buffers) is set up for interprocess communication (ipc). With this pipe properly set up, the grandchild can accept commands continuously from its parent (the child), and send the child results when it has any to pass on. The initial RDL commands, which begin the RDL program, are piped through to the RDL grandchild from the child. After receiving confirmation of RDL's startup from the grandchild, the child writes a form on its stdout, which contains the child's own pid embedded in the URL of the form. To pass the form on, the child closes its stdout and stderr, and the W3 server reacts at last by sending the Mosaic client the form. The child process' program then enters an endless loop in which it listens on its socket for incoming calls from the external world.

When the Mosaic user submits the returned form with its embedded pid, plus any additional RDL commands made available by this form, the W3 server starts up the cgi-script named in this form. This script is different from the first one, and it in turn executes a different program. This program will handle the processing of all RDL commands other than the initial program startup. It strips the embedded pid from the URL it receives and tries to open a file with that name. If it is successful, it will open the file and read the socket address. It will connect, if it can, to that socket address and, if successful, will pass the new parsed RDL commands to the waiting child. The child sends them on to the grandchild via the ipc link, which in turn returns the results to the child when it is done.
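The pid-file rendezvous described above can be sketched as a pair of functions: the child records its socket address in a file named after its own pid, and the later cgi process looks the address up by the pid it stripped from the URL. The directory, file-name prefix, function names, and address format here are all illustrative assumptions, not the project's actual layout:

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>

/* Child side: publish this session's socket address under a
 * pid-named file in a known, fixed location. */
int publish_address(pid_t pid, const char *addr)
{
    char path[64];
    snprintf(path, sizeof path, "/tmp/rdl-%ld", (long)pid);
    FILE *f = fopen(path, "w");
    if (!f) return -1;
    fprintf(f, "%s\n", addr);
    return fclose(f);
}

/* Follow-up cgi side: given the pid embedded in the submitted form's
 * URL, recover the socket address to connect to.  A missing file means
 * the session has expired (or the pid is bogus). */
int lookup_address(pid_t pid, char *addr, size_t len)
{
    char path[64];
    snprintf(path, sizeof path, "/tmp/rdl-%ld", (long)pid);
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    if (!fgets(addr, (int)len, f)) { fclose(f); return -1; }
    addr[strcspn(addr, "\n")] = '\0';
    return fclose(f);
}
```

The cron cleanup job works against the same files, using each file's last-access time to decide which sessions to destroy.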
The child then sends the results back to the waiting connected process, which writes them inside another form to its stdout. It then exits. The W3 server is satisfied that the cgi-script has exited, and stdout and stderr have been closed for this process, so it sends the waiting form back to the Mosaic client. The Mosaic client displays the resulting form. The form that the calling process sends back contains the same pid (of the child), so that the Mosaic client can pass it back again on the next request. The result of this is that the child and grandchild are kept alive for as long as we wish, able to accept commands from other processes started by the same W3 server. Uniqueness is assured using the pid.

We kill off the child and the grandchild after a predetermined time, when we decide that the Mosaic user hasn't processed a transaction within a specified amount of time. On unix, this is provided for by the cron facility. The job we will create here periodically reads each of the pid-named files that remain in the specified filesystem area, and compares the time stamp of when each file was last 'visited' to the current system time. If a file has not been accessed within the allotted maximum time frame, then that file is read to determine the pids of the child and grandchild for this RDL-CANSIM-MOSAIC session. They are then killed, and the pid file is deleted. The user is effectively 'signed off' the RDL-CANSIM-MOSAIC system. As we have not built this part of the system yet, we are at this moment unsure precisely what information we will need to store in the pid-named files.

Conclusion - or - Do we STILL have MORE work to do?

Since we needed to send forms back to the user from within some of our programs, we chose to hard-code the forms. It should be possible to generalize this solution so that generic forms, derived by program logic and pre-stored in the machine's filesystem, can be sent back to the user.
This feature could make our code solution much more straightforward for others to implement at their sites, and much more flexible besides, making distribution and installation of the code easier. We may want to re-implement the portion of the system that deals with the filesystem for storing pids; we may simply wish to store pid aging information in a C data structure to improve performance and simplify the cron job for killing old processes. We still need to tackle the problem of user authentication. The work on the password server side has been done (for another project), but we haven't hooked into it yet for our purposes. Lastly, we have to make available the cron job that will kill off all of the expired processes and free up the file space used.

With our solution, it may no longer be necessary to totally discard many of our older programs. If additional software can be built on top of the older applications that will allow them to run on the same platform that also runs the W3 server, then many useful applications can be saved at least partially, and the development costs of new systems may be contained. Being accessible from the W3 frees the Data Library from any need for central time-sharing services; the onus is placed on departments to network machines and to provide local support for staff, faculty, and students. This course of action, using the W3 to interface to a program, makes high demands in education and experimentation time. The education required to build our software was costly in terms of time: many new concepts were learned by us for this project, and not a few blind alleys were followed until they proved unprofitable. What we have gained, though, is, we think, very well worth the effort.
We have gained an understanding of a system that will allow for the eventual integration of most of the data files that we wish to offer on-line to our users, plus an unprecedented opportunity for the integration of many other sources of information, such as the Library's online public access catalogue.

Authors

Brian Kroeker - Obtained his B.Sc. in 1983 from the Department of Computer Science at SFU. He is currently employed as a Programmer/Analyst for the UBC Data Library. Mr. Kroeker provided initial and ongoing program design support, as well as project implementation details and project management. Mr. Kroeker may be reached at: kroeker@datalib.ubc.ca

His publications include: The World Wide Web: Data Library strategies for dealing with the new information frontier. Presented at the IASSIST 94 Conference in San Francisco, USA, May-June 1994. To be published in the IASSIST Quarterly, a newsletter of the International Association for Social Science Information Service and Technology. ISSN - United States: 0739-1137

Hilde Colenbrander - Obtained her B.A. in 1973 from the University of Natal, South Africa, and her B.L.S. (Hon.) from the University of Pretoria (1984), South Africa. She is currently engaged as Head, UBC Data Library, and Electronic Information Services Coordinator for the UBC Library. Ms. Colenbrander provided initial project design and sought out funding for both a student programmer and a machine to run our programs on. She is an ardent supporter of increasing the public's access to computerized information via campus, national, and international networks.

Henry Fan - Currently enrolled in the Honours Computing Science program in the Department of Computing Science at UBC (4th year). Mr. Fan joined the UBC Data Library under the summer student program, and worked on most of the coding and program development for our W3 project.

George Phillips - Obtained his B.Sc. in 1987 and his M.Sc. in 1992 from the Department of Computer Science at UBC.
He is currently employed as a Systems Manager for Imager/GraFic, IAM and CICSR. Mr. Phillips is a leading proponent of the development of W3 servers on the UBC campus. He has written his own W3 server, and has consulted widely on cgi and the W3. Mr. Phillips assisted the Data Library with his expertise on the W3, and provided some technical programming solutions when the going got tough.

His publications include:

"The Ship Model - A New Computational Model for Distributed Systems". Presented at TENCON '90 on Computer and Communication Systems, Hong Kong, Sept. 1990, pp. 134-140. (IEEE)

"The Ship Model for Communication and Mobility in Distributed Systems". To appear in the First International Workshop on Services in Distributed and Networked Environments (SDNE '94), Prague, June 1994. (IEEE)