The University of Michigan Data System

David Barber

Coordinator of Information Technology

Graduate Library

University of Michigan

ABSTRACT

To facilitate quantitative social science research, a program was developed which provides automated access, through Mosaic, to statistical datasets such as the 1990 U.S. Census of Population. HTML forms were used to provide dataset users with a menu of the variables contained within the dataset, and to allow them to choose among several forms of output. A program was written to take the information from these forms and run a SAS program which produces the desired output, whether a subset file or a table of univariate statistics.

The effect of this development is to enable dataset users to conduct statistical research without needing to understand how to write programs in the language of a statistical software package, or how to manipulate the computer tapes on which such data is typically delivered. This means that more time can be spent on statistical analysis and less on the techniques required to access statistical data. The development of this program will also make it easier to share access to datasets among a group of universities. Finally, it will provide a data management system which can feed data to even newer numerical analysis technologies such as GIS and visualization.

THE PROBLEM

For many years, social scientists, planners, business people, and others have faced substantial technical problems when they have tried to obtain the numeric information they need in order to solve problems through the use of statistical analysis. The data they needed was most likely delivered to them on magnetic tape. These tapes needed to be stored in some facility and the user of the data had to learn a set of sophisticated computer commands to retrieve information from those tapes.

Once retrieved, this data had to be analyzed using a statistical software package. Most of these packages required that a program be written, in the language of that package, to extract the needed portion of the dataset. Once the data was extracted, analyzing it required yet another program to be written. These steps were required even to obtain a small amount of data. Students who needed to learn how to conduct statistical research also had to learn all of the tape handling and statistical programming techniques that such research required.

CRITERIA FOR A SOLUTION

It was clear that these tasks had to be made easier. Everyone who worked with numeric data needed easier access to it. It was ridiculous that the same dataset was loaded into the file format of the same statistical packages over and over again. More time needed to be available for mastering a growing body of statistical techniques, and less spent on routine data management functions.

It was also clear that any solution needed to meet certain criteria. The data management system needed to be network accessible. It had to be accessible not only onsite but also to researchers at home. Further, the system had to have the potential for even broader distributed access. Given the emphasis on inter-institutional cooperation among universities, the system had to fit into the computing environment at other schools. Finally, the system needed to have a graphical user interface which could link data with documentation or with help screens, and which could give a list of the contents of a dataset.

THE SOLUTION

Researchers, students, and staff at the University of Michigan were among those who faced these problems. They too needed a data system which did not force them to learn data handling. The Digital Library Program, a cooperative initiative of the University of Michigan School of Information and Library Studies, Information Technology Division, and the University Library, took on the responsibility for finding a solution to this problem.

Part of the solution was obvious: data should be stored on magnetic disks and presented to the system user in a form which could easily be used for further analysis. It was not immediately obvious, however, how to do this. There were several alternatives. A number of universities had developed data systems which managed data for researchers. None of these quite fit in with the University of Michigan computing environment. In addition, in early fall 1993, when these issues were being considered, the impact of the Internet was becoming clear, and it was essential that data be deliverable through the Internet.

To deliver data on the Internet, there were two alternatives: use Gopher, or use Mosaic. The first alternative was quickly thrown out. Development of a user interface with Gopher seemed problematic. Gopher also did not provide a mechanism whereby dataset documentation could be tied into the data system interface. It was clearly no good to be told you can pick variables V606 or V607, if you could not determine that these represented answers to survey questions about income and race.

Mosaic appeared to have solutions to these problems. It was already growing in popularity. It certainly could handle links between pieces of text. HTML forms had also appeared for the X Window System WWW clients. These forms could be used to get the input needed to drive a program which would do the necessary data management tasks.

These important features made Mosaic one key part of the development of a data system. It had a graphical user interface which could provide a list of datasets and their contents, as well as links to documentation. Given the extensive programming time it can take to develop a graphical client, deciding on Mosaic was a significant part of the overall solution. This decision to use Mosaic meant that it was not necessary to develop clients which would run on different operating systems, or to maintain and modify that code as those operating systems changed.

The remaining problem was to determine what program should be used to handle the data management tasks. SAS was quickly identified as the solution for this problem. SAS is produced by the SAS Institute. It is an integrated set of computer programs which can handle almost any type of data management or statistical analysis. Further, there was extensive SAS expertise available at the University of Michigan. SAS is also very commonly used at universities, and datasets are often distributed with the SAS programs necessary to convert the dataset into a SAS format file. This meant that, because it relied on two common tools, Mosaic and SAS, the data system was very likely to be usable by other institutions once it was developed.

A FUNCTIONAL DESCRIPTION OF THE DATA SYSTEM

The first thing that the University of Michigan Data System allows the end user to do is to extract data from a statistical dataset, e.g. the 1990 census. The user might decide to take from the Michigan census dataset just those records which contain information about counties in Michigan, leaving the data about towns or cities behind. There are two ways to select a portion of a dataset through this system. First, records in the dataset can be selected based on the values of the fields or variables within those records, e.g. the records where INCOME > $30,000. Second, records can be selected at random when a fixed number of randomly selected records is needed. The system also allows these two extraction mechanisms to be combined. For example, if there were a dataset with tens of thousands of records for households with incomes less than $30,000, a random sample of the records meeting this condition could be selected.
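The two selection mechanisms, and their combination, can be illustrated with a minimal sketch. It is written in Python purely for illustration (the system itself uses C and SAS), and the function name, record layout, and sample values are all hypothetical.

```python
import random

def select_records(records, condition=None, sample_size=None, seed=None):
    """Return records matching `condition`, optionally cut to a random sample."""
    # Step 1: keep only the records that satisfy the condition, if one is given.
    matched = [r for r in records if condition is None or condition(r)]
    # Step 2: optionally draw a fixed-size random sample without replacement.
    if sample_size is not None and sample_size < len(matched):
        rng = random.Random(seed)
        matched = rng.sample(matched, sample_size)
    return matched

# Hypothetical household records; the field names are illustrative only.
households = [{"ID": i, "INCOME": 10_000 + 2_500 * i} for i in range(20)]

# Selection by condition alone.
low_income = select_records(households, lambda r: r["INCOME"] < 30_000)

# Condition and random selection combined: 3 of the matching records.
sample = select_records(households, lambda r: r["INCOME"] < 30_000,
                        sample_size=3, seed=1)

print(len(low_income), len(sample))  # prints "8 3"
```

Applying the condition before sampling means the requested number of records is drawn only from those that meet the condition, which is the combined behaviour described above.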

Whichever of these subsetting techniques is used, the final subset can be returned to the user in a number of ways. The data can be displayed on the user's screen and then printed or downloaded, depending on the functions available with the particular WWW client being used. The data can also be returned to the user via a link to a file, which can then be saved to disk. The University of Michigan data system can create SAS datasets, SAS Transport Files, and ASCII files, which may have either no delimiters or tab delimiters to facilitate their use with spreadsheet software.
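The difference between the two ASCII layouts can be shown with a short Python sketch. The function name, field names, column widths, and values below are hypothetical, and the fixed-width layout is only one plausible reading of a file with "no delimiters".

```python
def to_ascii(records, fields, widths=None):
    """Format records as ASCII lines.

    With `widths` omitted, values are tab-delimited. With `widths` given,
    each value is right-justified in a fixed-width column, no delimiter.
    """
    lines = []
    for rec in records:
        values = [str(rec[f]) for f in fields]
        if widths is None:
            lines.append("\t".join(values))
        else:
            lines.append("".join(v.rjust(w) for v, w in zip(values, widths)))
    return "\n".join(lines) + "\n"

# Made-up values, for illustration only.
rows = [{"COUNTY": 1, "POP": 12345}, {"COUNTY": 2, "POP": 678}]

tab_text = to_ascii(rows, ["COUNTY", "POP"])
fixed_text = to_ascii(rows, ["COUNTY", "POP"], widths=[3, 8])
```

A tab-delimited file can be opened directly by most spreadsheet software, while a file without delimiters can only be read by a program that knows the column positions.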

The data system will also allow the end user to calculate a wide variety of univariate statistics or to produce crosstabulations. These can be calculated based on all of the data in a particular dataset, or for only a subset of the data selected in one of the ways described above. It is possible to calculate standard deviations, frequencies, means, and other statistics for all or some of the records in the dataset. The result of these calculations is then returned to the WWW client.
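As a rough illustration of these calculations, the following Python sketch computes a few univariate statistics and a simple two-way crosstabulation. In the data system itself SAS does this work; the function names and sample records here are hypothetical.

```python
import statistics
from collections import Counter

def univariate(values):
    """Return a few basic univariate statistics for a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "std_dev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }

def crosstab(records, row_var, col_var):
    """Count records for each (row value, column value) combination."""
    return Counter((r[row_var], r[col_var]) for r in records)

# Made-up records, for illustration only.
people = [
    {"REGION": "N", "OWNER": "yes", "INCOME": 20000},
    {"REGION": "N", "OWNER": "no", "INCOME": 30000},
    {"REGION": "S", "OWNER": "yes", "INCOME": 40000},
    {"REGION": "S", "OWNER": "yes", "INCOME": 50000},
]

stats = univariate([p["INCOME"] for p in people])
table = crosstab(people, "REGION", "OWNER")
```

The same functions could be applied to the full list of records or to a subset selected beforehand, mirroring the choice the system offers.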

When subsets are created or statistics are calculated on the data in a dataset, it is possible to change the formatting of the data. Data can be recoded so that the many values of a variable, such as many different income levels, are grouped into a few broad categories such as 0-10,000, 10,000-20,000, etc. Data can also be either displayed as it is stored, as a series of numeric values, or it can be labelled so that

HOW THE DATA SYSTEM IS IMPLEMENTED

Software

The University of Michigan data system operates through the use of software written in C which resides with a WWW server. That software will do various things depending on the forms document which calls the program and on the selections made on that form. Since there is both a simple and an expert form for each dataset, this program responds differently depending on which form is used.

The simple form only gives a list of the variables in a dataset and allows the user to specify how they want records selected from the dataset. When a simple form is submitted, the program restates the user's choices and gives them the chance to change those choices. The form returned by the program also gives the user a choice among the various forms of output the system can create. Further, this second form allows the user to redisplay it after changes have been made to the user's selections. This provides a permanent record of those choices. Normally, forms do not record user choices, which can be a problem when a complex set of decisions needs to be remembered later.
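The step of restating a user's choices can be sketched as follows. This is a Python illustration only, not the C program described here, and the form field names `var` and `where` are assumptions made for the example.

```python
from urllib.parse import parse_qs

def restate_choices(form_body):
    """Summarize the selections in a URL-encoded form submission."""
    # parse_qs decodes the body into {field name: [values, ...]}.
    fields = parse_qs(form_body)
    variables = fields.get("var", [])          # hypothetical field name
    condition = fields.get("where", [""])[0]   # hypothetical field name
    lines = ["Variables selected: " + ", ".join(variables)]
    lines.append("Record selection: " + (condition or "all records"))
    return "\n".join(lines)

# A submission choosing two variables and one selection condition.
body = "var=INCOME&var=RACE&where=INCOME+%3E+30000"
summary = restate_choices(body)
print(summary)
```

A real second form would embed the same values in its own fields, so that resubmitting it carries the user's earlier choices forward.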

This second form is also the same as the expert form. The data system user could have gone to this kind of form directly and done the kinds of things just described. When the second, or expert, form has been submitted, the same program runs a SAS job to produce the type of output chosen by the user. The form contains the required information about the names of the variables desired, the location and name of the dataset, and the criteria to be used to select dataset records. The program starts a SAS session and directs the required commands to SAS. The form also tells the program what to do with the output from the SAS session. It will send univariate statistics back to the user's screen, or it will send back information about how to obtain permanent files that have been created.

At this point, if a user is unsatisfied, they can go back to earlier points in the process, change their choices, and run a new SAS job. For example, they may look at the univariate statistics for a number of variables, then go back and decide to create a SAS dataset containing all or some of those variables.
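The translation from form fields to SAS commands might look roughly like the sketch below. It is a Python illustration under stated assumptions, not the actual C program: the function name, parameters, libref `src`, and the example path and dataset name are all hypothetical, and the generated statements are only one way such a job could be written.

```python
def build_sas_program(libpath, dataset, variables, where=None, stats=False):
    """Return SAS source text that subsets a dataset and optionally
    requests univariate statistics for the chosen variables."""
    keep = " ".join(variables)
    lines = [
        f"libname src '{libpath}';",   # location of the stored dataset
        "data work.subset;",
        f"  set src.{dataset};",
    ]
    if where:
        lines.append(f"  where {where};")   # record selection criteria
    lines.append(f"  keep {keep};")         # variables chosen on the form
    lines.append("run;")
    if stats:
        lines += [
            "proc means data=work.subset n mean std min max;",
            f"  var {keep};",
            "run;",
        ]
    return "\n".join(lines) + "\n"

# Hypothetical form values.
program = build_sas_program("/data/census", "mi1990", ["INCOME", "RACE"],
                            where="INCOME > 30000", stats=True)
print(program)
```

A production version would need to validate the form values before placing them in the program text, since they come directly from the user.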

Hardware

The data system runs on a Sun SPARCcenter 1000. Datasets currently reside on hard drives attached to this machine. A test is also underway of the use of a distributed file system as a storage mechanism for the datasets. Some data will be stored in the campus AFS storage system to determine whether it can provide fast enough access to the data.

FUTURE PLANS

The system already provides a considerable number of functions which can be used by the numeric data user. However, there are still a few features that remain to be added. The system must be extended so that when an end user wants to use data from the system in another statistical package, it will write part of the program needed to import the subset data into that package. The system will also need to be extended to provide linkages to SAS/GIS, which will become available later this year or in early 1995. Where a dataset is organized by ZIP code, it should be possible for the end user to get back a geographic representation of their data. Finally, the system will be extended to allow output to be sent by e-mail once Kerberos authentication is available for Mac and Windows Mosaic clients.

A number of utilities also remain to be developed which can be used by the administrator of the data system. A program has already been written which will generate the forms needed for a new dataset. Another program needs to be written which will generate an HTML codebook for a dataset. Often electronic versions of codebooks are not available for datasets. Automatic codebook generation software already exists, and it needs to be modified to add HTML markup. With these utilities, any other university with a SAS dataset could quickly generate the forms and codebook needed for that dataset.
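A form-generating utility of this kind could be sketched as below. This is a hedged Python illustration, not the existing program: the function name, the `action` URL, and the field names are assumptions, and the variable names V606 and V607 are used only as examples.

```python
from html import escape

def build_dataset_form(dataset, variables, action="/cgi-bin/extract"):
    """Return an HTML form listing a dataset's variables as checkboxes.

    `variables` is a list of (name, label) pairs; `action` is a
    hypothetical URL for the program that processes the form.
    """
    lines = [
        f'<form method="POST" action="{escape(action)}">',
        f'<input type="hidden" name="dataset" value="{escape(dataset)}">',
    ]
    for name, label in variables:
        # Show the label beside the name so the codes are meaningful.
        lines.append(
            f'<input type="checkbox" name="var" value="{escape(name)}"> '
            f"{escape(name)}: {escape(label)}<br>"
        )
    lines.append('<input type="submit" value="Continue">')
    lines.append("</form>")
    return "\n".join(lines)

form = build_dataset_form("survey", [("V606", "Income"), ("V607", "Race")])
print(form)
```

Because the form is generated from a list of variable names and labels, the same list could in principle drive the codebook generator as well.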

IMPACT OF THE DATA SYSTEM

Implementation of this data system has been delayed because of problems with the server on which it resides. The system is only now starting to be loaded with the collection of datasets which will make it heavily used. However, it has already generated considerable interest from both the statistical community at the University of Michigan and the larger community of data libraries. Students, teaching assistants, survey researchers, and faculty members have all responded positively to the system. A number have expressed the desire to use the system as part of their teaching of statistics courses. All of these signs suggest that the system provides a solution to the problems which led to its development.

Data librarians from a number of other institutions have obtained the software for this system. They have done so either to try to implement it at their own site, or to use it to help them determine how to create interfaces between Mosaic and other data management systems. This means that one of the other important goals of the data system has been accomplished. When the system was begun, it was realized that successful software development for the Internet required the involvement of many institutions, each developing its own conception of the best way to manage and present numeric data through Mosaic. This would be the best way both to find out how to do this and to create the possibility for inter-institutional cooperation in the delivery of numeric data.

BIOGRAPHY

The author is director of the University of Michigan Digital Library Program's numeric data project, which has developed the system described by this paper. He is also the Coordinator of Information Technology of the University's Graduate Library. He manages the computer equipment and information technology programs of that Library.

dbarber@umich.edu