Francisco M. De La Vega1 and Imelda Saldaña2
1 Computer Unit of the Department of
Genetics and Molecular Biology
2 Biological Area Library,
CINVESTAV-IPN. A.P. 14-740, México DF 07000, México.
One of the most neglected information resources in higher education institutions is also the most expensive to produce: the doctoral and master's theses that are the main product of postgraduate students during their stay at the institution. Libraries in degree-granting institutions hold large numbers of printed theses that are available only directly in the library or through inter-library loan. While some of the information they contain is eventually published as scientific papers or books, many details and unpublished results remain locked in these inaccessible reservoirs. This report presents the approach we are taking at the Biological Area Library of CINVESTAV (Mexico City) to provide Web access to the full text of our doctoral and master's theses, in principle available to all Internet users, thereby unlocking these resources. Students provide their thesis manuscripts to the library directly in electronic (word processor) format. These files are then converted to RTF and imported into a Hyper-G server through its specialized clients. Once the access rights of the documents have been decided, they become available through the Hyper-G WWW gateway and are searchable in full text. Images are scanned and imported into the server, and library personnel interactively create links to them from the text document, working from a printed copy of the thesis. The system is currently at the pilot stage, but the project envisions direct import of the theses by the students into personal workspaces within the Hyper-G server. Access control and annotations can provide the basis for an electronic review of the manuscript by the advisory committee prior to the thesis defense and, later, for a gradual release of the contents to the international community. This would involve a change in the traditional philosophy of thesis review, one that would make thesis production faster and easier.
The availability of electronic versions of the theses will certainly increase their circulation compared with that of the printed theses, which barely reaches 0.48 per thesis per year. Finally, we believe a digital library would be an important resource for other Latin American institutions where similar research is being carried out and whose members want to access information in their native language.
Theses are the immediate result of students' work during postgraduate studies. Various parts of a Master's (MSc) or Doctoral (PhD) thesis or dissertation are usually published formally in specialized journals or books. However, because of space limits and other reasons, only a selection of the content is published, leaving the original printed thesis as the only source for the full details of the work accomplished. Unfortunately, theses are usually locked resources, available only through a direct visit to the library or through inter-library loan. Furthermore, most of the time the very existence of a thesis remains hidden, especially in Latin America, since the only electronic resource for locating theses is UMI's Dissertation Abstracts database, which covers theses produced in some US universities [0]. This resource, while valuable, is partial and provides only the abstracts of the theses. Once a specific thesis has been located, the only ways to obtain it are inter-library loan (practically nonexistent across national borders) or paying UMI a fee for a printed copy of the material they hold.
CINVESTAV (an acronym for Center of Research and Advanced Studies) is a research center sponsored by the Mexican government's Public Education Ministry (SEP) and formally part of the National Polytechnic Institute (IPN) [1]. CINVESTAV's main goals are to perform basic and applied research in most areas of knowledge and to provide high-level (MSc and PhD) postgraduate education. Each year CINVESTAV produces about 150 MSc and PhD theses, each the result of 3-6 years of a student's full-time dedication. The Biological Area Library (BAB) is a specialized library that supports the work of the Biological Area Academic Departments of CINVESTAV (Biochemistry, Cell Biology, Experimental Pathology, Genetics and Molecular Biology, Pharmacology, and Physiology). Its holdings of more than 60,000 volumes include about 500 theses, produced at a rate of about 50 per year [2]. The theses, which are written entirely in Spanish, are an important resource for local students, who search them for detailed methodological procedures, unpublished results, and literature reviews. Although theses are among the few direct original publications of the institution, their circulation at BAB is limited to 0.48 per thesis per year. A project to increase access to this resource, not only for the local community but also for the whole Latin American and international community, is part of a major effort under way at BAB since 1995, whose goal is to automate the library's processes and to provide on-line services through the Internet and the WWW.
The BAB Library Automation Project includes the automation of circulation, purchasing, and cataloging of the library's holdings through the deployment of Ameritech's Horizon Library Automation software package, and the implementation of a Digital Library that should provide access to the information assets of the library through the Internet, as well as an organized collection of pointers to other related Internet resources of interest to local patrons. Currently, we provide on-line (WWW) access to several bibliographic databases (MEDLINE and Life Sciences Collection) that are acquired in CD-ROM format. This information is downloaded to a local server's hard disk to provide fast, concurrent network access for up to 10 users, taking advantage of Silver Platter's Electronic Reference Library (ERL) technology [3]. Due to licensing conditions, this access is restricted to CINVESTAV's patrons; it is now one of the most requested services of the library. Recently, we started the implementation of the Digital Library (DL) to provide network access to electronic documents, newsletters, bibliography lists, course notes, electronic proceedings, and full-text electronic versions of the MSc and PhD theses produced by the Biological Area Departments.
There is currently tremendous interest in digital libraries. Several major projects have been funded in the US to provide access to digital versions of books, journals, and other multimedia collections. The University of Illinois' Digital Library Initiative is one such project [4]. It is a large-scale testbed that concentrates on digitizing scientific literature, in particular science and engineering journals, through agreements with major publishers, which provide the digital input directly. Another major initiative is the Stanford Digital Library Project [5]. These projects are funded by NASA/DARPA/NSF and are aimed at the deployment of experimental Digital Libraries under the NII project of the US [4]. Both libraries require specialized SGML clients for access rather than the usual Web browser and, due to copyright restrictions, their content can be fully accessed only by authorized patrons. This renders these major projects irrelevant to the international academic community, at least for the moment.
Another very interesting digital library project is the one started by Virginia Tech, specifically aimed at the development of a National Digital Library of Theses and Dissertations (NDLTD) [6]. This project, which is currently in its pilot phase, allows access to theses produced at Virginia Tech and, in the near future, at other southeastern universities. Theses are imported into the system directly by the students, who in the process become digitally literate. Currently, these theses can be accessed in PDF format [7], though the project is committed to the use of the SGML format as well. Thus, while there is Web access to these holdings, the actual rendering of the documents is accomplished through the Adobe Acrobat Reader or, in the future, an SGML reader.
Finally, a couple of Digital Libraries whose holdings include theses and dissertations are built on the so-called second-generation hypermedia system Hyper-G (renamed HyperWave in its commercial version) [8]. One example is the Digital Library of the Institute for Information Processing and Computer Supported New Media (IICM) of the Graz University of Technology [9], in Austria, where Hyper-G was created, mainly by Hermann Maurer and Frank Kappe. In this library it is possible to access the abstracts of recent theses produced at the Institute and the University. Presumably the full text of the theses exists on the server, but anonymous access to it is restricted through the fine-grained access control that Hyper-G makes possible. DogitaLS1 is another digital library implemented on Hyper-G, at Dortmund University in Germany [10]. To our knowledge, Digital Libraries comprising electronic theses are completely absent in Latin America. This is very unfortunate, since graduate students in Latin America still have difficulty accessing literature written in English, and resources available in their native languages (mainly Spanish) would therefore be very valuable.
The goals of the Electronic Theses and Dissertations section of BAB Digital Library are:
We selected the Hyper-G information system to build the Digital Library upon, for several reasons:
The Hyper-G server software was installed on the BAB main server (a Sun SPARCstation workstation equipped with 64 Mbytes of RAM and 14 Gbytes of hard disk), and the Web gateway (the "Wavemaster") was configured to coexist with the other Web servers and services supported by the system. Access to the server is provided through the library WWW home page (http://www.bab.cinvestav.mx) [2], through Hyper-G's Wavemaster (http://hyperg.bab.cinvestav.mx:8000), or directly using the Hyper-G client/server protocol (hyperg://hyperg.bab.cinvestav.mx). Administration of Hyper-G is carried out either from the special UNIX client, Harmony, or from the librarians' PC/Windows 95 workstations using the Hyper-G Windows client, Amadeus [8]. Students were asked to deliver their complete manuscript files on floppy disks, in whatever word-processing software they had used to produce them (commonly Word or WordPerfect), together with the printed copy of the thesis that they must deliver to the library under current regulations. The word processor files are converted to RTF and then uploaded to the Hyper-G server through Amadeus (see Fig. 0). Amadeus can filter text, HTML, and RTF files to HIF, the internal Hyper-G format [8].
Images incorporated by students into their theses are normally available only as photographic prints or drawings, not in digital format. Library personnel therefore scan the images directly from the printed version of the thesis and convert them to either GIF or JPEG format to save space. After the images are uploaded to the same Hyper-G collection where the text document resides (one collection per thesis), library personnel use Amadeus to create inline links interactively from the text of the thesis to the corresponding images. Students are asked to supply an abstract of the thesis in English and in Spanish, as separate documents, together with the keywords needed to construct the meta-information attributes. The abstract documents are placed in a cluster together with the thesis and assigned the appropriate language attributes. This allows the collections to be browsed with a preferred language setting: choosing English shows the original thesis together with the English abstract, while choosing Spanish (the default language) shows both documents in Spanish. Thesis collections are organized hierarchically (by MSc and PhD, and by Department) and in alphabetical order of the authors' last names, the customary classification at BAB. Technically, Hyper-G makes it possible to construct different views of the collections, according to authors' names, Departments, thesis titles, etc., without actually duplicating the files in the database (Fig. 1). Finally, a link to each thesis is inserted in a welcome page to ease access to the collections, though users are able to browse the DL content in their own way. The "title" attribute of Hyper-G objects is very important for searching, so we included the name of the author, the title, and the year of publication in this field (Fig. 2; cf. [10]).
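The organization described above can be illustrated with a minimal sketch. It does not use any Hyper-G interface: the `Thesis` record, its field names, the exact layout of the combined title string, and the sample entries are all hypothetical placeholders. It only shows how a single "title" string can carry author, title, and year, and how several views can point at the same records without copying them.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Thesis:
    """Hypothetical catalogue record; field names are illustrative only."""
    author_last: str
    author_first: str
    title: str
    year: int
    degree: str       # "MSc" or "PhD"
    department: str


def title_attribute(t: Thesis) -> str:
    """Combine author, title and year into one searchable string (assumed layout)."""
    return f"{t.author_last}, {t.author_first}. {t.title} ({t.year})"


def build_view(theses, key):
    """Group the same record objects under a chosen key, ordered by author last name."""
    view = defaultdict(list)
    for t in sorted(theses, key=lambda t: t.author_last.lower()):
        view[key(t)].append(t)  # appends a reference to the record, not a copy
    return dict(view)


# Placeholder entries, invented for the example.
theses = [
    Thesis("Ramirez", "Ana", "Example thesis A", 1996, "PhD", "Physiology"),
    Thesis("Lopez", "Juan", "Example thesis B", 1997, "MSc", "Pharmacology"),
    Thesis("Alvarez", "Luz", "Example thesis C", 1997, "MSc", "Physiology"),
]
by_department = build_view(theses, key=lambda t: t.department)
by_degree = build_view(theses, key=lambda t: t.degree)
```

Both views hold references to the same three records, which mirrors the idea of offering several browsing hierarchies over one stored copy of each thesis.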
A project exists to recover old theses in digital format and include them in the DL. This would involve scanning the printed text and converting it to electronic form by OCR, some editing in a word processor, and then exporting the text as RTF for import with Amadeus. Since the number of printed theses currently exceeds 500, we estimate that this would take about 10 months. Images would be scanned and uploaded as described above.
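The estimate above implies a processing rate that is easy to make explicit. The short calculation below is only a sketch: the figure of 21 working days per month is our assumption for illustration, and nothing here reflects actual staffing or measured throughput.

```python
def required_rate(backlog: int, months: float, working_days_per_month: int = 21):
    """Return (theses per month, theses per working day) needed to clear a backlog."""
    per_month = backlog / months
    per_day = per_month / working_days_per_month  # 21 days/month is an assumption
    return per_month, per_day


per_month, per_day = required_rate(backlog=500, months=10)
print(f"{per_month:.0f} theses/month, about {per_day:.1f} per working day")
```

Under these assumptions, clearing roughly 500 printed theses in 10 months means converting about 50 theses per month.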
One of the main advantages of using Hyper-G as a document database is that any uploaded document is immediately available for full-text keyword searching through Hyper-G's integrated search engine. Thus, besides the convenience of reaching the theses from the desktop without having to visit the library, a major benefit is the ability to search the entire collection for specific keywords or phrases. While many interesting features of Hyper-G can be appreciated only through its specialized clients, Harmony and Amadeus, most users will access the DL only through their preferred Web browser (Netscape Navigator is the most used browser at CINVESTAV) using the Wavemaster (Fig. 3). The new release of Hyper-G (version 2.0) includes an improved Wavemaster that allows documents to be uploaded and annotated after appropriate authentication [11]. Authentication also makes it possible to grant differential access to the DL content. This is very useful when a thesis includes "hot" results that are in the process of being published in scholarly journals, or material under patent application. Anonymous browsing can then be restricted for some time and, when the concerns about world-wide distribution of the thesis disappear, the thesis can be released to the general public. The release can be accomplished by modifying the access rights directly or by setting the access rights to change on specific dates [8]. Students are required to fill out a form in which they specify when and how access is to be provided.
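The date-based release described above can be sketched as a simple rule. This is not Hyper-G's access-rights syntax, which we do not reproduce here; `ReleaseForm`, its fields, the identifier, and the dates are hypothetical and stand in for the information a student would give on the form.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ReleaseForm:
    """Hypothetical record of what a student states on the release form."""
    thesis_id: str
    public_from: Optional[date]  # None means no public release date has been set


def anonymous_may_read(form: ReleaseForm, today: date) -> bool:
    """Anonymous readers get access only on or after the chosen release date."""
    if form.public_from is None:
        return False
    return today >= form.public_from


# Placeholder values, invented for the example.
form = ReleaseForm("thesis-001", public_from=date(1998, 1, 1))
before = anonymous_may_read(form, date(1997, 6, 1))  # still embargoed
after = anonymous_may_read(form, date(1998, 1, 1))   # released on the chosen date
```

Keeping the release date as data, rather than relying on someone to change rights by hand, is what allows a thesis to open automatically once publication or patent concerns have passed.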
One problem with the current database is the structural heterogeneity of the main texts of the theses. Although the RTF format can preserve some formatting and structuring information when the theses are filtered to HIF, most students do not use style sheets and tags consistently and properly in their word processor documents. As a result, some theses lack formatting and structuring information altogether, while others use a variable set of formats. The problem is exacerbated by the fact that CINVESTAV lacks a style guide and word processor templates for writing theses. At this point we are more concerned with having the theses on-line; the formatting problem will be addressed later (Fig. 4). Students do not normally deliver keywords and other meta-information when they deliver printed theses, so the implementation of the DL now requires them to be more conscious of the new role their theses may have if published electronically. Many of the figures included in theses from the biological area specialties are photographs of experimental specimens or instruments and thus are not in digital format. Most drawings, flow charts, and diagrams could be produced digitally, but students need more infrastructure and training for that. Furthermore, the Biological Area Departments at CINVESTAV are acquiring modern instruments that produce digital output instead of the traditional paper plots or photographs. Students nevertheless need to be more conscious of the advantages of using digital output from the start to the end of data analysis and figure production.
CINVESTAV possesses a fast, fiber-optic, ATM-based WAN, and most students have access to networked personal computers in their laboratories or departmental computer rooms. Faculty almost invariably have desktops linked to the network with a Web browser running. Therefore, any development that takes advantage of this infrastructure and provides added value is generally much appreciated by CINVESTAV's members (of course, there are always those who prefer the "old ways"). Web access to the bibliographic databases started at BAB in July 1996 and had a tremendous impact on their use (about 10 times the use recorded from the Library's in-house terminals) and on patron satisfaction. Even though the BAB DL is in its pilot stage, many faculty members have expressed enormous interest in the project. The project is raising many issues that the academic community has ignored for years, such as copyright (in Mexico copyright is granted only if a manuscript is officially registered, and theses or dissertations almost never are), plagiarism, and unfair competition by other parties granted access to unpublished material. While these are well-recognized dilemmas of the new information age, they will be a new experience for this community.
At present, students are asked to deliver their theses in digital format through informal procedures. If the goals of this project are to be fulfilled, it will be necessary to implement formal procedures and regulations that enforce contribution to the DL. A great deal of lobbying and discussion will be needed to accomplish this. A thesis style sheet and word processor templates also need to be developed. In addition, workshops in which students are trained in digital production would be necessary to ensure a high-quality database and faster processing. The experience gained in this project can be passed on to other libraries within CINVESTAV and to nearby institutions.
Direct student submission to the Hyper-G server would be desirable and is possible. Hyper-G can accommodate personal workspaces accessible through authentication. Once a user is given write privileges in a specific collection, he or she can upload documents and set access privileges. This leads to a very interesting proposal: the normal review of the thesis by the student's advisory committee can be carried out digitally over the network. A student can upload a draft version of the thesis and allow the advisors to access it (under password access control). Through Hyper-G's annotation mechanisms, the faculty advisors can correct the document or suggest changes. The student can review these annotations and in turn produce a new version. Hyper-G's versioning tools can support this activity. The final version of the thesis manuscript can be made accessible to, say, the whole Department or institution on the date of the thesis defense (controlled by IP number). After some time, depending on the student's concerns regarding the thesis content, access can be granted to the whole country or world-wide (Fig. 5). Remote access through the Internet to early drafts of a thesis can be very useful, since a student's advisory committee is normally required to include external advisors, who would benefit from electronic delivery. It is also possible to monitor (with the help of the administrator) when and from where a particular thesis was accessed.
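The staged access proposed above (committee first, then the institution by IP number, then everyone) can be sketched as follows. This is an illustration under stated assumptions, not Hyper-G's own mechanism: the `Stage` names, the `may_read` function, the user name, and the network range are hypothetical, and the range shown is a reserved documentation block standing in for an institutional network. The country-wide stage is omitted for brevity.

```python
import ipaddress
from enum import Enum


class Stage(Enum):
    """Hypothetical release stages for one thesis."""
    COMMITTEE = 1    # drafts: advisory committee only
    INSTITUTION = 2  # after the defense: institutional network
    WORLD = 3        # final release: everyone


# Placeholder network (reserved documentation range), not a real campus range.
INSTITUTION_NET = ipaddress.ip_network("192.0.2.0/24")


def may_read(stage, client_ip, user=None, committee=frozenset()):
    """Decide whether a request may read the thesis at the given stage."""
    if stage is Stage.WORLD:
        return True
    if user is not None and user in committee:
        return True  # authenticated committee members may read at every stage
    if stage is Stage.INSTITUTION:
        return ipaddress.ip_address(client_ip) in INSTITUTION_NET
    return False


committee = frozenset({"advisor1"})  # hypothetical authenticated user name
external_advisor_ok = may_read(Stage.COMMITTEE, "203.0.113.5", "advisor1", committee)
campus_reader_ok = may_read(Stage.INSTITUTION, "192.0.2.10")
```

Checking the authenticated user before the network address is what lets an external advisor read a draft from outside the institution, which is the case the paragraph above highlights.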
Advanced use of the Hyper-G information system and, in general, of a digital infrastructure to support electronic publishing requires significant education of the users, in this case librarians and patrons. While Hyper-G is sophisticated software, the simplicity of its clients helps in deploying the technology. It is very important that the main operators of the system will be librarians, most of them without specific training in high-tech systems. They need something very simple to use with minimum training. We believe that a key factor for general acceptance of a DL is the possibility of access with a universal Web browser. While many "digital librarians" debate the necessity and benefits of using SGML structured documents in DLs, ease of access and implementation will determine the success of a system. Currently, producing SGML documents involves using commercial editors and clients, which greatly restricts widespread adoption. Many digital libraries also provide support for PDF files. Although in Hyper-G it is possible to provide several versions of the same document (e.g. HIF, PostScript, PDF), we do not feel compelled to provide PDF versions of our theses. The production of PDF files requires special commercial software to be installed on the students' computers. PDF files are large and offer little benefit over printed versions of the theses. We found Hyper-G to be a mature and simple solution, with a very low cost, at least for academic institutions.
Finally, the whole philosophy of electronic publishing and the daily use of modern networked tools need to be cultivated. A commitment by the top officers of an institution is required if a project is to pass from its pilot stage into full production. Acceptance by the local community is also key to the sustained development and maintenance of a Digital Library. We believe the time is ripe in Mexico and Latin America for these efforts, and that the large advantages to be obtained from a DL far outweigh the inherent difficulties of its deployment.
We would like to acknowledge the continuous support of and interest in this project shown by Dr. Adolfo Martínez Palomo, CINVESTAV's General Director. The original Project for the Automation of the Biological Area Library was organized and supported by the Biological Area Library Academic Committee. The system administration support from Alberto Martínez Díaz and José Luis Enriquez is greatly appreciated. We wish to thank the Hyper-G developers for creating this software system and for providing free licenses to CINVESTAV through their commercial enterprise, HyperWave Inc. The provision of free licenses of Netscape Suitespot products by Netscape Communications Corp. under the Edu-Drive program is also acknowledged. We are grateful for the generous funding provided for this project by the National Council of Science and Technology of Mexico (CONACYT) under project code F511-N9306, administered by Dr. Esther Orozco. FV wishes to thank the General Direction for the travel support received to attend the WebNet'96 Conference, where he discovered Hyper-G.
[0] UMI Dissertation Abstracts WWW Page: http://www.umi.com/
[1] CINVESTAV WWW Home Page: http://www.gene.cinvestav.mx/ciea.html
[2] Biological Area Library WWW Home Page: http://www.bab.cinvestav.mx
[3] Silver Platter, Inc. WWW Home Page: http://www.silverplatter.com
[4] Schatz, B.R. Information Retrieval in Digital Libraries: Bringing search to the Net. Science Vol 275(5298):327-334, 1997.
[5] Stanford Digital Library Project WWW Home Page: http://www-diglib.stanford.edu/diglib
[6] Fox, E., Eaton, J.L., McMillan, G., Kipp, N.A., Weiss, L., Arce, E. and Guyer, S. National Digital Library of Theses and Dissertations: A Scalable and Sustainable Approach to Unlock University Resources. D-Lib Magazine, September 1996.
[7] ETD WWW Home Page: http://etd.vt.edu/etd/
[8] Maurer, H. Hyper-G. The Next Generation Web Solution. Addison Wesley, 1996. http://www.hyperwave.com/hgbook
[9] IICM Electronic Library WWW Home Page: http://www.iicm.edu/electronic-library
[10] Tochtermann, K. and Alders, T. DogitaLS1. A digital library system based on Hyper-G. D-Lib Magazine, October 1996.
[11] Kappe, F. New Features of HyperWave 2.0: http://www.hyperwave.com/Hyperwave2.0-features
Francisco M. De La Vega is Assistant Professor and Head of the Computer Unit of the Department of Genetics and Molecular Biology of CINVESTAV-IPN. His research is centered on Bioinformatics, Distance Education in BioComputing, and Internet access to scientific resources. He can be reached at fvega@gene.cinvestav.mx.
Imelda Saldaña is Chief of the Biological Area Library of CINVESTAV-IPN. She is in charge of the BAB DL facilities and of the deployment of the Library Automation Project of BAB. She can be reached at imeldas@gene.cinvestav.mx.