Spectrum: A Web-Based Tool for Describing Electronic Resources

Diane Vizine-Goetz
Office of Research, OCLC, Inc.
vizine@oclc.org

Jean Godby
Office of Research, OCLC, Inc.
godby@oclc.org

Mark Bendig
Office of Research, OCLC, Inc.
bendig@oclc.org

Abstract:
Substantial efforts to establish standards for encoding and accessing electronic resources have occurred over the past five years. We have designed a Web-based tool, called Spectrum, to enable individuals without specialized knowledge of library cataloging or markup to create records for describing and accessing networked electronic resources of various types. System users may create descriptions of electronic resources and view them as formatted USMARC bibliographic records, TEI headers and URCs. Because we anticipate continued volatility in the definition of data element standards, the Spectrum system is designed to allow maximum flexibility in the design of the input formats.
Keywords:
USMARC format, Text Encoding Initiative (TEI) header, Uniform Resource Citation (URC), CGI script, SGML, bibliographic data, library cataloging, text retrieval

1. Introduction

1.1 Project Overview

Substantial efforts to establish standards for encoding and accessing networked electronic resources have occurred over the past five years. These standards, which are being developed primarily by librarians, humanities computing researchers, and computer scientists, are just now being implemented by their corresponding user communities. The application of these standards provides opportunities for exploring synergies among the various approaches used. This paper focuses on the relationship among three of these--the Machine Readable Cataloging (MARC) format used by librarians, the Text Encoding Initiative (TEI) Header developed by humanities computing researchers, and the emerging Uniform Resource Citation (URC) standard for accessing materials on the World Wide Web.

One result of our analysis is a prototype of a Web-based tool called Spectrum that enables individuals without specialized knowledge of library cataloging or markup to create records describing the bibliographic and location elements of networked electronic resources of various types.

1.2 Library Cataloging and Bibliographic Control

Bibliographic control refers to the activities of organizing and arranging recorded information according to established standards for the purpose of making it readily identifiable and retrievable (Chan 1994). Librarians use indexing, classifying, and descriptive and subject cataloging to organize the books and other materials libraries acquire. The outputs of these operations are sets of data elements that pertain to items in the collection. The elements for a given item are assembled into a bibliographic citation or catalog entry. When a catalog entry is formulated, the elements are recorded in a manner that uniquely identifies the item in the context of the collection. Access points or characteristics by which the record can be retrieved are also identified or included in the catalog entry. In automated library systems, e.g., online library catalogs, catalog entries are called bibliographic records.

2. Three Schemes for Describing Electronic Resources

2.1 USMARC Format/TEI Header/URC

The United States MARC format, TEI Header and URC each contain a set of elements for encoding bibliographic data for electronic resources. Of these, the URC, as proposed by Daniel (1994), contains the fewest bibliographic elements. (For a more detailed discussion of the definition and use of URCs, see the Internet Engineering Task Force Internet Draft "URC Scenarios and Requirements" at URL: ftp://ds.internic.net/internet-drafts/draft-ietf-uri-urc-req-00.txt .) The TEI header, its content strongly influenced by library cataloging practices, contains a more complete representation of bibliographic elements. Not surprisingly, the USMARC Format for Bibliographic Data contains the most complete element set since it is the standard for representation and exchange of machine-readable bibliographic data in the United States. The bibliographic format contains data elements for various types of materials, including computer files.

Although each scheme contains unique data elements designed to meet the specific needs of its user base (e.g., URC signatures and the TEI <encodingDesc>), there is considerable overlap among the element sets. A mapping of proposed URC data elements to the TEI header and the USMARC format is shown in figure 2.1. Most elements of the URC map directly into uniquely defined USMARC fields. Mapping into the TEI header is less satisfactory since four of ten fields correspond to an undifferentiated note element; the most significant of these is the Uniform Resource Locator (URL). The relationship among these schemes, as it applies to the Spectrum system, is discussed in the remainder of this paper.


Figure 2.1 Mapping of URC data elements to TEI Header and USMARC


2.2 Electronic Location and Access Data in MARC Records

In 1992, the OCLC Office of Research conducted a study of Internet resources. A complete report of that project can be found in Dillon et al. (1993). The first phase of the project focused on collecting and characterizing electronic textual information available through the Internet. In a follow-up cataloging experiment, the practical and theoretical difficulties associated with creating USMARC format bibliographic records for networked textual information resources were investigated. As a consequence of the project, the USMARC field 856 (Electronic Location and Access) was developed. This field contains all data elements necessary to locate and access Internet resources, including a data element for the URL. Prior to implementation of the 856 field in February 1995, electronic location data was recorded in one or more unstructured note fields in bibliographic records.

2.3 Text Encoding Initiative

While the library community was engaged in efforts to assess the viability of cataloging standards for Internet resources, humanities computing scholars were participating in a multi-year project to develop a general text encoding scheme for complex electronic textual structures. This effort, known as the Text Encoding Initiative, resulted in the 1994 publication of a complete set of guidelines for encoding the intellectual content of texts (Sperberg-McQueen and Burnard 1994). The TEI Guidelines for Electronic Text Encoding and Interchange are an application of Standard Generalized Markup Language (SGML), an international standard for describing marked-up electronic text. The SGML encoding standard specifies what markup is permitted and necessary and how markup is distinguished from the text. The TEI guidelines define what the markup means.

2.4 The TEI Header and Bibliographic Data

Although the TEI guidelines are focused primarily on the text markup needs of humanities scholars, chapters 5 and 24 contain detailed provisions for recording data elements important for access and bibliographic control of electronic text files. The TEI header, a mandatory part of TEI-conformant texts, contains elements for recording both bibliographic and non-bibliographic elements pertaining to electronic texts. Freestanding headers extracted from TEI documents are called independent headers. Our research extends the application of independent TEI headers to non-TEI-encoded electronic text. Efforts to extend the use of the TEI header to describe electronic resources other than text are already underway. For example, the Electronic Text Center at the University of Virginia is experimenting with embedding TEI headers in image files (http://www.lib.virginia.edu/etext/ETC.html).

A TEI-based header for the electronic version of Assessing Information on the Internet: Toward Providing Library Services for Computer Mediated Communication, available at http://www.oclc.org/oclc/menu/reschdoc.htm, is shown in Figure 2.2. In this example, the file description includes the common bibliographic elements author, title, publisher, and date, as well as three note fields that are used to record electronic location and access data. The first of these fields does not refer to the electronic version of the resource described by the preceding elements; it points instead to a Web page listing OCLC research publications. The remaining two note fields give the location and access information for the PostScript and compressed tar file formats of the resource described in the TEI header.


Figure 2.2 TEI Header for electronic edition of Assessing Information on the Internet: Toward Providing Library Services for Computer Mediated Communication
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>Assessing Information on the Internet: Toward
             Providing Library Services for Computer Mediated
             Communication</title>
      <author>Martin Dillon</author>
      <author>Erik Jul</author>
      <author>Mark Burge</author>
      <author>Carol Hickey</author>
    </titleStmt>
    <editionStmt>NA</editionStmt>
    <extent>NA</extent>
    <publicationStmt>
      <publisher>OCLC Online Computer Library Center, Inc.,
                 Office of Research</publisher>
      <address>6565 Frantz Road Dublin, Ohio 43017-3395</address>
      <date>1994</date>
    </publicationStmt>
    <seriesStmt>NA</seriesStmt>
    <notesStmt>
      <note>856  7 $u URL:http://www.oclc.org/oclc/menu/reschdoc.htm
            $z For an introductory page to an electronic version of:
            Assessing information on the Internet $2 http</note>
      <note>856  1 $a ftp.rsch.oclc.org $d ftp/pub/internet_resources_project/report
            $f cover.ps $s 9679 bytes $f internet.ps $s 257990 bytes $f appenda.ps $s 84957 bytes
            $f appendb.ps $s 66017 bytes $f appendc.ps $s 37973 bytes $f appendd.ps $s 46106 bytes
            $f appende.ps $s 351941
            $u URL:ftp://ftp.rsch.oclc.org/pub/internet_resources_project/report
            $z These files are in PostScript format. You may read
            them online if you have a PostScript viewer. Otherwise, load them
            to disk and print them on a PostScript printer</note>
      <note>856  1 $a ftp.rsch.oclc.org $c Must be decompressed with Unix uncompress
            $c Must be untarred with Unix tar -xvf $d ftp/pub/internet_resources_project/report 
            $f report.ps.tar.Z $s 312328 bytes
            $u URL:ftp://ftp.rsch.oclc.org/pub/internet_resources_project
            /report/report.ps.tar.Z</note>
    </notesStmt>
    <sourceDesc>
      <biblFull>
        <titleStmt>
          <author>Martin Dillon ... [et al.]</author>
          <title>Assessing information on the Internet : toward
                 providing library services for computer-mediated communication</title>
        </titleStmt>
        <editionStmt>
          <edition>No edition statement provided</edition>
        </editionStmt>
        <extent>1 v. (various pagings) : ill. ; 29 cm.</extent>
        <publicationStmt>         
          <resp><role>publisher</role><name>OCLC</name></resp>
          <place>Dublin, Ohio</place>
          <idno type='OCLC'>27635027</idno> 
          <date>1993</date>
        </publicationStmt>
        <sourceDesc>No source: this is an original work</sourceDesc>
      </biblFull>
    </sourceDesc>
  </fileDesc>
  <encodingDesc>NA</encodingDesc>
  <profileDesc>
    <textClass>
      <keywords scheme=LCSH>
        <list>
        <item>Internet (Computer network)</item> 
        <item>Cataloging of computer files</item> 
        <item>Information networks</item>
        <item>Computer networks</item>
        <item>Libraries--Communication systems</item>
        <item>Information storage and retrieval systems</item>
        <item>Library information networks</item>
        </list>
      </keywords>
      <classCode scheme=DDC20>004.67</classCode>
      <classCode scheme=LCC>TK5105.875.I57</classCode>
    </textClass>
  </profileDesc>
  <revisionDesc>NA</revisionDesc>
</teiHeader>
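
The 856 data carried in the note fields of Figure 2.2 follows a regular pattern: a field tag, then a series of subfields, each introduced by "$" and a one-character code. As an illustration only, the following Python sketch splits such a note into its subfields and extracts the URL. The function names parse_856_note and first_url are hypothetical, and the sketch assumes that "$" never occurs inside a subfield value; it is not part of the Spectrum system.

```python
import re

def parse_856_note(note):
    """Return (tag, subfields), where subfields is a list of (code, value) pairs."""
    text = " ".join(note.split())            # undo line wrapping inside the note
    tag = text.split(" ", 1)[0]              # leading field tag, e.g. "856"
    # Each subfield is "$", a one-character code, then text up to the next "$".
    subfields = [(code, value.strip())
                 for code, value in re.findall(r"\$(\w)\s+([^$]*)", text)]
    return tag, subfields

def first_url(subfields):
    """Return the first $u value with its "URL:" prefix removed, or None."""
    for code, value in subfields:
        if code == "u":
            return value[4:] if value.startswith("URL:") else value
    return None

# First note field of Figure 2.2.
note = """856  7 $u URL:http://www.oclc.org/oclc/menu/reschdoc.htm
      $z For an introductory page to an electronic version of:
      Assessing information on the Internet $2 http"""

tag, subfields = parse_856_note(note)
print(tag, first_url(subfields))
```

Subfields are kept as an ordered list rather than a dictionary because codes such as $f and $s repeat within a single field.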

The TEI header is composed of four major parts: (1) file description <fileDesc>, (2) encoding description <encodingDesc>, (3) text profile <profileDesc>, and (4) revision history <revisionDesc>. The encoding description documents the relationship between an electronic text and the source(s) from which it was created or prepared. Information concerning the subject matter of an electronic text, such as subject descriptors or classification scheme, is recorded in the text profile. Changes made to an electronic text are recorded in the revision history. The file description is of primary importance in our research because it is intended to serve as the electronic equivalent of the title page of a printed work and is the only mandatory element of the TEI header.
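
The four-part structure described above can be checked mechanically. The following Python sketch is illustrative only: it reports which of the four parts a header contains and whether the mandatory <fileDesc> is among them. It assumes a well-formed, XML-style header with quoted attribute values, which is stricter than the SGML of Figure 2.2, and the function name header_parts is hypothetical.

```python
import xml.etree.ElementTree as ET

# The four major parts of a TEI header, in the order given in the text.
HEADER_PARTS = ("fileDesc", "encodingDesc", "profileDesc", "revisionDesc")

def header_parts(header_text):
    """Return (present, has_file_desc) for a <teiHeader> element given as text."""
    root = ET.fromstring(header_text)
    if root.tag != "teiHeader":
        raise ValueError("expected a teiHeader element, found %r" % root.tag)
    present = [part for part in HEADER_PARTS if root.find(part) is not None]
    return present, "fileDesc" in present

# A deliberately tiny header, not the one shown in Figure 2.2.
sample = ("<teiHeader><fileDesc><titleStmt><title>Example</title></titleStmt>"
          "</fileDesc><profileDesc/></teiHeader>")
present, has_file_desc = header_parts(sample)
print(present, has_file_desc)
```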

The content of <fileDesc> was designed to correspond to descriptive cataloging standards whenever possible, but does not substitute for a full descriptive cataloging record and was not intended to do so. For example, the TEI Committee on Text Documentation chose not to include TEI header elements for the cataloging element main entry (primary access point) because library catalogers often have difficulty with this data element, and it is unnecessary if encoders include in <respStmt> everyone responsible for the intellectual content of a work (Giordano 1994). Giordano describes the relationship between the TEI header and library cataloging as follows:

...the file description exemplifies a principle of shared responsibility for the documentation of scholarly material. The intention was ... to encourage the encoder to provide enough accurate information to librarians and others in the documentation community so that professional cataloging could be carried out both effectively and efficiently.

2.5 The Uniform Resource Citation (URC)

Recent proposals addressing the content of the URC have included explicit recognition of the overlap among data elements in the URC, TEI, and MARC formats (Daniel 1994, Desai 1994), although no method for achieving compatibility across schemes has been determined. The Spectrum system does not provide a solution to this problem either. It does, however, build upon the concept of sharing the responsibility for documenting electronic files by providing a mechanism for individuals without specialized cataloging or markup knowledge to create records describing electronic resources of various types. These records can be formatted as MARC records, TEI headers, or URCs. The records will be collected into a database that will be made available for searching on the Internet. This approach presents an opportunity to discover the strengths and weaknesses of the predominant record description schemes and may lead to further refinements and improvements to them.

3. The Spectrum Record Creation Subsystem

3.1 Introduction

As shown in Figure 3.1, the Spectrum system architecture comprises three principal subsystems: (1) the Record Creation Subsystem, which allows users to create data records describing electronic resources by filling in and submitting HTML forms; (2) the Database Creation Subsystem, which builds a database of these data records; and (3) the Record Retrieval Subsystem, which allows this database to be searched for records corresponding to electronic resources of interest. The Spectrum system is built from standard Web components, including NCSA's HTTPD server with the Common Gateway Interface (CGI) protocol for invoking external processes (URL: http://hoohoo.ncsa.uiuc.edu/cgi/overview.html).


Figure 3.1 The Spectrum System Architecture


The following sections discuss the functional requirements and design objectives established for the Record Creation Subsystem (RCS) and the prototype design resulting from these considerations. This prototype RCS will provide the basis for a series of usability tests, resulting in improvements to the design.

3.2 Functional Requirements

The primary functional requirements of the Spectrum Record Creation Subsystem are as follows:

  1. The RCS will present an Item Description Form to the user. When the user submits a filled-out form, the subsystem will create a User Data File (UDF) for the item. This SGML-formatted file will include an entry for each data element provided by the user.

  2. The UDF will be evaluated for correctness, i.e., the inclusion of required information to satisfy pre-established criteria for each available output format.

  3. The UDF may be transformed into any available output format (TEI, MARC, URC, etc.) for display or download at the request of the user. In addition, the UDF will be made available to the Database Creation Subsystem, which will add the corresponding data record to the Spectrum database.

3.3 Design Objectives

The following design objectives were established for the Spectrum Record Creation Subsystem:

  1. The data entry subsystem will present a simple interface to multiple simultaneous users, retaining user data across multipage sessions.

  2. All details of form design and data translation will be specified in ASCII files, with the data entry subsystem code acting as an "engine" to interpret these files. Temporary, dynamic files created by the subsystem will also be ASCII files.

  3. The subsystem will be designed to allow maximum flexibility in the layout and appearance of the forms presented to the user, in order to facilitate usability testing.

3.4 Record Creation Subsystem Design

Overview

The prototype design for the Spectrum Record Creation Subsystem is shown in Figure 3.2. The upper portion of the figure portrays the three Web pages that are seen by the user. Upon first accessing the Spectrum URL, the user is presented with the Item Type Selection Page. This is the "home page" for the Record Creation Subsystem. It contains introductory welcoming and instructional text as well as a list of possible Item Types (Electronic Text, Electronic Journal, etc.) to allow the user to specify what kind of electronic resource is to be described. This Item Type list is implemented as an HTML form. The form is sent to CGI script S1 (shown as a circle in Figure 3.2) when the user clicks the "Submit" button.


Figure 3.2 The Spectrum Record Creation Subsystem


Script S1 performs two functions: (1) creating a Session File containing information about the current user session (such as the selected item type); and (2) dynamically generating an Item Description Form, using the Form Description File corresponding to the selected item type. After these functions are performed, the script sends the Item Description Form to the user for display.

The user now fills in the Item Description Form as completely as possible, providing all available information about this particular electronic resource. Provision is made for the user to specify the desired output record format (TEI, MARC, URC, etc.). When the user clicks the "Submit" button, the form is sent to CGI script S2.

Script S2 performs four functions: (1) creating an SGML-formatted User Data File containing an entry for each data element provided by the user; (2) creating a second file containing this same data in an internal format for use by a later script; (3) producing an output record in the specified output format based on the data provided by the user; and (4) performing an evaluation of the completeness of the resulting output record. After these functions are performed, the script sends the output record to the user, along with the results of the evaluation.

Depending on the results of the evaluation, the user may wish to print out or download the output record, or to return to the Item Description Form to enter additional data. The Output Record Page is implemented as an HTML form with no user input elements. When the user clicks on "Restore Item Description Form" (actually the "Submit" button in disguise), the empty form is sent to script S3.

Script S3 uses the appropriate Form Description File to dynamically regenerate the Item Description Form as is done by script S1. Script S3 then reads the internal-format user data file created by script S2 and sets default values for the data elements originally provided by the user. Finally, the script sends the filled-in Item Description Form to the user for rework.

Static System Elements

CGI Scripts. The CGI scripts (S1, S2, and S3 in Figure 3.2) are implemented using the Tcl programming language (Ousterhout 1994). Tcl is ideally suited for the creation of CGI scripts. It is an interpreted language, so development and debugging are simplified due to the elimination of a compilation step. The Tcl syntax is easy to learn, and offers a number of powerful features such as associative arrays, easy invocation of shell commands, and a full set of string manipulation functions.

Form Description Files. The contents, layout, and appearance of the Item Description Form presented to the user are specified in an ASCII text file called the Form Description File (FDF). There is one FDF for each type of electronic resource; for example, the Item Description Form for electronic texts is specified in an FDF called etext.f.

The FDF is formatted as a series of blocks separated by blank lines. Each block describes either a user input device (text input box, radio button group, etc.) or a graphical or textual element (heading, horizontal rule, etc.). For example, the following section of file etext.f corresponds to the portion of the Item Description Form for electronic texts shown in Figure 3.4:


Figure 3.3 A Form Description File
        .
        .
        .
        # -------- Heading
        Heading 3
        Access Information
        # -------- Instructional Text
        Text
        This section contains the information required to locate an electronic item.
        The information identifies the electronic location containing the item or
        from which it is available.
        <p>If you know the Uniform Resource Locator (URL) for this item,
        you may enter it below.<p>
        # -------- URL Data
        Control TEXTIN
        URL:
        size=40
        url
        # -------- User Prompt
        Text
        <p>If you do not know the URL for this item, you may provide the following
        access information.<p>
        # -------- Access Method Radio Button Group
        Control RADIO
        Access Method:
        nodefault
        access_method
        E-mail
        FTP
        Remote login (Telnet)
        Gopher
        WWW (http)
        .
        .
        .


Figure 3.4 A Partial View of an Item Description Form


This section of the Form Description File contains five blocks, each beginning with a comment line indicated by a "#": (1) a Heading block which produces the heading "Access Information"; (2) a Text block which produces two paragraphs of instructional text; (3) a Text Input block which produces a text box labeled "URL:" for the user to fill in; (4) another Text block which produces further instructional text; and (5) a Radio Button Group block which produces a group of five radio buttons, allowing the user to specify the access method for the item being described.

The line following the introductory comment in each block specifies the element type to be produced. For example, the block for the Radio Button Group contains the line "Control RADIO" following the descriptive comment. Lines following the element type specifier (some of which may be blank) provide additional information about the position and appearance of the associated element. For example, the line "size=40" in the URL Data block instructs the system to produce a Text Input Box that is 40 characters wide.
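
To make the block structure concrete, the following Python sketch reads Form Description File text and renders a Text Input block as HTML. It is a simplified illustration, not the Spectrum engine, which is written in Tcl. The sketch assumes that every block opens with a "#" comment line, as in Figure 3.3, and that a Text Input block lists its label, its attributes, and its name in that order; both assumptions are inferred from the figure. Blank lines are skipped for simplicity, and the function names are hypothetical.

```python
def parse_fdf(text):
    """Split Form Description File text into blocks, one per '#' comment line."""
    blocks = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == ".":
            continue                       # skip blank lines and "." elision marks
        if line.startswith("#"):
            blocks.append({"comment": line.lstrip("# -").strip(), "lines": []})
        elif blocks:
            blocks[-1]["lines"].append(line)
    for block in blocks:
        lines = block.pop("lines")
        block["type"] = lines[0] if lines else ""   # element type specifier
        block["body"] = lines[1:]                   # remaining detail lines
    return blocks

def render_textin(block):
    """Render a 'Control TEXTIN' block; assumed body order is label, attributes, name."""
    label, attrs, name = block["body"]
    return '%s <input type="text" name="%s" %s>' % (label, name, attrs)

SAMPLE = """\
# -------- URL Data
Control TEXTIN
URL:
size=40
url
# -------- Access Method Radio Button Group
Control RADIO
Access Method:
nodefault
access_method
E-mail
FTP
Remote login (Telnet)
Gopher
WWW (http)
"""

blocks = parse_fdf(SAMPLE)
print(render_textin(blocks[0]))
print(blocks[1]["body"][3:])   # the radio button labels
```

Because the element type and the input name are read from the file rather than written into the code, a new field can be added to a form by editing the file alone.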

Included in each block describing a user input device is a unique name for that input device. In this example, the Text Input block for the URL is given the name "url" and the Radio Button Group is given the name "access_method". The names assigned to the input devices in this file become the tags used to mark up the user data in the User Data File created by script S2, as seen in Figure 4.1a. These same names are used to access the corresponding data within the CGI scripts. This ensures that no application terms (such as "author") are hard-coded into the scripts.
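
The reuse of input names as tags can be illustrated with a short Python sketch that turns submitted form data into SGML-style markup. The enclosing element name "item" and the function name to_user_data are assumptions made for this example; the actual layout of the User Data File may differ.

```python
from html import escape

def to_user_data(fields, root="item"):
    """Mark up (name, value) pairs, using each input name as its tag.

    Empty values are skipped, so the result holds one entry for each
    data element the user actually provided.
    """
    lines = ["<%s>" % root]
    for name, value in fields:
        if value:
            # escape() protects "&", "<" and ">" inside the data
            lines.append("  <%s>%s</%s>" % (name, escape(value, quote=False), name))
    lines.append("</%s>" % root)
    return "\n".join(lines)

submitted = [
    ("url", "http://www.oclc.org/oclc/menu/reschdoc.htm"),
    ("access_method", ""),          # left blank by the user
]
print(to_user_data(submitted))
```

Nothing in the function refers to "url" or any other application term; the tags come entirely from the names supplied with the data.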

Dynamic System Elements

Session Files. Each time a user makes a selection from the Item Type list and receives a blank Item Description Form, a new session is considered to have begun. As each session begins, a new Session File containing information relevant to this session is created. In order to accommodate multiple simultaneous users, the Session File is given a name containing a unique session ID number. Thus, the first session is assigned number 00001, and its Session File is named session.00001. The next session to begin results in the creation of file session.00002, and so on. (The same naming technique is used for the user data files discussed below, in order to ensure that the proper file set is associated with the proper session ID.)

The session ID number is stored in hidden fields in the dynamically generated Item Description Form and Output Record Form. The CGI scripts receive this session ID when a form containing it is submitted, allowing them to access the correct set of session and user data files.
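
The session numbering scheme can be sketched in Python as follows. This is an illustration under stated assumptions rather than the Spectrum code: the content written to the Session File, the hidden field name "session_id", and the function names are all hypothetical.

```python
import os
import re
import tempfile

def new_session(directory, item_type):
    """Create the next session.NNNNN file in directory and return its ID."""
    used = []
    for name in os.listdir(directory):
        match = re.fullmatch(r"session\.(\d{5})", name)
        if match:
            used.append(int(match.group(1)))
    session_id = "%05d" % (max(used, default=0) + 1)
    path = os.path.join(directory, "session." + session_id)
    # Mode "x" raises FileExistsError instead of overwriting, so two
    # simultaneous requests cannot silently share one Session File.
    with open(path, "x") as handle:
        handle.write("item_type=%s\n" % item_type)
    return session_id

def hidden_field(session_id):
    """HTML hidden field that carries the session ID from form to form."""
    return '<input type="hidden" name="session_id" value="%s">' % session_id

workdir = tempfile.mkdtemp()
first = new_session(workdir, "etext")
second = new_session(workdir, "etext")
print(first, second, hidden_field(second))
```

A production version would catch FileExistsError and retry with the next number; that step is omitted here for brevity.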

User Data Files. As described in the design overview above, script S2 creates two files containing all data entered by the user. One of these is the system User Data File, which is used as the basis for transforming the data into a selected output record format; this file is described in Section 4.3. The other file is used to maintain the user data between the invocations of scripts S2 and S3. This retention of data supports the ability to restore the user's Item Description Form for rework. Script S3 could also rearrange the elements of the form or mark them in some graphical way in order to indicate any elements that are required for the specified output record format but were not provided by the user.

Application Programming

In accordance with the design objectives cited previously, the Spectrum Record Creation Subsystem is implemented in the form of an application "engine" and a set of control files. The engine knows nothing of the details of the system's operation, but rather acts as an interpreter to create the user experience defined by the control files. This design approach means that once the engine is in place, the actual application is created and refined by editing the control files that determine its operation. Further work on the engine itself will only be required when it is necessary to add a new feature or system behavior which is to be invoked by one of the control files.

The two principal means by which system designers can change the way the system works are by modifying the Home Page and by modifying the Item Description Form.

Home Page. The Spectrum Home Page, which contains introductory text as well as the Item Type list that the user chooses from to begin a Spectrum session, is implemented as a static HTML file. The layout and appearance of the Home Page may be changed by simply editing this file.

Item Description Form. There are separate Item Description forms for each type of electronic resource found in the Item Type list on the Spectrum Home Page. The appearance and layout of these forms is controlled by the corresponding Form Description Files, as discussed in a previous section. Our plan is to conduct usability tests with an emphasis on eliciting user comments concerning the Item Description Form; suggestions made by users can even be implemented "on the spot" by editing the Form Description File.

4. Record Translation, Record Evaluation and Database Building

4.1 Introduction

In addition to supplying appropriate HTML forms and updating the record in progress, the CGI script S2 also does the more substantial work of validating the record and translating it to the formats required for database building and presentation. This work is accomplished primarily by calling OCLC's SGML Document Grammar Builder, a set of software tools developed by Shafer (1994) that automatically identify the corpus structure of SGML-tagged documents. Since the highest interface to the Grammar Builder software is a Tcl shell, the functionality of these powerful tools appears to the application programmer as an extension of the Tcl language. Thus, the CGI scripts required for evaluating and translating records are easily written in a special installation of Tcl enhanced by the Grammar Builder libraries.

4.2 Record Evaluation

The bibliographic record being developed can be evaluated at three levels in the system. First, the design of the HTML forms ensures a minimal level of correctness by requiring the user to enter data for certain fields. Second, at the user's request, the Grammar Builder can act as a standard SGML validator, checking the record against the document type definitions and reporting errors back to the user. Finally, some semantic checks are performed in a process following the Grammar Builder's evaluation. This process makes simple checks, such as verifying that the title and author data elements contain some alphanumeric characters.
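
The third level of checking is simple enough to sketch directly. The Python fragment below flags data elements that contain no alphanumeric characters; the field names "title" and "author" follow the example in the text, while the function name semantic_errors and the dictionary representation of a record are assumptions.

```python
def has_alphanumeric(value):
    """True if the value contains at least one letter or digit."""
    return any(ch.isalnum() for ch in (value or ""))

def semantic_errors(record, fields=("title", "author")):
    """Return the names of checked fields that are missing or have no real content."""
    return [name for name in fields if not has_alphanumeric(record.get(name))]

record = {"title": "Assessing Information on the Internet", "author": " -- "}
print(semantic_errors(record))
```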

Two more checks are planned for future versions of Spectrum. The record will be assigned a score indicating whether it has the minimal information required to qualify as a URC, TEI record, or MARC record. This score will be available as a searchable index in the Spectrum database, allowing users to filter search results according to standards of completeness appropriate to their needs. Note that this is an evaluation of the bibliographic record, not an evaluation of the resource being described. Proposals such as the Seal of Approval, discussed in the URC literature (Daniel 1994), are beyond the scope of the Spectrum project.

The evaluation will also ensure that all proper names given in a resource description are standardized. Without standardization, records containing names such as "Worldwide Web", "World Wide Web", "The Web", and various misspellings could not be resolved to the same name, causing search results from the Spectrum database to be inaccurate. The library community has long understood the problem of name standardization in the description of bibliographic data and can contribute expertise necessary to address it.
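
One possible approach to name standardization, shown here only as a sketch, is a lookup against a list of known variants followed by approximate matching for misspellings. The authority list below, the choice of "World Wide Web" as the preferred form, and the similarity cutoff of 0.85 are all assumptions made for the example; the approximate matching uses get_close_matches from Python's standard difflib module.

```python
import difflib

# Hypothetical authority list: variant (lowercased) -> preferred form.
AUTHORITY = {
    "world wide web": "World Wide Web",
    "worldwide web": "World Wide Web",
    "the web": "World Wide Web",
}

def standardize(name, authority=AUTHORITY, cutoff=0.85):
    """Return the preferred form of a name, or the name unchanged if unknown."""
    key = " ".join(name.lower().split())       # fold case and extra spaces
    if key in authority:
        return authority[key]
    # Fall back to approximate matching to catch simple misspellings.
    close = difflib.get_close_matches(key, list(authority), n=1, cutoff=cutoff)
    return authority[close[0]] if close else name

for name in ("Worldwide  Web", "The Web", "World Wide Wbe", "Gopher"):
    print(name, "->", standardize(name))
```

Approximate matching can merge names that are in fact distinct, so a real system would likely refer uncertain cases to a cataloger instead of resolving them automatically.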

4.3 Record Translation

Three considerations guided the design of the record translation process. First, we assume that there will be volatility in the specification of data elements for some of the record types that we plan to generate. Accordingly, the details of the translations are recorded in an easily changed scripting language that is an input to the Grammar Builder software. Second, we consider our goal to be the mapping of semantically similar elements found in the major record formats proposed for the description of electronic resources. It is not in our interest to add to an already confusing discussion by proposing new standards.

Finally, we expect most of our records to be contributed by users of the Spectrum system, but we must be flexible enough to accept input from other sources. Several projects are under way in the library community that promise to make valuable contributions to the repository of TEI or MARC records describing electronic resources. For example, the Internet Cataloging Project, funded by the U.S. Department of Education and managed at OCLC, aims to enlist the cooperation of librarians in the creation of a searchable database of USMARC records that describe Internet-accessible materials (URL: http://www.oclc.org/oclc/man/catproj/catcall.htm). Spectrum project staff will also seek to work closely with institutions engaged in projects that use the TEI header as the source for cataloging information, such as the Center for Electronic Texts in the Humanities (Hockey 1993, Horowitz 1994) and the University of Virginia Libraries (UVA). See Gaynor (1994) for an account of UVA's effort to use the TEI header to provide bibliographic control and access to electronic texts.

Translations from the User Data File

Our analysis so far has revealed that there is enough overlap between the URC, TEI and MARC records to create a set of data elements, the Item Description Structure, from which minimal versions of all three records can be generated as views. The Item Description Structure is intended to be a pure intersection of the three record types, but at present it is not strictly one. We had to include data elements required for recording the location of an electronic resource even if these elements are missing from one of the record types. For example, the Item Description Structure contains a field for encoding the URL, but this data can be added to the TEI record only in an unstructured note field. Perhaps a future specification of the TEI record will contain an explicit field for the URL.

The Item Description Structure is written to the User Data File as the user works on a record. A fragment of the User Data File, together with the corresponding views as a URC, a TEI record and a MARC record, is given in Figure 4.1. The data elements are enclosed in SGML tags and are organized in a way that reflects the organization of the Item Description forms. This SGML markup is sufficient to produce the three output formats currently defined in the Spectrum system. Only those data elements that are common to all three formats are included in this example.

The data in the Item Description Structure in Figure 4.1 is a record of a Spectrum session in which a user attempted to describe the same intellectual work that the TEI record in Figure 2.2 describes. The most complex data is the access information. All three locations are recorded: the URL and the two FTP sites for the different versions of the PostScript files.

We must point out that the TEI and MARC records in Figure 4.1 are syntactically correct but illegal records because they lack required publisher data. Spectrum would assign such a record a score indicating that it is adequate for a URC, but inadequate as a TEI or a MARC record. A user desiring to create a TEI or MARC record from this data would be alerted to the problem and would be given a chance to "upgrade" the record.
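
The adequacy check described above can be illustrated with a short Python sketch. It does not reproduce Spectrum's scoring; it only reports which required elements a record lacks for each format. The required-element sets and the function name are assumptions made for the example; the only requirement taken from the text is that TEI and MARC records need publisher data.

```python
# Hypothetical required-element sets. Only the publisher requirement for
# TEI and MARC comes from the text; the rest is assumed for illustration.
REQUIRED = {
    "URC": {"title", "author", "url"},
    "TEI": {"title", "author", "publisher"},
    "MARC": {"title", "author", "publisher"},
}

def missing_elements(record: dict) -> dict:
    """Map each format name to the sorted list of required elements the record lacks."""
    present = {name for name, value in record.items() if value}
    return {fmt: sorted(required - present) for fmt, required in REQUIRED.items()}

record = {
    "title": "Assessing Information on the Internet...",
    "author": "Martin Dillon, et al.",
    "url": "http://www.oclc.org/oclc/menu.reschdoc.htm",
}
report = missing_elements(record)
```

For this record the URC entry of the report is empty, while the TEI and MARC entries each list the missing publisher element, which is the information a user would need in order to upgrade the record.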


Figure 4.1 The Item Description Structure and Some Simple Transforms

a. Item Description Structure

<descriptive_info>
    <title>Assessing Information on the Internet...</title>
    <author>Martin Dillon, et al.</author>
</descriptive_info>
<access_info>
    <url>URL:http://www.oclc.org/oclc/menu/reschdoc.htm</url>
</access_info>
<access_info>
    <access_method>ftp</access_method>
    <hostname>ftp.rsch.oclc.org</hostname>
    <path>ftp/pub/internet_resources_project/report</path>
    <file><name>cover.ps</name><size>9679 bytes</size></file>
    <file><name>internet.ps</name><size>257990</size></file>
    <file><name>appenda.ps</name><size>84957</size></file>
    <file><name>appendb.ps</name><size>66017</size></file>
    <file><name>appendc.ps</name><size>37973</size></file>
    <file><name>appendd.ps</name><size>46106</size></file>
    <file><name>appende.ps</name><size>351941</size></file>
</access_info>
<access_info>
    <access_method>ftp</access_method>
    <hostname>ftp.rsch.oclc.org</hostname>
    <path>ftp/pub/internet_resources_project/report</path>
    <file><name>report.ps.tar.Z</name><size>312328</size></file>
</access_info>

b. MARC Record

100   Martin Dillon, et al.
245   Assessing Information on the Internet...
856 7 $u http://www.oclc.org/oclc/menu.reschdoc.htm
856 1 $a ftp.rsch.oclc.org $d ftp/pub/internet_resources_project/report
      $f cover.ps $s 9679 bytes $f internet.ps $s 257990 bytes
      $f appenda.ps $s 84957 bytes $f appendb.ps $s 66017 bytes
      $f appendc.ps $s 37973 bytes $f appendd.ps $s 46106 bytes
      $f appende.ps $s 351941 bytes
856 1 $a ftp.rsch.oclc.org $d ftp/pub/internet_resources_project/report
      $f report.ps.tar.Z $s 312328

c. TEI Header

<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>Assessing Information on the Internet...</title>
      <author>Martin Dillon, et al.</author>
    </titleStmt>
    <notesStmt>
      <note>856 7 $u http://www.oclc.org/oclc/menu.reschdoc.htm</note>
      <note>856 1 $a ftp.rsch.oclc.org
            $d ftp/pub/internet_resources_project/report
            $f cover.ps $s 9679 bytes $f internet.ps $s 257990 bytes
            $f appenda.ps $s 84957 bytes $f appendb.ps $s 66017 bytes
            $f appendc.ps $s 37973 bytes $f appendd.ps $s 46106 bytes
            $f appende.ps $s 351941 bytes</note>
      <note>856 1 $a ftp.rsch.oclc.org
            $d ftp/pub/internet_resources_project/report
            $f report.ps.tar.Z $s 312328</note>
      <note>856 1 $u ftp://ftp.rsch.oclc.org/pub/internet_resources_project/report</note>
    </notesStmt>
  </fileDesc>
</teiHeader>

d. URC (hypothetical)

URN: Uniform Resource Name (not available at this time)
Title: Assessing Information on the Internet...
Author: Martin Dillon, et al.
URL: http://www.oclc.org/oclc/menu.reschdoc.htm
Content-Length: <null></null>
URL: ftp://ftp.rsch.oclc.org/pub/internet_resources_project/report
Content-Length: 9679, 257990, 84957, 66017, 37973, 46106, 351941
URL: ftp://ftp.rsch.oclc.org/pub/internet_resources_project/report/report.ps.tar.Z
Content-Length: 312328


Records are translated to the desired view by calling the Grammar Builder's translator, which takes SGML-tagged input, induces the grammar, and produces whatever output is required for the application at hand. The translator is the same as that used to convert SGML to TeX and SGML to HTML in the project reported in Weibel et al. (1994). As is described in detail in Shafer and Thompson (1995), a translation statement for the Grammar Builder consists of a condition and an action; a translation script is a set of translation statements. Most of the mappings in Figure 4.1 are simply lexical substitutions, even those that build the tree structure required for the TEI record. For example, the <title> tag in the Item Description Structure is translated to "245" in the MARC record and "Title:" in the URC with the following translation rules, the first for the MARC view and the second for the URC view. The Literal command in the Action portion of the statement specifies that the mapped line consists of the literal text string in parentheses followed by the data originally enclosed in the SGML tag given in the Start_Tag condition.

       Condition           Action
       Start_Tag(title)    Literal("245 ")
       Start_Tag(title)    Literal("Title: ")

The most complex translation is required in the construction of the 856 fields from the <access_info> data in the Item Description Structure, but close inspection reveals that this is also just a series of lexical substitutions. Since the data elements within each <access_info> element appear in the same order as the corresponding subfields of the 856 field, the following translation statements generate the desired result. The only new element here is the Ancestor condition, which allows the writer of the translation script to refer unambiguously to a data element that may be embedded arbitrarily deep in a tree structure.

    if (Ancestor(access_info))
       {
       Start_Tag(url)                Literal("856 7 $u ")
       Start_Tag(access_method)      Literal("856 1 ")
       Start_Tag(hostname)           Literal("$a ")
       Start_Tag(path)               Literal("$d ")
       if (Ancestor(file))
          {
          Start_Tag(name)            Literal("$f ")
          Start_Tag(size)            Literal("$s ")
          }
       }
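
For readers without access to the Grammar Builder, the effect of the 856 translation statements above can be approximated in Python. This is a rough emulation, not the Grammar Builder itself. Several details are assumptions made for the example: the <record> wrapper element, the rule-table layout, the function names, the choice to let the nearest matching ancestor decide, and the choice to drop the text of <access_method> on the grounds that the 856 indicator already encodes it (as the MARC view in Figure 4.1 suggests).

```python
import xml.etree.ElementTree as ET

# (ancestor tag, element tag) -> (literal, keep_text), mirroring the statements above.
# keep_text=False for access_method is an assumption: the indicator carries the method.
RULES = {
    ("access_info", "url"): ("856 7 $u", True),
    ("access_info", "access_method"): ("856 1", False),
    ("access_info", "hostname"): ("$a", True),
    ("access_info", "path"): ("$d", True),
    ("file", "name"): ("$f", True),
    ("file", "size"): ("$s", True),
}

def translate(elem, ancestors=()):
    """Walk the tree in document order, yielding one output piece per matching rule."""
    for ancestor in reversed(ancestors):  # nearest matching ancestor decides
        rule = RULES.get((ancestor, elem.tag))
        if rule is not None:
            literal, keep_text = rule
            text = (elem.text or "").strip() if keep_text else ""
            yield f"{literal} {text}".strip()
            break
    for child in elem:
        yield from translate(child, ancestors + (elem.tag,))

def to_856_fields(root):
    """Group the pieces into fields; a piece beginning with '856' opens a new field."""
    fields = []
    for piece in translate(root):
        if piece.startswith("856") or not fields:
            fields.append(piece)
        else:
            fields[-1] += " " + piece
    return fields

# The third <access_info> element of Figure 4.1a, wrapped in a hypothetical root.
SAMPLE = """<record><access_info>
    <access_method>ftp</access_method>
    <hostname>ftp.rsch.oclc.org</hostname>
    <path>ftp/pub/internet_resources_project/report</path>
    <file><name>report.ps.tar.Z</name><size>312328</size></file>
</access_info></record>"""

fields = to_856_fields(ET.fromstring(SAMPLE))
```

Because every rule is a lookup keyed on a tag and one of its ancestors, changing a subfield code means editing one table entry, which is the same property that makes the translation scripts easy to revise.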

Though the mappings of the data elements in the Item Description Structure into the URC, TEI and MARC records are straightforward using the Grammar Builder scripting language, this simple example raises some issues involving the definition of the three standards. The location fields are especially problematic. For example, the second URL in the URC record is of limited usefulness. The data in the URL field is not literally a pointer to a file, but a pointer to a directory containing seven PostScript files. But since the current URC guidelines specify only that file sizes get recorded, not filenames, the pointers to the actual files are lost. Another problem is that only the MARC record contains an explicit data element instructing users how to handle the files once they are retrieved. This important information instructing users to apply the Unix untar and uncompress utilities to the files is recorded as an optional note in the librarian-created TEI record in Figure 2.2, and is missing altogether from the URC specification.

A TEI-to-MARC Translation

A more complex mapping is the translation of a complete TEI record to a MARC record. Many of the transformations would be difficult to perform in a general way without exploiting the knowledge of the source document's grammar that the Grammar Builder translator makes available. To illustrate, consider some of the mappings between the TEI record in Figure 2.2 and the MARC record generated by Spectrum in Figure 4.2.


Figure 4.2 Automatically generated MARC record corresponding to the TEI record in Figure 2.2

090   RK5105.875.I57
092   00467 $2 20
100   Martin Dillon
245   Assessing information on the internet $h [computer file] : toward providing library 
      services for computer mediated communication
260   Dublin, Ohio : $b OCLC Online Computer Library Center, Inc., Office of Research
      $c 1994
650   Internet (Computer network)
650   Cataloging of computer files
650   Information networks
650   Libraries--Communication systems
650   Information storage and retrieval systems
650   Library information networks
700   Erik Jul
700   Mark Burge
700   Carol Hickey
856 7 $u http://www.oclc.org/oclc/menu.reschdoc.htm
856 1 $a ftp.rsch.oclc.org $d ftp/pub/internet_resources_project/report
      $f cover.ps $s 9679 bytes $f internet.ps $s 257990 bytes
      $f appenda.ps $s 84957 bytes $f appendb.ps $s 66017 bytes
      $f appendc.ps $s 37973 bytes $f appendd.ps $s 46106 bytes
      $f appende.ps $s 351941 bytes
856 1 $a ftp.rsch.oclc.org $d ftp/pub/internet_resources_project/report
      $f report.ps.tar.Z $s 312328

The 260 field, publisher data, is assembled from the <publisher>, <date> and <address> fields in the <publicationStmt>. In the 092 field, the Dewey Decimal Classification, the data comes from the <classCode> element in the <profileDesc> field of the TEI header; the data in the $2 subfield is the value of the attribute in that field. The 100 field, author data, comes from the data enclosed in the first <author> tag in the <fileDesc> subsection of the TEI header; the data from the remaining author fields in this portion of the TEI header may be formatted as 700 fields. It is necessary to distinguish this <author> field from the one in the <sourceDesc> subsection of the TEI header, which may be formatted into a 500 field, general note, in the MARC record--or omitted, as in this example.

The 650 fields, representing the Library of Congress Subject Headings for this record, come from the <item> fields in the <textClass> subsection of the TEI header. The data from these fields can be extracted in a straightforward way, but a MARC-compliant record would require the consultation of the Library of Congress Subject Authority file for appropriate subfield tags and codes.

In sum, the automatic translation of bibliographic records from TEI to MARC formats involves a wide range of operations on tree-structured data. Data extracted from fields in the TEI record may be joined, split, inserted or omitted. It is sometimes necessary to refer to the ancestors of the TEI tags containing data to be transferred to the MARC record. Finally, the extracted fields must be sorted to conform to cataloging and MARC format standards.
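
The join, split, ancestor and sort operations summarized above can be illustrated with a small Python sketch. It is not the Grammar Builder script used by Spectrum. The sample header is a simplified, hypothetical stand-in for the TEI record in Figure 2.2; the function names, the element paths, and the punctuation of the 260 field are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

def text_of(elem):
    """Return an element's text content with whitespace collapsed ('' if absent)."""
    return " ".join("".join(elem.itertext()).split()) if elem is not None else ""

def tei_to_marc(header):
    """Build a sorted list of (tag, data) pairs from a teiHeader element."""
    fields = []
    # Split / ancestor: only authors under titleStmt are used, so an <author>
    # inside sourceDesc is ignored. The first becomes 100, the rest become 700.
    authors = header.findall("fileDesc/titleStmt/author")
    for position, author in enumerate(authors):
        fields.append(("100" if position == 0 else "700", text_of(author)))
    title = header.find("fileDesc/titleStmt/title")
    if title is not None:
        fields.append(("245", text_of(title)))
    # Join: address, publisher and date are combined into one 260 field.
    pub = header.find("fileDesc/publicationStmt")
    if pub is not None:
        place, name, date = (text_of(pub.find(t)) for t in ("address", "publisher", "date"))
        fields.append(("260", f"{place} : $b {name}, $c {date}"))
    # Sort: ascending tag order; sorted() is stable, so 700 fields keep their order.
    return sorted(fields, key=lambda field: field[0])

SAMPLE = """<teiHeader><fileDesc>
  <titleStmt>
    <title>Assessing information on the internet</title>
    <author>Martin Dillon</author>
    <author>Erik Jul</author>
  </titleStmt>
  <publicationStmt>
    <publisher>OCLC Online Computer Library Center, Inc., Office of Research</publisher>
    <address>Dublin, Ohio</address>
    <date>1994</date>
  </publicationStmt>
  <sourceDesc><bibl><author>Dillon, Martin</author></bibl></sourceDesc>
</fileDesc></teiHeader>"""

marc = tei_to_marc(ET.fromstring(SAMPLE))
```

The sketch omits the classification and subject fields, and it assumes that all three publication elements are present; a fuller version would need to handle missing elements and the authority-file lookups discussed above.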

Since the Grammar Builder software can easily handle the TEI-to-MARC translation, the most complex mapping in the Spectrum system, we are in a position to move quickly to the major issues arising from the description of electronic resources. For example, what is a "document" in the electronic media--an electronic facsimile of a book or article, or a pointer to a set of hypertext experiences? The real-world example we have described in this article illustrates the genuine confusion that users have about this question. Of the three URLs in our example, two of them point to electronic versions of printed material. The other URL points to a standard Web page that seems intuitively to be the "front page" of the document. However, this Web page is not the work whose intellectual contents are described in the TEI and MARC records, but a table of contents to that work--nothing more than a convenient interface for downloading or viewing the electronic files that correspond to the printed work.

4.4 Database Creation and Access

As Figure 3.1 shows, the Spectrum system includes a means for storing and retrieving records. Once the user has indicated that the record is suitable for the catalog of Internet resources, the record is automatically prepared for inclusion in a full-text database using OCLC's Newton database management system. The record is then accessible to a Mosaic user via OCLC's WebZ Server (Weibel et al. 1994), a replacement for the standard HTTPD server that maintains a database session conforming to the Z39.50 information retrieval protocol. In a later version of the Spectrum project, the WebZ Server will be used for record creation as well as record retrieval, eliminating the NCSA HTTPD Server. The NCSA Server is currently necessary because the WebZ Server does not yet support the CGI protocol.

5. Future Research

Future research efforts will focus on a Common Client Interface version of Spectrum capable of interacting with locally resident software tools, such as spelling checkers, word processors, general dictionaries and thesauri. Other enhancements include a module for eliciting user-supplied subject categorizations, and automatic indexing of textual electronic resources.

References:

[Chan94]
Chan, Lois Mai. 1994. Cataloging and Classification. New York: McGraw-Hill, Inc.
[Daniel94]
Daniel, Ron. 1994. Proposed URC External Representation. Accessible at URL: http://www.acl.lanl.gov/URI/ExtRep/urc0.html
[Desai94]
Desai, Bipin. 1994. The Semantic Header and Indexing and Searching on the Internet. Accessible at URL: http://www.cs.concordia.ca/~faculty/bcdesai/cindi-system-1.0.html
[Dillon93]
Dillon, M. et al. 1993. Assessing information on the Internet: Toward providing library services for computer-mediated communication. Dublin, Ohio: OCLC. OCLC/OR/RR-93/1.
[Gaynor94]
Gaynor, Edward. 1994. "Cataloging Electronic Texts: The University of Virginia Library Experience." Library Resources and Technical Services 38(4): 403-413 (October 1994).
[Giordano94]
Giordano, Richard. 1994. "The Documentation of Electronic Texts Using Text Encoding Initiative Headers: An Introduction." Library Resources and Technical Services 38(4): 389-401 (October 1994).
[Hockey93]
Hockey, Susan. 1993. "Developing Access to Electronic Texts in the Humanities." Computers in Libraries 38 (2): 41-43 (February 1993).
[Horowitz94]
Horowitz, Lisa R. 1994. CETH Workshop on Documenting Electronic Texts. CETH Technical Report, no. 2. New Brunswick, N.J.: CETH.
[Ousterhout94]
Ousterhout, John K. 1994. Tcl and the Tk Toolkit. New York: Addison-Wesley Publishing Company.
[Shafer94]
Shafer, Keith. 1994. "Fred: The SGML Grammar Builder." Accessible at URL: http://www.oclc.org/fred/
[Shafer95]
Shafer, Keith and R. Thompson. 1995. "Introduction to Translating Tagged Text via the SGML Document Grammar Builder Engine." Accessible at URL: http://www.oclc.org/fred/docs/translations/intro.html
[Sperberg94]
Sperberg-McQueen, C. M., and Lou Burnard, eds. 1994. Guidelines for Electronic Text Encoding and Interchange. Chicago and Oxford: Text Encoding Initiative.
[Weibel94]
Weibel, Stuart et al. 1994. "An Architecture for Scholarly Publishing on the World Wide Web." Proceedings from the Second International WWW Conference: Mosaic and the Web, 1994. 739-748.