Sourcerer: Thesaurus-Assisted
Automated Source Identification
for the World-Wide Web

R. P. C. Rodgers
Suresh Srinivasan
Lister Hill National Center for Biomedical Communications

Jim Fullton
Clearinghouse for Networked Information Discovery and Retrieval

Abstract
Introduction
Sourcerer Architecture
Prototype Walk-Through
Discussion
Acknowledgments
References

Note: this document contains an HTML table; as of the time of writing, not all available Web clients are capable of formatting this properly.

Abstract

Sourcerer® is a system for automated location and retrieval of information available via the World-Wide Web (WWW). It is a component of the Information Sources Map (ISM) Project, which in turn is part of the Unified Medical Language System® (UMLS®), a multi-year project of the U.S. National Library of Medicine (NLM). Sourcerer builds upon three UMLS resources: the ISM database (each record of which contains a library catalog-like description of an information source), the UMLS Metathesaurus® (the 1994 version of which contains over 298,000 terms corresponding to over 152,000 distinct concepts, and unifies 30 separate defined vocabularies), and the UMLS semantic network (which tags thesaurus terms by semantic type, and lays out potentially meaningful relationships between semantic types).

A forms-based entry tool, Apprentice®, allows an information provider to remotely create a structured description of an information source. A human cataloger then adds index information using the NLM MeSH® vocabulary as well as the UMLS semantic network, and adds the description to the Information Sources Map (ISM) database.

Sourcerer is a Common Gateway Interface (CGI) application that acts as a client to four different types of servers:

UMLS Knowledge Source (Metathesaurus and semantic network) servers.
The Information Sources Map (ISM) server.
A Uniform Resource Name (URN) to Uniform Resource Locator (URL) resolver.
One or more terminal information servers (TIS).

A user query is interactively refined with the assistance of the UMLS Knowledge Source servers, which are also used to map query terms into high-level concepts from the UMLS semantic network. These high-level concepts are used to retrieve a list of potentially useful information sources from the database on the ISM server, returned in the form of URNs. URNs are then resolved to URLs by another server. The user can then screen the proposed sources or connect to them automatically. In one instance, the original user query is passed along to the TIS using TIS-specific syntax.

Introduction

Librarians and information scientists often evaluate the success of an attempt to retrieve information by means of two quantitative measures: precision (the fraction of information retrieved that is relevant to the query) and recall (the fraction of all relevant information that is actually retrieved). These criteria have frequently been applied to two distinct strategies for information retrieval:

The use of natural language, in which textual information is indexed using its own contents. This can require rather large computational resources.
The use of defined vocabularies, in which carefully defined topical vocabularies are used to index the matter to be retrieved. This is often less consumptive of computer resources, but requires human indexing.

Tools such as WAIS and various Web indexing services employ the first of these strategies. MEDLINE, one of the world's largest and oldest computerized bibliographic databases, employs the second. MEDLINE, created and maintained by the U.S. National Library of Medicine (NLM), contains bibliographic descriptions and abstracts of the world's major biomedical journals, in many different languages. Entries in the MEDLINE database are indexed by highly trained personnel using a defined vocabulary known as MeSH (Medical Subject Headings®).

The primary constituency of the NLM, health care workers and biomedical research workers, generally approach MEDLINE with crisply defined information needs and tight time constraints. Accrued experience at the NLM suggests that retrieval systems based on the use of defined vocabularies are more successful in this setting than systems based on the natural language contents of documents [Lancaster & Warner, 1993].

The use of defined vocabularies has spread widely within medicine; in fact, the proliferation of multiple defined vocabularies is regarded by some workers as an impediment to the creation of universally usable computerized medical records. To counter the centrifugal effects of multiple vocabularies, the NLM has been engaged in a major multi-year research and development effort known as the Unified Medical Language System (UMLS) Project [Humphreys & Lindberg, 1993]. The UMLS has three principal components, known collectively as the UMLS Knowledge Sources (the numbers below refer to the 1994 edition):

The UMLS Metathesaurus contains over 298,000 terms (including lexical variants and synonyms) relating to over 152,000 distinct concepts, drawn from 30 separate defined biomedical vocabularies (see Table 1).
The UMLS semantic network [McCray, Srinivasan, & Browne, 1994] contains 133 semantic types and 50 semantic relations. Semantic types can be thought of as high-level abstract descriptions of the meaning of concepts appearing in the Metathesaurus; every Metathesaurus concept is tagged with one or more semantic types. Semantic relations can be thought of as verbs describing the sorts of relationships that can occur between semantic types. The semantic network comprises the set of types and the relations connecting them, and is too complex to be represented visually in its entirety, although parts of it can be diagrammed, as in Figure 1.
The ISM is an automated catalog of information resources.

For purposes of the ISM, "information resource" is defined broadly, encompassing virtually any electronically accessible repository, including bibliographical databases, full text collections, image archives, and expert systems. Initially just a database of information resources, the ISM is now acquiring (through the work described here) procedural elements that allow it to be applied to search-and-retrieval. A forms-based entry tool, Apprentice®, allows an information provider to create a structured description of an information source. Such a record contains fields for free-text descriptions of the source, as well as fields describing: the identity and location of the original information gatherer; the identity and location of the information provider; the nature of database content; database language(s); alternate database names; the nature of the intended audience; database size; update frequency; number of records added annually; relationships between this database and other databases (including subset/superset relationships, and relationships between specific fields in distinct databases); and sample interactions. A human cataloger adds additional index information using the NLM MeSH vocabulary, as well as semantic types and semantic type relationships drawn from the UMLS semantic network.
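
The shape of such a record can be pictured with a small data structure. The following is only a sketch: the class name, field names, and sample values are illustrative assumptions covering a subset of the fields listed above, and do not reproduce the actual ISM record layout.

```python
from dataclasses import dataclass, field


@dataclass
class ISMRecord:
    """Hypothetical, simplified ISM record (field names are assumptions)."""
    name: str
    description: str                 # free-text description of the source
    provider: str                    # identity of the information provider
    languages: list[str] = field(default_factory=list)
    update_frequency: str = ""
    # Index information added by the human cataloger:
    mesh_headings: set[str] = field(default_factory=set)
    semantic_types: set[str] = field(default_factory=set)
    # (type, relation, type) triples drawn from the semantic network
    semantic_relationships: set[tuple[str, str, str]] = field(default_factory=set)


# Illustrative values only; not copied from a real ISM record.
example = ISMRecord(
    name="Example Bibliographic Database",
    description="Citations and abstracts from biomedical journals.",
    provider="Example Library",
    languages=["English"],
    mesh_headings={"Aspirin"},
    semantic_types={"Pharmacological Substance"},
)
```

Keeping the cataloger-supplied index terms as sets makes the later matching step (does this source cover the requested types and headings?) a simple subset test.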

The UMLS Knowledge Sources are updated annually, and released on CD-ROM, under the terms of a beta test agreement, for purposes of research. A network-accessible server that provides access to UMLS data is under development, and is used by Sourcerer.

Table 1. List of constituent vocabularies of the UMLS Metathesaurus (Meta-1.4, 1994).

Abbrev.	Name	Owner/Originator	Purpose
ACR92	Index for Radiological Diagnoses	American College of Radiology (1986)	Radiology and ultrasound
AIR92	AI/RHEUM	NLM (1992)	Rheumatology
COS89	COSTAR (COmputer-Stored Ambulatory Records)	Massachusetts General Hospital (1989)	Outpatient records
COS92	COSTAR (COmputer-Stored Ambulatory Records)	Massachusetts General Hospital (1992)	Outpatient records
CSP93	CRISP Thesaurus	NIH (1993)	Biomedical research
CPT89	Physicians' Current Procedural Terminology (CPT)	AMA (1989)	Physicians' billing
CST93	COSTART: Coding Symbols for Thesaurus of Adverse Reaction Terms	FDA (1993)	Drug reactions
DOR27	Dorland's Medical Dictionary, 27th Ed.	Saunders	General medical terminology
DSM3R	Diagnostic and Statistical Manual of Mental Disorders (DSM)	American Psychiatric Association (1987)	Psychiatry
DXP92	DXplain, an expert diagnosis program	Massachusetts General Hospital (1992)	General medicine
HHC93	Home Health Care Classification of Nursing Diagnoses and Interventions (HHC)	Georgetown Univ. (1993)	Nursing
ICD89	International Classification of Diseases, 9th Rev., 3rd Ed. (ICD-9)	HCFA (1989)	General medicine
ICD91	International Classification of Diseases, 9th Rev., 4th Ed.	HCFA (1991)	General medicine
INS94	Thesaurus Biomedical Français/Anglais (French MeSH)	INSERM (1993)	Biomedical research and clinical
LCH90	Library of Congress Subject Headings (LCSH)	LOC (1989)	General knowledge
MCM92	List of Epidemiology Terms	McMaster University (1992)	Epidemiology
MIM93	Online Mendelian Inheritance in Man	Johns Hopkins Univ. (1993)	Human genetics
MSH94	Medical Subject Headings (MeSH)	NLM (1994)	Biomedical research and clinical
MSH94	Medical Subject Headings (Supplementary Chemical Terms)	NLM (1994)	Biomedical research and clinical
MTH	Metathesaurus	NLM (1994)	Unites biomedical vocabularies
NAN92	Classification of Nursing Diagnoses	9th Conference on the Classification of Nursing Diagnoses (1992)	Nursing
NEU	Neuronames Brain Hierarchy	Univ. of Washington	Brain Anatomy
NIC93	Nursing Interventions Classification	Iowa Intervention Project (1993)	Nursing
PDQ93	Physician Data Query Online System	NCI (1993)	Oncology
SNM2	Systematized Nomenclature of Medicine (SNOMED II)	College of American Pathologists (1979, 82)	Human pathology
SNMI	SNOMED International	College of American Pathologists (1979, 82)	Human pathology
SNM3	Systematized Nomenclature of Human and Veterinary Medicine	College of American Pathologists, American Veterinary Medicine Association (1993)	Human and veterinary pathology
UMS94	Universal Medical Device Nomenclature System (UMDNS): Product Category Thesaurus	ECRI (1994)	Medical devices
CRISP	U.S. P.H.S. Thesaurus for indexing scientific projects	U.S. Public Health Service	Biomedical research
UWA92	Primate Information Center Data	Univ. Washington (1992)	General medical

The goal of a good search is to maximize precision and recall. The burgeoning number of biomedical information sources available via the Internet has broadened the search problem beyond that of maximizing precision and recall within a specific database; the problem now becomes two-tiered: first, identify appropriate information sources; second, search those sources effectively. The browsing paradigm and word-based indexing schemes offered by WWW, gopher, and WAIS in their currently most commonly encountered forms do not suffice to address the retrieval needs of clinicians and biomedical researchers. The Sourcerer Project is attempting to address this issue through the application of UMLS-based tools and appropriate Internet-based standards.
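
The two measures defined at the start of this section, and their application at both tiers, can be made concrete with a short calculation. This is a minimal sketch; the function names and the sample sets are invented for illustration and are not results from Sourcerer.

```python
def precision(retrieved: set, relevant: set) -> float:
    """Fraction of the retrieved items that are relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def recall(retrieved: set, relevant: set) -> float:
    """Fraction of all relevant items that were actually retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


# Tier 1: which information sources were proposed? (made-up data)
relevant_sources = {"A", "B", "C"}
retrieved_sources = {"A", "B", "D"}

# Tier 2: which documents came back from the chosen sources? (made-up data)
relevant_docs = {1, 2, 3, 4}
retrieved_docs = {1, 2, 3, 5, 6}

source_scores = (precision(retrieved_sources, relevant_sources),
                 recall(retrieved_sources, relevant_sources))
doc_scores = (precision(retrieved_docs, relevant_docs),
              recall(retrieved_docs, relevant_docs))
```

Both functions need the full set of relevant items, which is known only for test collections with predetermined answers.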

Figure 1. One view from the UMLS semantic network. This diagram demonstrates multiple semantic type relations, including one that asserts that the semantic type "Anatomical Structure" is related to semantic type "Organism" by the semantic relation "part of."

Sourcerer Architecture

Sourcerer is a Common Gateway Interface (CGI) application, implemented behind the NCSA Web server, httpd. It is written primarily in the perl scripting language, with some modules written in the C programming language, and runs on a Sun workstation under the Solaris 2.3 (UNIX) operating system. The user interacts with it via any forms-capable Web client. Sourcerer acts as a client to four different types of network-based servers, as diagrammed in Figure 2:

UMLS Knowledge Source (Metathesaurus and semantic network) servers. These servers return any Metathesaurus and semantic network data associated with concepts that match user-entered text strings.
The Information Sources Map (ISM) server returns records describing information resources. The resources are described by database records that include free text descriptions of content, and various categorical and nominal descriptors such as type of information content, size, and dates of coverage. The records are also indexed using MeSH, semantic types, and semantic type relations.
A Uniform Resource Name (URN) to Uniform Resource Locator (URL) resolver.
One or more terminal information servers (TIS).
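
The way a query flows through the first three server types listed above can be sketched as a chain of lookups. Everything in this sketch is an assumption made for illustration: the in-memory tables stand in for network servers, the URN and URL strings are invented (the form of URNs was still undecided at the time of writing), and the rule used to match sources is a placeholder rather than the prototype's actual search logic.

```python
# Stand-in for the UMLS Knowledge Source servers:
# query string -> (concept name, semantic types)
METATHESAURUS = {
    "aspirin": ("Aspirin", {"Organic Chemical", "Pharmacological Substance"}),
    "bleeding time": ("Bleeding Time", {"Diagnostic Procedure"}),
}

# Stand-in for the ISM server: URN -> semantic types the source is indexed under
ISM_INDEX = {
    "urn:example:bibliographic-db": {"Diagnostic Procedure", "Pharmacological Substance"},
    "urn:example:image-archive": {"Anatomical Structure"},
}

# Stand-in for the URN-to-URL resolver
URN_TABLE = {
    "urn:example:bibliographic-db": ["http://db.example.org/search"],
    "urn:example:image-archive": ["http://images.example.org/"],
}


def lookup_concepts(query_strings):
    """Map user-entered strings to concepts and their semantic types."""
    return [METATHESAURUS[s.lower()] for s in query_strings
            if s.lower() in METATHESAURUS]


def find_sources(semantic_types):
    """Return URNs of sources sharing at least one semantic type (placeholder rule)."""
    return sorted(urn for urn, types in ISM_INDEX.items()
                  if types & semantic_types)


def resolve(urn):
    """Resolve a URN to zero or more URLs."""
    return URN_TABLE.get(urn, [])


def sourcerer(query_strings):
    """Chain the three lookups; the result maps each proposed URN to its URLs."""
    concepts = lookup_concepts(query_strings)
    types = set().union(*(t for _, t in concepts))
    return {urn: resolve(urn) for urn in find_sources(types)}


proposed = sourcerer(["aspirin", "bleeding time"])
```

The final step, passing the query on to a terminal information server, is omitted here because it depends on the syntax of each individual server.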

Figure 2. Architectural diagram of Sourcerer.

Prototype Walk-Through

The capabilities of the current Sourcerer prototype are best illustrated by stepping through an actual search. The prototype is intentionally didactic, illustrating the intermediate stages of the search-and-retrieval process. It is not presented as an example of optimal user interface design.

Figure 3 illustrates the top-level access document for Sourcerer, which allows access to servers for the ISM and semantic network, in addition to the Sourcerer prototype.

Figure 3. The Sourcerer home page.

In Stage 1 of the process of using Sourcerer, illustrated in Figure 4, the user has entered a query into the Sourcerer search form. Striving for simplicity and following other design precedents within NLM, the form presents only three text subwindows, each of which can contain a multi-word string describing a single biomedical concept. Although the concept windows are connected by radio buttons for the Boolean operations AND and OR, the current prototype employs only AND in its searches.

The forms are designed so that they are used in a top-down fashion. The upper part of the form includes instructions to the user, and various form-based widgets in which information can be entered. This is followed by a section in which the user can specify what action is to be taken with the supplied information; in the example of Figure 4, the default action is to proceed with a search of the UMLS Metathesaurus using the user-entered concepts.

In the example, the word "aspirin" has been entered into the first concept window, and the phrase "bleeding time" into the second. The bleeding time [Rodgers & Levin, 1990] is a medical diagnostic procedure in which a superficial incision of controlled dimensions is made on the forearm of a subject, and the time to cessation of bleeding recorded. It is thought to reflect both capillary physiology and the function of platelets, small cell products in the blood that participate in the early stages of clot formation. Platelet function is inhibited by aspirin; the user is trying to learn more about this interaction.

Figure 4. Stage 1: specify search expression.

Figure 5 presents Stage 2 in the use of Sourcerer, in which the user reviews the results of having consulted the UMLS Metathesaurus. In this instance, Sourcerer has found one and only one concept matching each of the two character strings entered in Stage 1. The Metathesaurus server returns matching concepts, as well as the semantic types associated with them (none of this is shown to the viewer at this stage; we will examine the information that came back from this search later in the walk-through). In requesting to proceed to Stage 3, the user triggers a consultation with the UMLS semantic network.

Figure 5. Stage 2: review Metathesaurus search results.

In Stage 3 (Figure 6), Sourcerer has used the semantic types obtained in Stage 2 to consult the UMLS semantic network. The user is presented with a list of noun-verb-noun triples (known as semantic type relationships), based on allowed relationships between the semantic types associated with the original query concepts. For maximal clarity to the user, these are presented in terms of the original query strings rather than the underlying semantic types. The user selects any of the triples that are deemed applicable to the query. In this case, the phrase "bleeding time assesses_effect_of aspirin" has been selected. Because a variant of the bleeding time test determines the bleeding time both before and after a challenge dose of aspirin, the phrase "bleeding time uses aspirin" has also been selected. The default action is to proceed to Stage 4 after a search of the ISM database.
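
One way to derive such triples is to test every ordered pair of query concepts against the relationships the semantic network allows between their semantic types, then report matches using the user's own strings. The sketch below assumes a tiny, hand-written excerpt of the network; which of aspirin's semantic types actually takes part in each relationship is an assumption, and the function name is hypothetical.

```python
from itertools import permutations

# Hypothetical excerpt of allowed (type, relation, type) triples.
ALLOWED = {
    ("Diagnostic Procedure", "assesses_effect_of", "Pharmacological Substance"),
    ("Diagnostic Procedure", "uses", "Pharmacological Substance"),
    ("Anatomical Structure", "part_of", "Organism"),
}


def candidate_triples(concepts):
    """concepts maps each query string to its set of semantic types.

    Returns (subject string, relation, object string) triples, expressed in
    the user's own query strings rather than the underlying semantic types.
    """
    found = set()
    for (s1, types1), (s2, types2) in permutations(concepts.items(), 2):
        for t1, relation, t2 in ALLOWED:
            if t1 in types1 and t2 in types2:
                found.add((s1, relation, s2))
    return sorted(found)


query = {
    "aspirin": {"Organic Chemical", "Pharmacological Substance"},
    "bleeding time": {"Diagnostic Procedure"},
}
triples = candidate_triples(query)
```

With this excerpt, `triples` holds the two phrases selected in the walk-through; the user would then tick the ones that apply.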

Figure 6. Stage 3: identify semantic relationships.

In Stage 4 (Figure 7), the user is presented with a list of information resources which have been deemed appropriate to the original query. This list is derived from a search of the ISM database using the semantic types, semantic type relationships, and MeSH headings derived from the user query in earlier stages.
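
Since the prototype combines terms only with AND, the ISM search can be pictured as requiring every requested index term to be present on a record. This is a sketch under that assumption; the record layout, names, and sample data are invented, and the prototype's real matching rules may differ.

```python
def matches(record, types, relationships, mesh):
    """AND semantics: all requested index terms must appear on the record."""
    return (types <= record["semantic_types"]
            and relationships <= record["relationships"]
            and mesh <= record["mesh"])


# Two made-up ISM records, reduced to their index fields.
records = [
    {
        "name": "Source A",
        "semantic_types": {"Diagnostic Procedure", "Pharmacological Substance"},
        "relationships": {("Diagnostic Procedure", "assesses_effect_of",
                           "Pharmacological Substance")},
        "mesh": {"Aspirin", "Bleeding Time"},
    },
    {
        "name": "Source B",
        "semantic_types": {"Diagnostic Procedure"},
        "relationships": set(),
        "mesh": {"Bleeding Time"},
    },
]

hits = [
    r["name"] for r in records
    if matches(
        r,
        types={"Diagnostic Procedure", "Pharmacological Substance"},
        relationships={("Diagnostic Procedure", "assesses_effect_of",
                        "Pharmacological Substance")},
        mesh={"Aspirin", "Bleeding Time"},
    )
]
```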

Figure 7. Stage 4: review/search identified sources (beginning of display).

In Figure 8, the user has selected one of the sources from the list presented in Figure 7, MEDLINE. This results in a formatted display of part of the ISM database record for MEDLINE.

Figure 8. The beginning of the MEDLINE ISM record display, obtained by selecting the MEDLINE anchor appearing in Figure 7.

Figure 9 shows a continuation of the formatted display of the partial ISM database record for MEDLINE (in this case, the free-text Definition and General Description fields; most of the fields that occur in an ISM record are not shown in this display).

Figure 9. The definition and general description fields of the MEDLINE ISM record display.

We now return to exploring the information that was obtained in earlier stages through consulting the UMLS Metathesaurus and semantic network. Employing the "Back" button of the Web client to return to the list of sources first shown in Figure 7, then scrolling down to the bottom of the document, selecting "View Search Information" as the action to be performed (see Figure 10), and then clicking on the "Execute Action" button produces the display shown in Figure 11.

Figure 10. Stage 4: review/search identified sources (bottom of display shown in Figure 7). The "View Search Information" action button has been selected, in order to display the UMLS information retrieved in the early stages of this use of Sourcerer.

This results in the display (Figure 11) of the information that was accrued in the earlier stages of Sourcerer, through consulting the Metathesaurus and semantic network. The user-entered string "bleeding time" matched one and only one Metathesaurus concept, the concept name for which is also "bleeding time." This display also shows the Metathesaurus unique identifier for this concept, its associated Semantic Type ("Diagnostic Procedure"), MeSH tree number (MeSH is organized as a set of topical hierarchical trees), and MeSH definition.

Figure 11. The top of search information summary display, presenting information about the Metathesaurus concept "Bleeding Time."

Scrolling down in the display shown in Figure 11 (see Figure 12) reveals the UMLS information that was retrieved in connection with the user-entered string "aspirin." The Metathesaurus returned: one and only one concept (the name for which is "Aspirin"); an extensive list of synonyms; two associated Semantic Types ("Organic Chemical" and "Pharmacological Substance"); and multiple MeSH tree numbers (this part of the output is truncated in Figure 12).

Figure 12. The middle section of the search information summary display, presenting information about the Metathesaurus concept "Aspirin."

The listed semantic types are associated with Web anchors. Selecting "Organic Chemical" from the display shown in Figure 12 leads to a display of information from the semantic network (including a unique identifier, name, tree number, and definition), as shown in Figure 13.

Figure 13. Semantic network record for semantic type "Organic Chemical," obtained by selecting the anchor shown in Figure 12.

Clicking on the "Back" button of the display shown in Figure 13 returns us to the summary of accrued search information already visited in Figures 11 and 12. Scrolling down further in this document (see Figure 14) reveals not only the complete list of MeSH headings truncated in Figure 12, but also the list of potentially valid semantic relationships that was determined from consulting the UMLS semantic network. The semantic types shown here were mapped into the corresponding user-specified strings for the display shown in Figure 6.

Figure 14. The bottom section of the search information summary display, presenting the list of potentially meaningful semantic relations determined by consulting the UMLS semantic network.

Clicking on the "Back" button returns us to the display shown in Figures 7 and 10. Selecting the MEDLINE anchor (see Figure 7) returns to the display shown in Figure 8. Selecting the first of the two URLs thus displayed results in the display of Figure 15.

Figure 15. The MEDLINE search page, reached by selecting the topmost URL appearing in Figure 8.

This is an experimental Web-based front-end to MEDLINE, the premier NLM bibliographic database. Known provisionally as NetCoach [Kingsland, Syed & Harbourt, 1994], this CGI application utilizes artificial intelligence methodologies developed for an earlier PC-based system, known as Coach [Harbourt, Syed, & Kingsland, 1993]. NetCoach uses UMLS and MeSH to provide assistance in searching the MEDLINE database. Selecting the button in Figure 15 leads to the initial NetCoach form (Figure 16), which requires entry of the user's account identifier and password.

Figure 16. The NetCoach password entry form.

Upon entering a user identification and password and selecting the "Proceed" button, the NetCoach query form (Figure 17) appears. Like Sourcerer, NetCoach employs three concept entry windows. The original user query has been automatically entered into the NetCoach form; the use of HTML forms-based hidden fields to pass query information enables this simple interprocess communication.
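
The hidden-field technique can be illustrated by generating a form whose hidden inputs carry the query from one CGI application to the next. This is a minimal sketch, not the code used by Sourcerer or NetCoach: the action path, the field names `concept1` and `concept2`, and the function names are all assumptions.

```python
from html import escape


def hidden_fields(state):
    """Render each key/value pair as an HTML hidden input, escaping quotes."""
    return "\n".join(
        f'<input type="hidden" name="{escape(name)}" value="{escape(value)}">'
        for name, value in state.items()
    )


def handoff_form(action_url, state):
    """Build a form that forwards the stored query when the user proceeds."""
    return (
        f'<form method="post" action="{escape(action_url)}">\n'
        f"{hidden_fields(state)}\n"
        '<input type="submit" value="Proceed">\n'
        "</form>"
    )


# Hypothetical target path and field names.
form = handoff_form(
    "/cgi-bin/netcoach",
    {"concept1": "aspirin", "concept2": "bleeding time"},
)
```

Because the values travel inside the returned document, the receiving application needs no shared storage with the sender; it simply reads the submitted form fields.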

Figure 17. The NetCoach query screen, with search pattern automatically inserted through interaction with Sourcerer via hidden fields.

Selecting the "Perform MEDLINE search" button results in the return of a summary of search results (Figure 18).

Figure 18. NetCoach search results screen. Forty-one articles were located in which "aspirin" and "bleeding time" co-occur.

Entry of the number of articles to be displayed in the appropriate field in the form of Figure 18 results in the display of matching bibliographic citations (with abstracts when available).

Figure 19. MEDLINE record for the first article retrieved. The word "aspirin" appears in the title and the abstract, and the (manually highlighted) string "bleeding time" appears in the abstract.

The first such article returned is shown in Figure 19. All 41 of the retrieved articles were found to be appropriate to the original user query.

Discussion

The version of Sourcerer discussed here was intended as a proof-of-concept prototype rather than a fully realized user application. As such, it is not being made publicly accessible. However, it has demonstrated the potential utility of using the UMLS Metathesaurus and semantic network, as well as the advantages of creating such a service in the context of the World Wide Web. Chief among these is the cross-platform graphical user interface support provided by the Web, which frees the information provider from the onerous and expensive task of creating multiple user interfaces. The chief disadvantage of using the Web arises from the stateless nature of HTTP.

Supporting extended interactions between user and service of the sort demonstrated above, in which the current interaction builds upon the results of past ones, requires that the server maintain a historical record of prior transactions with this particular user. This was achieved by the creation of a simple server-side state engine, based upon earlier experience with a large Web-based catalogued image archive [Rodgers & Srinivasan, 1994]. State is maintained in two files: a session database file and a search component file. The former contains the client IP address, a unique session identification number, the time of the last communication from the client, and pointers to active search components, which are stored in the search component file. Each document returned to a client contains the session identification number, in a forms-based hidden field. This allows Sourcerer to check for expired sessions when a document is returned, and to obtain the state information for a session.
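
A state engine of this kind can be sketched in a few functions. The version below is an illustration only: in-memory dictionaries stand in for the two files, and the 30-minute timeout, the check of the client IP address on each request, and all names are assumptions rather than details of the Sourcerer implementation.

```python
import time
import uuid

SESSION_TIMEOUT = 30 * 60  # seconds; assumed value

sessions = {}            # stands in for the session database file
search_components = {}   # stands in for the search component file


def new_session(client_ip, now=None):
    """Create a session record and return its identification number."""
    session_id = uuid.uuid4().hex
    sessions[session_id] = {
        "ip": client_ip,
        "last_seen": time.time() if now is None else now,
        "components": [],    # pointers into search_components
    }
    return session_id


def add_component(session_id, data):
    """Store one search component and link it to the session."""
    component_id = len(search_components) + 1
    search_components[component_id] = data
    sessions[session_id]["components"].append(component_id)
    return component_id


def load_state(session_id, client_ip, now=None):
    """Return the session's search components, or None if the session is
    unknown, comes from a different address, or has expired."""
    now = time.time() if now is None else now
    session = sessions.get(session_id)
    if session is None or session["ip"] != client_ip:
        return None
    if now - session["last_seen"] > SESSION_TIMEOUT:
        del sessions[session_id]     # expired: discard the session record
        return None
    session["last_seen"] = now
    return [search_components[c] for c in session["components"]]
```

On each request the CGI program would read the session identification number from the hidden field, call `load_state`, and either continue the search or send the user back to the first stage.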

A number of challenging problems remain to be addressed in the next stage of this project:

The design of a user interface that is at once simple enough for untrained users, and powerful enough for more advanced users. The interface will also have to take into account various user-imposed constraints, such as the amount of money to be spent if commercial information services are to be accessed.
Optimization of the information retrieval aspects of the system. This requires consideration of a number of related issues:
1. Development of metrics to assess the information retrieval performance of the system. Measurement of precision and recall are only practical when the investigator knows what the optimal responses are for a given query. In addition, the two-tiered approach of Sourcerer (selection of sources, followed by retrieval from those sources) implies the need for a two-tiered measurement of precision and recall. It may be possible to use a set of test questions developed at NLM for past precision/recall studies.
2. The optimal design of ISM database records. An optimal collection of record field types remains to be determined. This is akin to designing a new library cataloging system. The mechanics of delivery of the records also require attention. We are investigating the use of various commercial and non-commercial relational database systems.
3. Categorization of the types of information resources. It is important to determine what types of information sources are within scope for this project. There is clearly a difference, for example, between a Web or gopher site that simply points off to other sites, and a server at a major research facility that is acting as the sole source of original data.
4. Optimization of search algorithms. The challenge here is to formulate a search from user-supplied strings so as to optimize information retrieval. The current prototype implements only the AND operator, and searches against only the MeSH, semantic type, and semantic type relationship fields of ISM records. The next prototype will have to support the full set of logical operators (OR, AND, NOT) in complex patterns, and search against the full contents of the ISM records. The application of Coach-like guided searching is desirable, though it is not likely to be possible to apply this to the searching of individual databases. In a world of free databases, one approach would be to design a search so as to maximize recall of both information sources and the results returned from the sources, and then to improve precision by post-processing returned results.
Implementation of a true URN to URL resolution service. As currently envisaged by the Uniform Resource Identifier Working Group of the Internet Engineering Task Force, the Uniform Resource Name (URN) is a unique permanent identifier for electronic objects, akin to the familiar International Standard Book Number (ISBN). The form of the URN is currently under debate, as is the nature of services that might be designed to accept a URN and return a list of Uniform Resource Locators (URLs). One prominent contender for such a service is an extension to the well-established whois directory service, or whois++ [Gargano & Weiss, 1994]. Several proposals have been made regarding how to create and access distributed indexes using this protocol [Faltstrom, Schoultz, & Weider, 1994; Weider, Fullton, & Spero, 1994]. Due to the lack of clear standards for URNs and URL resolution services, implementation of this part of the Sourcerer architecture was deferred.
6. The creation of conventions for standardizing the interaction with terminal information sources. If Sourcerer is to be fully realized, it must be able to fetch database contents, not just compose lists of potentially useful information sources. The current interaction with MEDLINE demonstrates the potential capability of doing so within the Web environment, but even that interaction is hampered by the need to manually enter a user code and password. The Z39.50 protocol addresses such issues. Integration of Z39.50 into the Web is a possible solution, though an unattractive one in light of the protocol's daunting complexity. It should be possible to capture some of the better ideas of Z39.50 in conventions that require little or no modification to current Web protocols.
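
As an illustration of the Boolean searching called for under "Optimization of search algorithms" above, the following is a minimal sketch of how nested AND, OR, and NOT expressions might be evaluated against the indexed fields of a record. It is not the Sourcerer implementation: the Record class, its field names, the tuple-based query notation, and the sample source identifiers are all assumptions made for the example, and the records shown are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical stand-in for an ISM record; the field names are assumptions."""
    source_id: str
    mesh: set = field(default_factory=set)
    semantic_types: set = field(default_factory=set)

    def terms(self):
        # Pool the indexed fields, lowercased for case-insensitive matching.
        return {t.lower() for t in self.mesh | self.semantic_types}


def matches(query, terms):
    """Evaluate a nested query, e.g. ("AND", "a", ("NOT", "b")), against a term set."""
    if isinstance(query, str):
        return query.lower() in terms
    op, *args = query
    if op == "AND":
        return all(matches(a, terms) for a in args)
    if op == "OR":
        return any(matches(a, terms) for a in args)
    if op == "NOT":
        if len(args) != 1:
            raise ValueError("NOT takes exactly one operand")
        return not matches(args[0], terms)
    raise ValueError(f"unknown operator: {op!r}")


def search(query, records):
    """Return the identifiers of all records whose terms satisfy the query."""
    return [r.source_id for r in records if matches(query, r.terms())]


# Invented sample records, for illustration only.
RECORDS = [
    Record("src-a", {"Hemostasis"}, {"Physiologic Function"}),
    Record("src-b", {"Hemostasis", "Thrombosis"}, {"Disease or Syndrome"}),
    Record("src-c", {"History of Medicine"}, {"Occupation or Discipline"}),
]

if __name__ == "__main__":
    print(search(("AND", "hemostasis", ("NOT", "thrombosis")), RECORDS))
    print(search(("OR", "thrombosis", "history of medicine"), RECORDS))
```

A recursive evaluator of this kind keeps the operators composable, so that a recall-oriented query (a broad OR) can be issued first and a narrower expression applied afterwards when post-processing the returned results.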

The goal of the next stage of the Sourcerer Project is to develop a publicly usable application. Continuing information about the progress of the UMLS and Sourcerer Projects will be made accessible via HyperDOC, the NLM's Web server (Figure 20).

Figure 20. HyperDOC [National Library of Medicine, 1993], the principal NLM World Wide Web server.


Acknowledgments

The authors thank their colleagues from the UMLS team, and particularly Anna Harbourt, Bill Hole, Betsy Humphreys, Lawrence Kingsland, Dan Masys, Alexa McCray, and Edmund Syed, for their collaboration and critical comments.

References

P. Faltstrom, R. Schoultz, and C. Weider. How to Interact with a Whois++ Mesh, Internet Draft, Internet Engineering Task Force White Pages Requirement Working Group, July 1994 (available at: http://vinca.cnidr.org/protocols/whoispp/mesh00.html).
J. Gargano and K. Weiss. Whois and Network Information Lookup Service Whois++, Internet Engineering Task Force Whois and Network Information Lookup Service Working Group, Internet Draft, June 1994 (available at: http://vinca.cnidr.org/protocols/whoispp/lookup01.html).
A. M. Harbourt, E. J. Syed, and L. C. Kingsland III. The ranking algorithm of the Coach browser for the UMLS metathesaurus, in Proceedings of the Seventeenth Annual Symposium on Computer Applications in Medical Care. 1993: 720-724.
B. L. Humphreys and D. A. B. Lindberg. The UMLS project: making the conceptual connection between users and the information they need, in Bulletin of the Medical Library Association 81(2); 1993: 170-177.
L. C. Kingsland III, E. J. Syed, and A. M. Harbourt. NetCoach: A Web-based System for Intelligent Biomedical Information Retrieval, in Proceedings of the Second International World-Wide Web Conference, Chicago. 1994.
F. W. Lancaster and A. J. Warner. Language in Retrieval, in Information Retrieval Today. Information Resources Press, Arlington, Virginia. 1993: 89-107.
A. T. McCray, S. Srinivasan, and A. C. Browne. Lexical Methods for Managing Variation in Biomedical Terminologies, in Proceedings of the Eighteenth Annual Symposium on Computer Applications in Medical Care. 1994 (Accepted for publication).
National Library of Medicine. HyperDOC, a Multimedia/Hypertext Resource of the U.S. National Library of Medicine, 1993-present (available at http://www.nlm.nih.gov/).
R. P. C. Rodgers and J. Levin. A critical reappraisal of the bleeding time, in Seminars in Thrombosis and Hemostasis 16(1); 1990: 1-20.
R. P. C. Rodgers and S. Srinivasan. On-Line Images from the History of Medicine (OLI): Creating a Large Searchable Image Database for Distribution via World-Wide Web, in Proceedings of the First International World-Wide Web Conference. Geneva, 25-27 May 1994: 423-431 (paper available at: http://www.nlm.nih.gov/hmd.dir/oli.dir/paper/paper.html; system available at: http://www.nlm.nih.gov/hmd.dir/oli.dir/).
C. Weider, J. Fullton, and S. Spero. Architecture of the Whois++ Index Service, Internet Engineering Task Force Whois and Network Information Lookup Service Working Group, Internet Draft, July 1994 (available at: http://vinca.cnidr.org/protocols/whoispp/whois03.html).

NLM Sourcerer / Rodgers, Srinivasan, Fullton / Second International WWW Conference, October 1994
