A Word is Worth 1000 Pictures: Natural Language Access to Digital Libraries

Howard W. Beck
  
Amir M. Mobini
  
Viswanath Kadambari

University of Florida

Abstract: In spite of the enormous complexities of building natural language query systems for large databases, much can be learned by modeling information systems after the work that has been done in natural language understanding. Language may ultimately be the only way to access information from large systems. We will review conventional information retrieval techniques and their well-known limitations. We will present a progress report on our ongoing research and development in building an information retrieval system based on semantic data modeling techniques. We have implemented this system in the agricultural domain, and while we are not yet talking to our computer, we show how the system's evolution is building data structures which will ultimately lead to language understanding. We will examine the sources of information we are using for constructing concept networks and illustrate with examples. This includes examination of our existing database and theoretical work on knowledge discovery in databases, such as schema evolution, conceptual clustering, and query processing.

Introduction

Though providing access to digital libraries through natural language remains a formidable challenge, techniques used in natural language processing (NLP) can be applied to organizing large information collections and could ultimately lead to access to information through simple dialogues. The conventional hypertext model of data organization is not adequate for organizing large collections, and cannot support the searches needed to help users locate specific information directly, particularly at the global level where they are most needed.

For example, it does not seem extraordinary for a user to ask simple questions such as "how can I control insect pests" or "what materials should I place in a compost pile" and get a direct, simple answer. These questions do not have a complicated grammatical structure, nor are they excessively ambiguous. Yet current searching techniques on the World-Wide Web (WWW) and other digital libraries cannot handle these simple questions. Even if they were reduced to boolean keyword queries ("insects and control", "compost and materials"), assuming the user knew how to formulate such queries, precision and recall in keyword searching are known to be quite low [22]. Users retrieve many results which are irrelevant to their needs (and which are pointers to articles they must then read, rather than direct answers to questions), and they miss many which are relevant. This is particularly true in large databases, and in databases the size of the WWW the problem becomes acute.

We will see that the barrier lies in the use of conventional hypermedia techniques to structure knowledge. This paper examines conventional searching techniques and explores the use of semantic data models for information organization in hypermedia systems. Such models can assist in indexing and retrieval and offer short-term solutions to organizing data. Progressive levels of implementation of these models can ultimately lead to support for NLP.

The Basic Approach

Semantic Data Models

Semantic data models (SDM) are widely researched in the database community [17, 18, 23]. They are closely related to the semantic networks used in artificial intelligence, which were originally developed to support natural language processing. Hence, as database management systems they are capable of supporting large information systems such as those existing on the WWW, yet offer the potential for advanced inferencing capabilities including NLP, machine learning, and query processing.

Model Structure

SDMs can be seen as formalizing many of the relationships which are expressed in an ad hoc manner in conventional hypermedia systems. SDMs support a variety of formalized links and relationships. An example of a small network on insects is shown in Figure 1. The links in this graph express generalization/specialization or "ISA" relationships (Beneficial Insect ISA Insect), part/whole (Abdomen is part of an Insect), association (Ladybugs eat Aphids), and class/instance (Ladybug is an instance of Beneficial Insect).

Nodes in an SDM can also be viewed as objects having taxonomic associations and attributes with values. Each object has a unique identifier called the Object Identifier (OID). Taxonomic associations indicate the position of the object in a generalization/specialization taxonomy. Part/whole and associational relationships become attributes. Attributes have one or more values which can be simple (other classes or instances, strings, integers, real numbers) or complex (embedded objects, ordered and unordered sets, ranges). Causal and temporal relationships can be expressed using association attributes. For example, the temporal sequence of insect development from egg, to larvae, to pupae, to adult is expressed by temporal associational links.
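As a minimal sketch (our own illustration, not the actual implementation), the object view described above can be expressed as follows; each object carries a unique OID, an ISA link into the taxonomy, and attributes for part/whole and associational relationships, using the insect network of Figure 1:

```python
# Sketch of SDM objects: OID, taxonomic (ISA) link, and attribute values.
import itertools

_oid_counter = itertools.count(1)

class SDMObject:
    def __init__(self, name, isa=None):
        self.oid = next(_oid_counter)   # unique Object Identifier (OID)
        self.name = name
        self.isa = isa                  # generalization/specialization link
        self.attributes = {}            # part/whole and association links

    def add(self, attribute, value):
        # attributes may hold one or more values
        self.attributes.setdefault(attribute, []).append(value)

# The small insect network from Figure 1:
insect     = SDMObject("insect")
beneficial = SDMObject("beneficial insect", isa=insect)
aphid      = SDMObject("aphid", isa=insect)
ladybug    = SDMObject("ladybug", isa=beneficial)
insect.add("part", SDMObject("abdomen"))   # part/whole relationship
ladybug.add("eats", aphid)                 # associational relationship

def isa_chain(obj):
    """Walk ISA links up the generalization taxonomy."""
    while obj:
        yield obj.name
        obj = obj.isa

print(list(isa_chain(ladybug)))
# ['ladybug', 'beneficial insect', 'insect']
```

Walking the ISA chain is what makes a query on the broad term "insect" able to reach the specific instance "ladybug", something flat hypertext links cannot guarantee.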

Inferencing Supported by SDMs

Since concepts in SDMs are described by structured graphs expressing the relationships among symbols, rather than connections between text files as in conventional hypertext, SDMs can be manipulated to perform a number of desirable functions. Foremost is search, or query processing. We are experimenting with query processing based on graph matching techniques, by which the query is expressed as a small graph (a small semantic network). This query graph is then matched against the larger database graph to find connections. This gives a much more precise search capability than is possible with boolean keyword searches over text files. As we will show below, natural language expressions can be converted to such query graphs relatively easily.
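The idea of matching a small query graph against the database graph can be sketched as a simple conjunctive pattern match; the edge representation and the strings beginning with "?" (variables) are our own simplification, not the system's actual matcher:

```python
# Query-by-graph-matching sketch: the query is a small set of labeled
# edges (subject, relation, object), possibly containing variables,
# matched against the larger database graph.
database = {
    ("ladybug", "isa", "beneficial insect"),
    ("beneficial insect", "isa", "insect"),
    ("ladybug", "eats", "aphid"),
    ("aphid", "isa", "insect"),
}

def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def match_edge(edge, fact, binding):
    """Try to unify one query edge with one database fact."""
    b = dict(binding)
    for q, f in zip(edge, fact):
        if is_var(q):
            if b.setdefault(q, f) != f:   # variable already bound elsewhere
                return None
        elif q != f:
            return None
    return b

def match(query, binding=None):
    """Yield every variable binding that satisfies all query edges."""
    binding = binding or {}
    if not query:
        yield binding
        return
    first, rest = query[0], query[1:]
    for fact in database:
        b = match_edge(first, fact, binding)
        if b is not None:
            yield from match(rest, b)

# "What eats aphids?" expressed as a one-edge query graph:
print([b["?x"] for b in match([("?x", "eats", "aphid")])])
# ['ladybug']
```

A multi-edge query such as `[("?x", "eats", "?y"), ("?y", "isa", "insect")]` constrains the answer through shared variables, which is precisely the precision that boolean keyword search lacks.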

In addition to query processing, SDMs can support other inferencing operations. Imprecise query processing uses analogical reasoning to retrieve answers which are close to the desired answer but may not match exactly (near misses). Schema evolution is the process of building the database and, in particular, defining database classes or schemas. This process can be automated somewhat to provide assistance to database designers. For example, automatic classification can be used to determine where a new object should be placed in the generalization taxonomy. Discovering such associations in the network automatically is a form of machine learning or knowledge discovery in databases [11]. We are developing a conceptual clustering algorithm [15] which unifies many of these approaches into a single theoretical framework. The algorithm attempts to mimic cognitive processes involved in building categories.
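To illustrate the flavor of automatic classification during schema evolution (a toy example of our own devising, not the clustering algorithm of [15]), a new object can be placed under the existing class whose attribute set it most resembles:

```python
# Toy classification sketch: place a new object under the class whose
# attributes overlap most (Jaccard similarity over attribute sets).
# Class names and attributes here are illustrative only.
classes = {
    "insect":            {"legs", "antennae", "exoskeleton"},
    "beneficial insect": {"legs", "antennae", "exoskeleton", "eats pests"},
    "plant":             {"leaves", "roots"},
}

def classify(new_attrs):
    def score(cls):
        attrs = classes[cls]
        return len(attrs & new_attrs) / len(attrs | new_attrs)
    return max(classes, key=score)

# A lacewing shares all attributes of "beneficial insect":
lacewing = {"legs", "antennae", "exoskeleton", "eats pests"}
print(classify(lacewing))
# beneficial insect
```

A real conceptual clustering algorithm must also decide when to create a new class rather than reuse an existing one; this sketch shows only the placement step.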

Symbolic Approach to NLP

The symbolic approach to NLP involves treating words as nodes in a semantic network and associations among words (such as those occurring in a sentence) as links between nodes in this network. It is possible to describe the structure of a sentence in natural language by using a semantic network. Once in this graphical form, graph matching and other techniques for manipulating networks can be used to draw inferences, trigger associations and, for the task at hand, process queries.

In our graph-matching approach to semantic networks, we integrate the two processes of knowledge acquisition and knowledge retrieval. The process of adding new information to the network is essentially the same as searching. This is viewed as the "indexing problem" from the knowledge acquisition viewpoint or the "mapping problem" from the query processing perspective [21]. Both processes involve matching a smaller concept graph to a universal concept network based on the explicit and implicit meanings conveyed and their relevant attributes.

Case-based Reasoning as a Theory of Word Meaning

In order to construct an NLP system, one must construct a large dictionary. Many of the recent advances in text understanding systems [24] can be attributed to advances in the design and construction of large lexicons. But that presupposes that we understand how to represent word meaning. We use a case-based reasoning approach to meaning [13, 21]. Words obtain meaning from how they are used. We hear a particular word being used in many different situations and contexts; we may hear the same word used in hundreds or thousands of situations. Each occurrence of the word is treated as one case. Similarities among cases can be observed, and cases with similar usage can be clustered together into categories. When we hear a word used in a new situation, according to this theory, we retrieve cases from our case-based memory which are similar to the new situation, in order to apply what happened before to the new situation. The meaning of a particular word is established by a large case base, and thus a single word may be "worth 1000 cases".

We built a case-base for the word "on" by observing its occurrence in random samples of text. This case-base contains entries such as:

        get a grip on		reared on
        grip on the strap	thrive on
        on legal grounds	relate on an emotional level
        grow on			held on bond

The current case-base for "on" contains 700 entries. We have found that the case-base for most words is actually much smaller than this, but "on" is very common and is used in many situations. As a test, we looked at new text (text which was not used to build the case base), and found that 98% of the occurrences of "on" in the new text could be related to the existing cases either directly or through an analogous use. The strength of case-based reasoning lies in its flexibility. It is not required that a new usage exactly match an existing case; in fact, much of the creativity of language involves slight modifications in usage. We can understand the new usage by relating it to existing cases.
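The retrieval step can be illustrated with a toy sketch (the similarity measure is our own stand-in, far simpler than what the system actually requires): a new usage of "on" is understood by retrieving the stored cases whose context words overlap with it most.

```python
# Case-based retrieval sketch: each recorded usage of "on" is a case;
# a new usage retrieves the most similar stored cases.
cases = [
    "get a grip on",
    "grip on the strap",
    "on legal grounds",
    "grow on",
    "reared on",
    "thrive on",
    "relate on an emotional level",
    "held on bond",
]

def similarity(a, b):
    """Jaccard overlap of the context words surrounding 'on'."""
    wa = set(a.split()) - {"on"}
    wb = set(b.split()) - {"on"}
    return len(wa & wb) / (len(wa | wb) or 1)

def retrieve(new_usage, k=2):
    """Return the k stored cases most similar to the new usage."""
    return sorted(cases, key=lambda c: similarity(new_usage, c), reverse=True)[:k]

print(retrieve("a firm grip on the rope"))
```

The new phrase need not match any case exactly; the shared context words ("grip", "the") pull in the two grip-related cases, mirroring how an analogous use is related to existing cases.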

Parsing and Graph Matching

Analyzing a simple natural language expression involves three steps: analyzing the grammatical structure to determine the associations among words in the expression, using these associations to instantiate templates and produce a semantic network representation of the expression's meaning, and matching this network against the larger database network in order to relate the expression to existing information.

We are using a chart parser to produce a grammatical representation of the phrase. The initial step in analyzing a phrase involves identifying the "head" of the phrase. The head is modified by the other parts of the phrase and plays an essential role in the mapping process. The parser systematically applies phrasal patterns from a large case base to the phrase being analyzed. Each pattern which matches is associated with a template which is filled with values from various locations in the pattern.

Figure 2 illustrates this process both for the addition of new information based on an input sentence and for processing a related query. The query is represented by a small graph which initiates the mapping to the semantic hierarchy. The small graph is mapped to the semantic network by creating a link from each node in the smaller graph to the corresponding nodes in the network, starting with the most general concept (the head) and ending with the most specific. This creates a unique instance which is the intersection of all of the nodes involved in the query, and may be used to narrow down a neighborhood based on the requested information. Figure 3 demonstrates this process and shows how the answer to the query is produced by exploiting the relationship between "spending leisure time" and "having a chance to go fishing" (both are "doing").
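The pattern-to-template step can be sketched in miniature; the pattern, the relation name `engaged_in`, and the template function below are hypothetical illustrations of the mechanism, not entries from our actual case base of phrasal patterns:

```python
# Sketch: a phrasal pattern matched against the input fills a template
# whose slots become the nodes and links of a small query graph.
import re

# Hypothetical pattern: "How did X spend his/her Y?"
# -> query-graph edge (X, engaged_in, Y)
patterns = [
    (re.compile(r"how did (\w+) spend (?:his|her) ([\w ]+)\??", re.I),
     lambda m: [(m.group(1).lower(), "engaged_in", m.group(2).lower())]),
]

def to_query_graph(sentence):
    """Return the query graph for the first matching pattern, else None."""
    for pattern, template in patterns:
        m = pattern.fullmatch(sentence.strip())
        if m:
            return template(m)
    return None

print(to_query_graph("How did John spend his leisure time?"))
# [('john', 'engaged_in', 'leisure time')]
```

The resulting edge list is exactly the "small graph" form that the mapping step matches against the database network.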

Existing Searching Techniques on WWW

Existing searching techniques on WWW fall into two main categories: hypertext browsing and keyword searching (and a combination of the two). Navigating in hypermedia systems traditionally involves manual traversal of hyperlinks. It is well known that this technique by itself cannot be used to search hypermedia systems containing more than several hundred nodes. Users quickly become lost because of the size of the search space. Nevertheless, subject trees [10], which offer navigational pointers by subject for collected works, are widely used on WWW. If they can only be searched by navigation, then the user must know the meaning of very broad terms, and be able to judge where the specific information of interest falls under those terms.

In keyword searching, users can enter a word or combination of words and phrases, including boolean combinations of such terms, and locate text containing these terms. The text search may be limited to file names or menu items, as in Archie [2] and Veronica [6], or to brief record descriptions of documents, as in Aliweb [1]. There is also a limit to how much of the information on the network is covered by a particular keyword searching program.

Keyword searches typically make use of a pre-compiled index (inverted list) which contains an entry for each word, with pointers to all documents containing that word. The keyword search can also be done at run time [3] without using an index, by searching documents during query processing (this is very inefficient and would be limited to a small, localized collection of publications). A combination of the two approaches can be used, by which new queries are processed at run time and the results are added to an index for future reference.
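The inverted-list structure itself is simple enough to sketch directly (the three sample documents are our own invention):

```python
# Minimal inverted list: one entry per word, pointing at every document
# containing that word; a boolean AND query intersects the entries.
from collections import defaultdict

docs = {
    1: "ladybugs control aphid pests",
    2: "compost pile materials",
    3: "insect pests damage crops",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

def boolean_and(*terms):
    """Documents containing every query term."""
    sets = [index[t] for t in terms]
    return set.intersection(*sets) if sets else set()

print(sorted(boolean_and("pests")))            # [1, 3]
print(sorted(boolean_and("insect", "pests")))  # [3]
```

Note that the result is a set of document pointers, not an answer: the user must still read document 3 to learn anything about insect pests, which is precisely the limitation discussed above.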

The nature of the WWW presents an unusual problem for building indexes. Since there is no control over when and how documents are added to the system, there is no way to ensure that they are added to an index. This problem is further complicated when a document is modified. It is addressed somewhat by the use of robots and spiders such as Lycos [5] and WebCrawler [9]. These programs automatically search the network by traversing links and locating HTML documents, which are then processed and added to an index. But such programs place a heavy burden on network resources, particularly since they must search the network repeatedly to find updated materials (both new and revised). Better management can be achieved if sites actively participate by developing indexes over their own materials and passing these local indexes up to regional and global sites (as was done in Aliweb [1]).

Finally, there appears to be little work being done on visual interfaces for global searching. Such interfaces (which are essentially a variation of hypertext browsing), exemplified by the Virtual Tourist [7], can be very effective for local searching. But it may not be possible to visually communicate the information content of an entire digital library. That implies that words, rather than pictures, are the key to unlocking access to such libraries.

Steps towards Natural Language Understanding

Here we present a progression of steps, beginning with what is already being done today, continuing with things which could be done within practical limitations, and ending with approaches which, though too difficult at present to implement on a large scale, serve as a theoretical goal for what can be achieved.

Fulltext Search/Hypermedia

The conventional searching techniques are, from one perspective, not fundamentally different from what is proposed below. In a way, the WWW is a network of interconnected concepts and associations among related concepts. Keyword search can be viewed as matching words against this network in order to identify relevant neighborhoods within the network for further exploration. What follows is an attempt to formalize this process by working towards representations of word meaning which lead to an understanding of search terms, and by cleaning up the network design to improve searching over concepts.

Thesaurus

There are several ways to make incremental improvements to keyword searching. One is to include a thesaurus. The thesaurus can be represented by an SDM, thus general associations among thesaurus terms (general/specific, part/whole, synonyms) can be captured in a concept network. WorldViews, a thesaurus based query system, provides a "conceptual map of the information space" which allows for retrieval of information that is implicitly referenced by the query as well as the explicit references [19]. For example, if a piece of text W contains an explicit reference to a WorldViews concept C, it also contains implicit references to any node that is more general than C.

Indexing documents onto the concept map in WorldViews can still be automated by observing the occurrences of thesaurus terms within a particular document. Queries in WorldViews are based on boolean search, relying on keywords drawn from the thesaurus. In the next section, we will explore how search can be further improved by utilizing syntactic relationships in simple natural language expressions rather than boolean queries.
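The implicit-reference idea can be sketched as query expansion over a broader-term hierarchy; the tiny thesaurus below is our own example, not WorldViews data:

```python
# Thesaurus-expansion sketch: a document that explicitly mentions concept
# C implicitly references every concept more general than C, so a query
# on a broad term should also retrieve its specializations.
broader = {   # child term -> broader (more general) term
    "ladybug": "beneficial insect",
    "beneficial insect": "insect",
    "aphid": "insect",
}

def ancestors(term):
    """Walk the broader-term chain upward from a term."""
    while term in broader:
        term = broader[term]
        yield term

def expand(term):
    """The term plus every thesaurus term whose ancestor chain reaches it."""
    return {t for t in broader if term in ancestors(t)} | {term}

print(sorted(expand("insect")))
# ['aphid', 'beneficial insect', 'insect', 'ladybug']
```

A keyword query on "insect" is then run against the whole expanded set, so a document mentioning only "ladybug" is still retrieved.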

Syntactic Analysis

Further improvements can be made by utilizing the relations among the concepts in a query. In natural language, relationships among words are established by grammar or syntax. The syntactic analysis of a given query produces a structured representation of the query which allows for comparison to patterns stored in the database. The relationships among concepts in the query are mirrored by relationships among concepts in the database. So, in our example sentence, "He [John] had a chance to go fishing", although a thesaurus-based search system such as WorldViews would recognize the implicit reference to activities, it would have no reference to the relationship between "John" and "fishing" in this particular instance. In syntactic analysis, this relationship would be exposed through the particular pattern used in the sentence and the corresponding object template. In fact, the relation "having a chance to" would have implicit references to several more general concepts such as "doing" or "being able to." So, although a query such as "How did John spend his leisure time?" could be reduced to a boolean search, there would be two problems without syntactic analysis. First, if we simply use "John and leisure and time," we would most likely retrieve many more instances than the one we need, since there is no restriction on their relationships (i.e., John could be planning or thinking about his leisure time or, as implied in other parts of the sample instance, he could be "missing" his leisure time). Second, if we do include "spend" in our search string, we would not retrieve the instance we are looking for, because there is no reference to "spending" in our sample instance. Syntactic analysis is therefore essential for reaching a higher level of precision and recall.

Natural Language Dialogues

The systems and designs discussed so far are still based on assisted searches, in which the user is responsible for doing some manual navigation within a local area and for retrieving and reading documents which contain the relevant information. This is still short of a full natural language dialogue. In a human-machine dialogue, as opposed to single-sentence processing, a conversation takes place, and during the conversation the computer begins to build a model of the user's intentions, may adapt to the user's vocabulary, and resolves pronouns, ellipsis, and other references. The system replies to the user's questions using generated natural language expressions rather than retrieving and displaying pre-existing text files, thus providing direct answers to questions. There are research systems capable of providing this level of performance [20]. The Message Understanding Conference (MUC) [24] brings together some of the best research systems for analyzing and extracting information from text in a competition. They work on many of the same principles briefly outlined in this paper. However, they are thus far limited to very narrow domains.

Implementation Status

For the past several years we have been producing agricultural databases and distributing them on CD-ROM [16]. Locally we have developed an SDM for use in construction of information retrieval systems [12]. We are currently using this to model relationships between concepts within and among various disciplines. Thus far the approach has proven beneficial for organizing our information, which includes text, fielded data, images, movies, sounds, mathematical equations, expert systems, and computer simulations. This approach has advantages over the traditional hypermedia model by providing basic database services such as storage management, version control, integrity checks, and security. Because of SDM's ability to describe a wider variety of data types, we can include multimedia objects and other unusual types (such as rules in an expert system or mathematical equations) directly in the database and provide the same level of service for these types that can be done for numerical or textual data.

We have also used the objects in this system to represent and store documents [14]. Our current WWW site (http://hammock.ifas.ufl.edu) consists of HTML files which were generated from our database. In the future we plan to use the database directly as a server, thus capitalizing on its searching capabilities to provide better access to the information.

Our current efforts are aimed at automatically generating a thesaurus from existing sources, and developing searching techniques over this thesaurus. Some existing sources include the Library of Congress Subject Headings, and Agricola Subject Category Codes from the National Agricultural Library. Since these thesauri already contain a great deal of information on association among concepts, it is possible to import these automatically into the SDM.

References


[1] Aliweb. http://www.cs.indiana.edu/aliweb/search

[2] Archie.  ftp://sifon.cc.mcgill.ca/pub/Network/archie

[3] Fish Search. http://www.win.tue.nl/help/help-on-fish-all.html

[4] Joel's Hierarchical Subject Index. http://www.cen.uiuc.edu/~jj9544

[5] Lycos. http://fuzine.mt.cs.cmu.edu/cgi-bin/pursuit

[6] Veronica. ftp://sunic.sunet.se/old-gopher-data/veronica

[7] Virtual Tourist.  http://wings.buffalo.edu/world

[8] WAIS.  http://info.cern.ch/hypertext/Products/WAIS/Overview.html

[9] WebCrawler. http://www.biotech.washington.edu/WebCrawler/WebCrawler.html

[10] The WWW Virtual Library.
     http://info.cern.ch/hypertext/DataSources/bySubject/Overview.html

[11] Anwar, T. M., H. W. Beck, and S. B. Navathe. 1992. Knowledge Mining by
Imprecise Querying: A Classification-Based Approach.  IEEE Eight
International Conference on Data Engineering.  Phoenix, Arizona.

[12] Beck, H. W., S. K. Gala, and S. B. Navathe. 1989.  Classification as a
query processing technique in the CANDIDE semantic data model. Proc. Fifth
International Conference on Data Engineering. IEEE. Los Angeles, CA.
pp.  572-581.

[13] Beck, H.  W.  1991.  Language acquisition from Cases. Proceedings of
the DARPA Case-based Reasoning Workshop.  Morgan Kaufmann, Inc.  San Mateo,
CA. pp 159-169.

[14] Beck, H. W. and D. Watson.  1992. Analysis of Agricultural Extension
Documents.  AI Applications 6(3):17-28.

[15] Beck, H. W., T. Anwar, and S. Navathe. 1994.  A conceptual clustering
algorithm for database schema design.  IEEE Transactions on Knowledge and
Data Engineering. 6(3):396-411.

[16] Beck, H. W., P. Jones, and D. G. Watson. 1994. A CD-ROM Based
Agricultural Information Retrieval System.  Applied Engineering in
Agriculture. 10(1):127-132.

[17] Borgida, A., R. Brachman and D. McGuinness and L. Resnick. 1989.
CLASSIC:  A Structural Data Model for Objects. Proc. ACM SIGMOD
International Conference on Management of Data. J. Clifford and B. Lindsay
and D. Maier (eds.) ACM Press. New York, NY. pp. 58-67.

[18] Elmasri, R., J. Weeldreyer, and A. Hevner. 1985. The Category Concept:
An Extension to the Entity-Relationship Model. International Journal on Data
and Knowledge Engineering. 1(1).

[19] Ginsberg, A. 1993. A Unified Approach to Automatic Indexing and
Information Retrieval. IEEE Expert. pp. 46-56.

[20] Jacobs, P., and L. Rau.  1990. SCISOR: Extracting Information from
On-line News. Communications of the ACM.  33(11):88-97.

[21] Kolodner, J. 1990. An introduction to case-based reasoning. Tech. Rep.
No. GIT-ICS-90/19. School of Info. and Comp. Science, Georgia Inst. of
Technology. Atlanta, GA.

[22] Salton, G.  and M. McGill. 1983. Introduction to Modern Information
Retrieval, New York: McGraw-Hill.

[23] Su, S. Y. W., V. Krishnamurthy, and H. Lam. 1988. An object-oriented
semantic association model (OSAM*). AI in Industrial Engineering and
Manufacturing: Theoretical Issues and Applications. S. Kumar, A. L. Soyster,
and R. L. Kashyap (eds.). American Institute of Industrial Engineers.

[24] Sundheim, B. Ed. 1992. Proc. Fourth Message Understanding Conference.
Morgan Kaufmann.  San Mateo, CA.

Send Correspondence To:


Howard W. Beck
Agricultural Engineering Department
12 Rogers Hall
University of Florida
Gainesville, FL  32611
hwb@cis.ufl.edu