Amir M. Mobini
Viswanath Kadambari
University of Florida
Abstract: In spite of the enormous complexity of building natural language query systems for large databases, much can be learned by modeling information systems on the work that has been done in natural language understanding. Language may ultimately be the only way to access information from large systems. We will review conventional information retrieval techniques and their well-known limitations. We will present a progress report on our ongoing research and development in building an information retrieval system based on semantic data modeling techniques. We have implemented this system in the agricultural domain, and while we are not yet talking to our computer, we show how the system's evolution is building data structures which will ultimately lead to language understanding. We will examine the sources of information we are using for constructing concept networks and illustrate with examples. This will include examination of our existing database and theoretical work on knowledge discovery in databases, such as schema evolution, conceptual clustering, and query processing.
Though providing access to digital libraries through natural language remains a formidable challenge, techniques used in natural language processing (NLP) can be applied to organizing large information collections and could ultimately lead to accessing information through simple dialogues. The conventional hypertext model of data organization is not adequate for organizing large collections, and cannot support the searches needed to help users locate specific information directly, particularly at the global level where they are most needed.
For example, it does not seem extraordinary for a user to ask simple questions such as "how can I control insect pests" or "what materials should I place in a compost pile" and get a direct, simple answer. These questions do not have a complicated grammatical structure, nor are they excessively ambiguous. Yet current searching techniques on the World-Wide Web (WWW) and other digital libraries cannot handle these simple questions. Even if they were reduced to boolean keyword queries ("insects and control", "compost and materials"), assuming the user knew how to formulate such queries, precision and recall in keyword searching are known to be quite low [22]. Users retrieve many answers which are irrelevant to their needs (and which are pointers to articles they must then read, rather than direct answers to questions), and they miss retrieving many which are relevant. This is particularly true in large databases, and in databases the size of the WWW the problem becomes acute.
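The precision and recall measures mentioned above can be made concrete with a small worked example; the document sets and query results here are invented purely for illustration:

```python
# Illustrative precision/recall computation for a keyword search.
# The relevant and retrieved sets are invented for this example.
relevant = {"doc1", "doc2", "doc3", "doc4", "doc5"}   # documents that answer the question
retrieved = {"doc1", "doc2", "doc9", "doc10"}         # documents the boolean query returned

hits = relevant & retrieved
precision = len(hits) / len(retrieved)   # fraction of retrieved docs that are relevant
recall = len(hits) / len(relevant)       # fraction of relevant docs that were retrieved

print(precision, recall)   # 0.5 0.4
```

Here half of what the user retrieves is irrelevant, and more than half of the relevant material is never found; the gap only widens as the collection grows.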
We will see that the barrier to doing this lies in the use of conventional hypermedia techniques to structure knowledge. This paper examines conventional searching techniques and explores the use of semantic data models (SDMs) for information organization in hypermedia systems. Such models can assist in indexing and retrieval and offer short-term solutions to organizing data. Progressive levels of implementation of these models can ultimately lead to support for NLP.
Nodes in an SDM can also be viewed as objects having taxonomic associations and attributes with values. Each object has a unique identifier called the Object Identifier (OID). Taxonomic associations indicate the position of the object in a generalization/specialization taxonomy. Part/whole and associational relationships become attributes. Attributes have one or more values which can be simple (other classes or instances, strings, integers, real numbers) or complex (embedded objects, ordered and unordered sets, ranges). Causal and temporal relationships can be expressed by using association attributes. For example, the temporal sequence of insect development from egg, to larva, to pupa, to adult is expressed by temporal associational links.
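The insect development example might be sketched as follows; the class and attribute names are our own illustrative choices, not those of any particular SDM implementation:

```python
# Minimal sketch of SDM nodes as objects with OIDs, a taxonomic (is-a)
# association, and associational attributes. Names are illustrative.
class SDMObject:
    def __init__(self, oid, name, superclass=None):
        self.oid = oid                # unique Object Identifier
        self.name = name
        self.superclass = superclass  # position in the generalization taxonomy
        self.attributes = {}          # part/whole and associational relationships

insect = SDMObject(1, "insect")
egg = SDMObject(2, "egg")
larva = SDMObject(3, "larva")
pupa = SDMObject(4, "pupa")
adult = SDMObject(5, "adult", superclass=insect)

# temporal associational links expressing the development sequence
egg.attributes["develops_into"] = larva
larva.attributes["develops_into"] = pupa
pupa.attributes["develops_into"] = adult

# following the chain recovers the temporal sequence
stage, path = egg, ["egg"]
while "develops_into" in stage.attributes:
    stage = stage.attributes["develops_into"]
    path.append(stage.name)
print(" -> ".join(path))   # egg -> larva -> pupa -> adult
```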
In addition to query processing, SDMs can support other inferencing operations. Imprecise query processing uses analogical reasoning to retrieve answers which are close to the desired answer but may not match exactly (near misses). Schema evolution is the process of building the database and, in particular, defining database classes or schemas. This process can be automated somewhat to provide assistance to database designers. For example, automatic classification can be used to determine where a new object should be placed in the generalization taxonomy. Discovering such associations in the network automatically is a form of machine learning or knowledge discovery in databases [11]. We are developing a conceptual clustering algorithm [15] which unifies many of these approaches into a single theoretical framework. The algorithm attempts to mimic cognitive processes involved in building categories.
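A toy version of automatic classification conveys the idea: a new object is placed under the most specific class whose attribute requirements it satisfies. The taxonomy and attribute sets below are invented for illustration and are much simpler than the clustering algorithm [15]:

```python
# Toy automatic classification: place an object under the most specific
# class whose required attributes it satisfies. Taxonomy is illustrative.
taxonomy = {
    "organism": {"required": {"alive"}, "parent": None},
    "insect":   {"required": {"alive", "six_legs"}, "parent": "organism"},
    "beetle":   {"required": {"alive", "six_legs", "hard_wings"}, "parent": "insect"},
}

def classify(obj_attrs):
    # candidates are classes whose requirements the object meets;
    # the most specific one has the largest requirement set
    candidates = [c for c, d in taxonomy.items() if d["required"] <= obj_attrs]
    return max(candidates, key=lambda c: len(taxonomy[c]["required"])) if candidates else None

print(classify({"alive", "six_legs"}))                  # insect
print(classify({"alive", "six_legs", "hard_wings"}))    # beetle
```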
In our approach of graph matching in semantic networks, we integrate the two processes of knowledge acquisition and knowledge retrieval. The process of adding new information to the network is essentially the same as searching. This is viewed as the "indexing problem" from the knowledge acquisition viewpoint or the "mapping problem" from the query processing perspective [21]. Both processes involve matching a smaller concept graph to a universal concept network based on the explicit and implicit meanings conveyed and their relevant attributes.
We built a case-base for the word "on" by observing its occurrence in random samples of text. This case-base contains entries such as:
get a grip on
reared on
grip on the strap
thrive on
on legal grounds
relate on an emotional level
grow on
held on bond

The current case-base for "on" contains 700 entries. We have found that the case-base for most words is actually much smaller, but "on" is very common and is used in many situations. As a test, we looked at new text (text which was not used to build the case-base), and found that 98% of the occurrences of "on" in the new text could be related to the existing cases either directly or through an analogous use. The strength of case-based reasoning lies in its flexibility. A new usage is not required to match an existing case exactly; in fact, much of the creativity of language involves slight modifications in usage. We can understand the new usage by relating it to existing cases.
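The matching step can be sketched as follows, using a few of the cases listed above and simple word overlap as a stand-in for the analogical matching described in the text:

```python
# Sketch of case-base lookup for the preposition "on". The cases are a
# few of the entries quoted above; word overlap is a simplified stand-in
# for the analogical matching described in the text.
cases = ["get a grip on", "thrive on", "on legal grounds",
         "relate on an emotional level", "grow on", "held on bond"]

def best_case(phrase):
    words = set(phrase.split())
    # prefer the stored case sharing the most words with the new usage
    return max(cases, key=lambda c: len(words & set(c.split())))

print(best_case("thrive on challenges"))   # thrive on
```

A new usage such as "thrive on challenges" need not match any case exactly; it is understood by relating it to the closest existing case.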
We are using a chart parser to produce a grammatical representation of the phrase. The initial step in analyzing a phrase involves identifying the "head" of the phrase. The head is modified by the other parts of the phrase and plays an essential role in the mapping process. The parser systematically applies phrasal patterns from a large case base to the phrase being analyzed. Each pattern which matches is associated with a template which is filled with values from various locations in the pattern.
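The pattern-matching and template-filling step might be sketched as follows; the pattern syntax and templates are invented for illustration and are not the parser's actual case base:

```python
# Sketch of pattern-to-template filling. Each phrasal pattern that
# matches is associated with a template filled from the match groups.
# Patterns and templates here are invented for illustration.
import re

patterns = [
    (r"^control (?P<head>\w+) pests$",
     {"action": "control", "head": "{head}", "modifier": "pests"}),
    (r"^(?P<mod>\w+) pile$",
     {"head": "pile", "modifier": "{mod}"}),
]

def analyze(phrase):
    for pat, template in patterns:
        m = re.match(pat, phrase)
        if m:
            # fill template slots with values from the matched pattern
            return {k: v.format(**m.groupdict()) for k, v in template.items()}
    return None

print(analyze("control insect pests"))  # {'action': 'control', 'head': 'insect', 'modifier': 'pests'}
print(analyze("compost pile"))          # {'head': 'pile', 'modifier': 'compost'}
```

Note that in "compost pile" the head is "pile", with "compost" as its modifier, reflecting the head-identification step described above.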
Figure 2 illustrates this process both for the addition of new information based on an input sentence and for processing a related query. The query is represented by a small graph which initiates the mapping to the semantic hierarchy. The small graph is mapped to the semantic network by creating a link from each node in the smaller graph to the corresponding nodes in the network, starting with the most general concept (the head) and ending with the most specific. This creates a unique instance which is the intersection of all of the nodes involved in the query and may be used to narrow down a neighborhood based on the requested information. Figure 3 demonstrates this process and shows how the answer to the query is produced by exploiting the relationship between "spending leisure time" and "having a chance to go fishing" (both are "doing").
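The intersection step can be sketched with a toy network (its contents are invented for illustration): each query concept contributes a set of candidate instances, and intersecting them narrows the neighborhood to the answer:

```python
# Sketch of query-graph mapping: each concept in the small query graph
# links to a node in the network, and the answer is the intersection of
# the instances under every query concept. Network contents are invented.
network = {
    "insect":  {"aphid", "beetle", "ladybug"},   # instances under each concept
    "pest":    {"aphid", "beetle"},
    "control": {"beetle", "ladybug"},
}

def map_query(concepts):
    # intersect candidate sets, starting from the most general concept
    result = None
    for c in concepts:
        result = network[c] if result is None else result & network[c]
    return result

print(map_query(["insect", "pest", "control"]))   # {'beetle'}
```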
In keyword searching, users can enter a word or combination of words and phrases, including boolean combinations of such terms, and locate text containing these terms. The text search may be limited to file names or menu items as in Archie [2] and Veronica [6], or to brief record descriptions of documents as in Aliweb [1]. There is also a limit to how much of the information on the network is covered by any particular keyword searching program.
Keyword searches typically make use of a pre-compiled index (inverted list) which contains an entry for each word, with pointers to all documents containing that word. The keyword search can also be done at run time [3] without using an index by searching documents during query processing, though this is very inefficient and must be limited to a small, localized collection of publications. A combination of the two approaches can be used whereby new queries are processed at run time and the results are added to an index for future reference.
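A minimal inverted index over a toy document collection illustrates the pre-compiled approach (the documents are invented for the example):

```python
# Minimal inverted index ("pre-compiled index") over a toy collection.
docs = {
    "d1": "insect control in the garden",
    "d2": "building a compost pile",
    "d3": "compost materials and insect pests",
}

index = {}
for doc_id, text in docs.items():
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)   # word -> documents containing it

# a boolean AND query intersects the posting sets
result = index["compost"] & index["insect"]
print(sorted(result))   # ['d3']
```

The index is built once; each boolean query then reduces to cheap set operations over the posting lists, which is what makes pre-compiled indexes practical at scale.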
The nature of the WWW presents an unusual problem for building indexes. Since there is no control over when and how documents are added to the system, there is no way to ensure that they are added to an index. The problem is further complicated when a document is modified. It is addressed somewhat by the use of robots and spiders such as Lycos [5] and WebCrawler [9]. These programs automatically search the network by traversing links and locate HTML documents, which are then processed and added to an index. But such programs place a heavy burden on network resources, particularly since they must search the network repeatedly to find updated materials (both new and revised). Better management can be achieved if sites actively participate by developing indexes over their own materials and passing these local indexes up to regional and global sites (as was done in Aliweb [1]).
Finally, there appears to be little work being done on visual interfaces for global searching, as was done with the Virtual Tourist [7]. Such interfaces (which are essentially a variation of hypertext browsing) can be very effective for local searching. But it may not be possible to visually communicate the information content of an entire digital library. That implies that words rather than pictures are the key to unlocking access to such libraries.
Indexing documents onto the concept map in WorldViews can still be automated by observing the occurrences of thesaurus terms within a particular document. Queries in WorldViews are based on boolean search, relying on keywords drawn from the thesaurus. In the next section, we will explore how search can be further improved by utilizing syntactic relationships in simple natural language expressions rather than boolean queries.
We have also used the objects in this system to represent and store documents [14]. Our current WWW site (http://hammock.ifas.ufl.edu) consists of HTML files which were generated from our database. In the future we plan to use the database directly as a server, thus capitalizing on its searching capabilities to provide better access to the information.
Our current efforts are aimed at automatically generating a thesaurus from existing sources, and at developing searching techniques over this thesaurus. Existing sources include the Library of Congress Subject Headings and the Agricola Subject Category Codes from the National Agricultural Library. Since these thesauri already contain a great deal of information on associations among concepts, it is possible to import them automatically into the SDM.
[1] Aliweb. http://www.cs.indiana.edu/aliweb/search
[2] Archie. ftp://sifon.cc.mcgill.ca/pub/Network/archie
[3] Fish Search. http://www.win.tue.nl/help/help-on-fish-all.html
[4] Joel's Hierarchical Subject Index. http://www.cen.uiuc.edu/~jj9544
[5] Lycos. http://fuzine.mt.cs.cmu.edu/cgi-bin/pursuit
[6] Veronica. ftp://sunic.sunet.se/old-gopher-data/veronica
[7] Virtual Tourist. http://wings.buffalo.edu/world
[8] WAIS. http://info.cern.ch/hypertext/Products/WAIS/Overview.html
[9] WebCrawler. http://www.biotech.washington.edu/WebCrawler/WebCrawler.html
[10] The WWW Virtual Library. http://info.cern.ch/hypertext/DataSources/bySubject/Overview.html
[11] Anwar, T. M., H. W. Beck, and S. B. Navathe. 1992. Knowledge Mining by Imprecise Querying: A Classification-Based Approach. IEEE Eighth International Conference on Data Engineering. Phoenix, Arizona.
[12] Beck, H. W., S. K. Gala, and S. B. Navathe. 1989. Classification as a query processing technique in the CANDIDE semantic data model. Proc. Fifth International Conference on Data Engineering. IEEE. Los Angeles, CA. pp. 572-581.
[13] Beck, H. W. 1991. Language acquisition from cases. Proceedings of the DARPA Case-Based Reasoning Workshop. Morgan Kaufmann, Inc. San Mateo, CA. pp. 159-169.
[14] Beck, H. W. and D. Watson. 1992. Analysis of Agricultural Extension Documents. AI Applications 6(3):17-28.
[15] Beck, H. W., T. Anwar, and S. Navathe. 1994. A conceptual clustering algorithm for database schema design. IEEE Transactions on Knowledge and Data Engineering 6(3):396-411.
[16] Beck, H. W., P. Jones, and D. G. Watson. 1994. A CD-ROM Based Agricultural Information Retrieval System. Applied Engineering in Agriculture 10(1):127-132.
[17] Borgida, A., R. Brachman, D. McGuinness, and L. Resnick. 1989. CLASSIC: A Structural Data Model for Objects. Proc. ACM SIGMOD International Conference on Management of Data. J. Clifford, B. Lindsay, and D. Maier (eds.). ACM Press. New York, NY. pp. 58-67.
[18] Elmasri, R., J. Weeldreyer, and A. Hevner. 1985. The Category Concept: An Extension to the Entity-Relationship Model. International Journal on Data and Knowledge Engineering 1(1).
[19] Ginsberg, A. 1993. A Unified Approach to Automatic Indexing and Information Retrieval. IEEE Expert, pp. 46-56.
[20] Jacobs, P., and L. Rau. 1990. SCISOR: Extracting Information from On-line News. Communications of the ACM 33(11):88-97.
[21] Kolodner, J. 1990. An introduction to case-based reasoning. Tech. Rep. No. GIT-ICS-90/19. School of Info. and Comp. Science, Georgia Inst. of Technology. Atlanta, GA.
[22] Salton, G. and M. McGill. 1983. Introduction to Modern Information Retrieval. New York: McGraw-Hill.
[23] Su, S. Y. W., V. Krishnamurthy, and H. Lam. 1988. An object-oriented semantic association model (OSAM*). AI in Industrial Engineering and Manufacturing: Theoretical Issues and Applications. S. Kumar, A. L. Soyster, and R. L. Kashyap (eds.). American Institute of Industrial Engineers.
[24] Sundheim, B. (ed.) 1992. Proc. Fourth Message Understanding Conference. Morgan Kaufmann. San Mateo, CA.

Send Correspondence To:
Howard W. Beck
Agricultural Engineering Department
12 Rogers Hall
University of Florida
Gainesville, FL 32611
hwb@cis.ufl.edu