Introduction

Despite the usefulness of database technologies, users of online information retrieval systems are often overwhelmed by the amount of current information, the subject and system knowledge required to access it, and the constant influx of new information. The result is termed "information overload." A second difficulty associated with information retrieval and information sharing is the classical "vocabulary problem," which is a consequence of the diversity of expertise and backgrounds among system users. Previous research in information science and in human-computer interaction has shown that people tend to use different terms (vocabularies) to describe a similar concept: the chance of two people using the same term to describe an object or concept is less than 20%. The "fluidity" of concepts and vocabularies, especially in the scientific and engineering domains, further complicates retrieval. A scientific or engineering concept may be perceived differently by different researchers, and it may also convey different meanings at different times. To address "information overload" and the "vocabulary problem" in a large information space used by searchers of varying backgrounds, a more "intelligent" and proactive search aid is needed.

The problems of information overload and vocabulary difference have become more pressing with the emergence of increasingly popular Internet resource discovery services. Retrieval difficulties, we believe, will worsen as the amount of online information increases at an accelerating pace under the National Information Infrastructure. Although network protocols and software such as Mosaic and WAIS make it significantly easier to import online information sources, their use brings a corresponding problem: users are unable to explore and find what they want in an enormous information space.

The main information retrieval mechanisms provided by the prevailing resource discovery software and other information retrieval systems are based on either "keyword search" (inverted index or full text) or "user browsing." Keyword search often yields low precision and poor recall because of the limitations of controlled-language-based interfaces (the vocabulary problem) and the inability of searchers to fully articulate their needs. Browsing, in turn, allows users to explore only a very small portion of a large and unfamiliar information space, which was constructed in the first place according to the system designer's view of the world. A large information space organized for hypertext-like browsing can also confuse and disorient its user (the "embedded digression problem"), and it can cause the user to spend a great deal of time while learning nothing specific (the "art museum phenomenon"). This research aims to provide a semantic, concept-based retrieval option that could supplement existing information retrieval options.

Our proposed approach is based on textual analysis of a large corpus of domain-specific documents in order to generate a large set of subject vocabularies. By applying cluster analysis techniques to the co-occurrence probabilities of the subject vocabularies, we can build a similarity (relevance) matrix of vocabularies that represents the important concepts and their weighted "relevance" relationships in the subject domain. The result is a network of concepts, which we refer to as the "concept space" for the subject domain (to distinguish it from its underlying "information space"). On top of this network, we propose to develop general AI-based graph traversal algorithms (e.g., serial, optimal branch-and-bound search algorithms or parallel, Hopfield-net-like algorithms) and graph matching algorithms (for intersecting concept spaces in related domains) that automatically translate a searcher's preferred vocabularies into a set of the most semantically relevant terms in the database's underlying subject domain. By providing a more understandable, system-generated, semantics-rich concept space as an abstraction of the enormously complex information space, together with algorithms that assist in traversing the concept and information spaces, we believe we can greatly alleviate both information overload and the vocabulary problem.
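To make the idea concrete, the following minimal sketch illustrates the two steps described above on a toy corpus: estimating asymmetric co-occurrence weights between terms, and then traversing the resulting network from a searcher's term to suggest related terms. Everything in it is an assumption made for illustration. The function names (cooccurrence_weights, spread_activation), the weight estimate df(a, b) / df(a), the toy documents, and the threshold are hypothetical, and the traversal is a simple max-product spreading activation used as a stand-in; it is not the branch-and-bound or Hopfield-net-like algorithm reviewed later in this paper.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_weights(docs):
    """Estimate asymmetric term-to-term weights from a list of term sets.

    Assumed estimate: weights[a][b] = df(a, b) / df(a), i.e. the fraction
    of documents containing term a that also contain term b.
    """
    df = defaultdict(int)       # number of documents containing each term
    pair_df = defaultdict(int)  # number of documents containing both terms
    for terms in docs:
        for term in terms:
            df[term] += 1
        for a, b in combinations(sorted(terms), 2):
            pair_df[(a, b)] += 1

    weights = defaultdict(dict)
    for (a, b), n in pair_df.items():
        weights[a][b] = n / df[a]
        weights[b][a] = n / df[b]
    return weights


def spread_activation(weights, seeds, steps=3, threshold=0.2):
    """Suggest related terms by max-product spreading activation.

    Seed terms start at activation 1.0. In each step, every active term
    offers activation * weight to its neighbors, and each neighbor keeps
    the largest value it has been offered so far.
    """
    activation = {term: 1.0 for term in seeds}
    for _ in range(steps):
        updated = dict(activation)
        for term, level in activation.items():
            for neighbor, w in weights.get(term, {}).items():
                updated[neighbor] = max(updated.get(neighbor, 0.0), level * w)
        if updated == activation:  # nothing changed, so stop early
            break
        activation = updated

    suggestions = [
        (term, level)
        for term, level in activation.items()
        if level >= threshold and term not in seeds
    ]
    # Strongest suggestions first; ties broken alphabetically.
    return sorted(suggestions, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical toy corpus: each document is reduced to a set of terms.
    toy_docs = [
        {"retrieval", "indexing"},
        {"retrieval", "indexing", "thesaurus"},
        {"thesaurus", "vocabulary"},
        {"retrieval", "query"},
    ]
    toy_weights = cooccurrence_weights(toy_docs)
    for term, level in spread_activation(toy_weights, {"vocabulary"}):
        print(f"{term}: {level:.2f}")
```

The weights are deliberately asymmetric: a rare term that always appears alongside a common one points strongly to it, while the common term points only weakly back. The activation threshold is what keeps the suggested term set small, which is the sense in which a concept space can act as a compact abstraction of the much larger information space.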

In this paper, we first review our concept space approach and the associated algorithms in Section 2. In Section 3, we present our experience in using this approach. In Section 4, we review our research plan for building a semantics-rich Interspace for a multimillion-dollar digital library project (1994-1998) recently awarded by NSF/ARPA/NASA to the University of Illinois. In particular, we discuss the planned semantic retrieval and user customization capabilities for the next-generation NCSA Mosaic.


hchen@bpa.arizona.edu / bshatz@ncsa.uiuc.edu