To alleviate information overload and the vocabulary problem in information retrieval, researchers in human-computer interaction and information science have suggested expanding the vocabularies used to describe objects and linking vocabularies of similar meaning. For example, Furnas et al. (1987) [8] proposed ``unlimited aliasing," which creates multiple identities for the same underlying object. In information science, Bates (1986) [1] proposed using a domain-specific dictionary to expand user vocabularies in order to allow users to ``dock" onto the system more easily. The general idea of creating rich vocabularies and linking similar ones together is sound, and its usefulness has been verified in previous research and in many real-life information retrieval environments (e.g., reference librarians often consult a domain-specific thesaurus to help users in online subject search). However, the bottleneck for such techniques is often the manual process of creating vocabularies (aliases) and linking similar or synonymous ones. For example, the effort involved in creating an up-to-date, complete, and subject-specific thesaurus is often overwhelming, and the resulting thesaurus may quickly become obsolete for lack of consistent maintenance.
Based on our experiences in dealing with several business, intelligence, and scientific textual database applications, we have developed an algorithmic and automatic approach to creating a vocabulary-rich dictionary/thesaurus, which we call the concept space. In our design, we generate such a concept space by first extracting concepts (terms) automatically from the texts in the domain-specific databases. Similar concepts are then linked together using several elaborate versions of co-occurrence analysis of concepts in texts. Finally, by generating concept spaces of different (but somewhat related) domains, intersecting their common concepts, and providing graph traversal algorithms that lead from concepts in a searcher's domain (queries expressed in his/her own vocabulary) to concepts in the target database domain, the concept space approach allows a searcher to explore a large information space effortlessly and ``intelligently." We present the blueprint of this approach below.
A. Concept Identification:
The first task for concept space generation is to identify the vocabularies used in the textual documents. AI-based Natural Language Processing (NLP) techniques have been used for generating detailed, unambiguous internal representations of English statements. However, such techniques are either too computationally intensive or too domain-dependent, and are thus inappropriate for identifying content descriptors (terms, vocabularies) from texts in diverse domains. An alternative method for concept identification that is simple and domain-independent is the automatic indexing method, often used in information science for indexing literature. Automatic indexing typically includes dictionary look-up, stop-wording, word stemming, and term-phrase formation. Another technique (often called ``object filtering") which could supplement the automatic indexing technique involves using existing domain-specific keyword lists (e.g., a list of company names, gene names, researchers' names, etc.) to help identify specific vocabularies used in texts.
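The automatic indexing steps above can be sketched as follows. This is an illustrative simplification, not the indexing procedure used in our system: the stop-word list and suffix rules are placeholder assumptions (a real implementation would use a full stop-word dictionary and a stemming algorithm such as Porter's), and phrase formation here simply joins adjacent non-stop words.

```python
import re

# Placeholder stop-word list and suffix rules -- illustration only.
STOP_WORDS = {"the", "of", "in", "for", "and", "a", "to", "is", "are"}
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word):
    """Strip one common suffix (a crude stand-in for a real stemmer)."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def index_terms(text, max_phrase_len=3):
    """Tokenize, stop-word, stem, and form adjacent-word term phrases.

    Phrase formation breaks at stop words, as indexing systems
    typically do, so phrases never span a removed stop word.
    """
    runs, current = [], []
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in STOP_WORDS:
            if current:
                runs.append(current)
            current = []
        else:
            current.append(stem(word))
    if current:
        runs.append(current)

    terms = set()
    for run in runs:
        for i in range(len(run)):
            for n in range(1, max_phrase_len + 1):
                if i + n <= len(run):
                    terms.add(" ".join(run[i : i + n]))
    return terms
```

For instance, `index_terms("indexing of the gene sequences")` yields single-word terms plus the two-word phrase formed from the adjacent content words, while the stop words are discarded.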
B. Concept Space Generation:
While automatic indexing and object filtering identify the vocabularies used in texts, the relative importance of each term for representing concepts in a document may vary. That is, some of the vocabularies used may be more important than others in conveying meaning. The vector space model in information retrieval associates with each term a weight representing its descriptive power (a measure of importance). Based on cluster analysis techniques, the vector space model can be extended for concept space generation, where the main objective is to convert raw data (i.e., terms and weights) into a matrix of ``similarity" measures between any pair of terms. The similarity measure computation is based mainly on the probabilities of terms co-occurring in the texts. The probabilistic weights between two terms indicate their strength of relevance or association. We have developed several versions of similarity functions that consider the unique characteristics of individual terms, such as: the position of a term (e.g., a term in the title vs. in the abstract), the number of words in a term, the appearance date of the term (i.e., the publication year of the document that produced the term), and the identifiable type of the term (e.g., a person's name, a subject descriptor, a gene name, a company's name, etc.).
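To make the co-occurrence computation concrete, the sketch below shows one simple instance of an asymmetric co-occurrence similarity: the weighted fraction of a term's occurrences in which a second term also appears. This is a minimal illustration under assumed defaults (uniform weights unless supplied), not the full family of similarity functions described above, which additionally factor in term position, length, date, and type.

```python
from collections import defaultdict


def cooccurrence_similarity(doc_terms, weights=None):
    """Compute an asymmetric term-term similarity from co-occurrences.

    doc_terms: list of term sets, one set per document.
    weights:   optional dict mapping term -> descriptive-power weight
               (e.g., higher for title terms); defaults to 1.0.
    Returns {(t1, t2): sim}, where sim is the weighted fraction of
    t1's occurrences that co-occur with t2 -- note sim(t1, t2) and
    sim(t2, t1) generally differ, so the resulting network is directed.
    """
    weights = weights or {}
    occur = defaultdict(float)   # total weight of each term
    co = defaultdict(float)      # co-occurrence weight of each pair
    for terms in doc_terms:
        for t in terms:
            occur[t] += weights.get(t, 1.0)
        for t1 in terms:
            for t2 in terms:
                if t1 != t2:
                    co[(t1, t2)] += min(weights.get(t1, 1.0),
                                        weights.get(t2, 1.0))
    return {(t1, t2): c / occur[t1] for (t1, t2), c in co.items()}
```

Because the measure is normalized by the first term's total weight, a rare term that always appears alongside a common one links strongly to it, while the reverse link is weaker; this asymmetry is what lets the resulting network suggest more specific terms from broader ones.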
The proposed concept space generation techniques aim to defeat the vocabulary (difference) problem by identifying ``vocabulary similarity" automatically. The output of the cluster analysis process can be perceived as an inter-connected, weighted network of terms (vocabularies), with each link representing the degree of similarity between two terms.
C. Intersecting and Traversing Multiple Concept Spaces:
A fundamental problem in information retrieval is to link the vocabularies used by a searcher (those he/she feels most comfortable and natural using to express his/her information need) with the vocabularies used by a system (i.e., the indexes of the underlying database). We propose to create a target concept space from the texts of the underlying database (e.g., a C. elegans worm database) and another concept space from texts representative of the searcher's reference discipline (e.g., the human genome), and then to intersect and traverse the two concept spaces algorithmically. We believe this will allow us to create a knowledgeable online search aide capable of bridging the vocabulary difference between a searcher and a target database, thereby helping alleviate the information overload problem in a large information space. We have tested a serial branch-and-bound search algorithm and a parallel Hopfield-like neural network algorithm for multiple-thesauri consultation in previous research (Chen, Lynch, Basu, and Ng, IEEE Expert, 1993) [5]. The initial results were promising.
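The traversal step can be sketched as a simple spreading-activation pass over the merged, weighted term network. This is only a minimal illustration in the spirit of the Hopfield-like algorithm cited above: the decay, threshold, and iteration parameters are assumptions for the sketch, and the published algorithm's sigmoidal transfer function and convergence criteria are simplified away.

```python
def spreading_activation(links, seed_terms, decay=0.5,
                         threshold=0.1, max_iters=5):
    """Propagate activation from a searcher's terms through a merged
    concept-space network.

    links:      {term: {neighbor: similarity}} -- the directed,
                weighted network produced by concept space generation.
    seed_terms: the searcher's own vocabulary, activated at 1.0.
    Returns (term, activation) pairs ranked by activation level.
    """
    activation = {t: 1.0 for t in seed_terms}
    for _ in range(max_iters):
        new = dict(activation)
        for term, level in activation.items():
            for neighbor, sim in links.get(term, {}).items():
                spread = level * sim * decay
                if spread > threshold:  # prune weak activations
                    new[neighbor] = max(new.get(neighbor, 0.0), spread)
        if new == activation:  # network has settled
            break
        activation = new
    return sorted(activation.items(), key=lambda kv: -kv[1])
```

Starting from a term in the searcher's discipline, activation flows across intersected concepts into the target database's vocabulary, so terms several links away can still surface if the intervening similarities are strong enough.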
In conclusion, by acquiring vocabularies from texts directly (instead of from human experts), either in incremental mode or by periodic batch processing, and by creating concept spaces for the target databases and other related subject disciplines (i.e., pre-processing selected source textual documents), a system will be able to help searchers articulate their needs and retrieve semantically (conceptually) relevant information more effortlessly and effectively.