Our Experience in Using the Approach

We have tested the proposed techniques in several domains. In (Chen and Lynch, IEEE SMC, 1992) [4], we generated a Russian computing concept space using an asymmetric similarity function that we had developed. Using the indexes extracted from about 40,000 documents (200 MB) and several weeks of CPU time on a small VAX running VMS, we were able to generate a robust, domain-specific Russian computing thesaurus that contained about 20,000 concepts (countries, institutions, researchers' names, and subject areas) and 280,000 weighted relationships. In a concept-association experiment, the system-generated thesaurus outperformed four human experts in recalling relevant Russian computing concepts.
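To illustrate what "asymmetric" means here, the following is a minimal sketch of a co-occurrence weight in which the link from term a to term b differs from the link from b to a. It is not the similarity function of [4]; the formula (documents containing both terms divided by documents containing the source term), the function name `asymmetric_weights`, and the toy documents are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import permutations


def asymmetric_weights(doc_terms):
    """Return {(a, b): weight} for every ordered pair of co-occurring terms.

    doc_terms is a list of sets, one set of index terms per document.
    Assumed illustrative formula: weight(a -> b) = df(a, b) / df(a),
    where df counts documents. This is a simplification, not the
    function published in [4].
    """
    doc_freq = defaultdict(int)   # number of documents containing a term
    co_freq = defaultdict(int)    # number of documents containing both terms
    for terms in doc_terms:
        for term in terms:
            doc_freq[term] += 1
        for a, b in permutations(terms, 2):
            co_freq[(a, b)] += 1
    return {(a, b): n / doc_freq[a] for (a, b), n in co_freq.items()}


# Toy index terms, invented purely for illustration.
docs = [
    {"compiler", "institute_A"},
    {"compiler", "parsing"},
    {"compiler", "parsing", "institute_A"},
    {"parsing"},
]
weights = asymmetric_weights(docs)
print(weights[("institute_A", "compiler")])  # every institute_A document mentions compiler
print(weights[("compiler", "institute_A")])  # only some compiler documents mention institute_A
```

Under this assumed formula, a narrow term points strongly to a broad term that always accompanies it, while the broad term points back only weakly, which is the kind of directional relationship a symmetric measure cannot express.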

In (Chen, Hsu, Orwig, Hoopes, and Nunamaker, CACM, 1994) [3] and (Chen, IEEE Computer, 1994) [2], we tested selected algorithms in an electronic meeting environment where electronic brainstorming (EBS) comments caused information overload and idea convergence problems. By extracting concepts from individual EBS comments, linking similar vocabularies together, and clustering related concepts, we were able to help meeting participants generate a consensus list of important topics from a large number of diverse EBS comments. In an experiment involving four human meeting facilitators, we found that our system performed at the same level as two of the facilitators in both concept recall and concept precision (two measures similar to the conventional document recall and precision). Our system, which ran on either a DECstation or a 486, accomplished the concept categorization task in significantly less time and was able to trace the comments that supported the resulting topics.
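As a rough illustration of the two measures, the sketch below computes concept recall and concept precision as set-overlap analogs of document recall and precision. The exact definitions used in the experiment are given in [3]; the function name `concept_recall_precision`, the exact-match comparison of concept labels, and the sample topic lists are assumptions.

```python
def concept_recall_precision(system_concepts, reference_concepts):
    """Return (recall, precision) for a system-generated concept list.

    Assumed definitions, by analogy with document retrieval:
      recall    = matched concepts / concepts in the reference list
      precision = matched concepts / concepts in the system list
    Concepts are compared by exact label match, a simplification.
    """
    system = set(system_concepts)
    reference = set(reference_concepts)
    matched = system & reference
    recall = len(matched) / len(reference) if reference else 0.0
    precision = len(matched) / len(system) if system else 0.0
    return recall, precision


# Hypothetical topic lists, invented for illustration only.
system_list = ["topic_a", "topic_b", "topic_c", "topic_d"]
reference_list = ["topic_b", "topic_c", "topic_e"]
recall, precision = concept_recall_precision(system_list, reference_list)
print(recall, precision)
```

In this toy case two of the three reference topics are recovered and two of the four system topics are relevant.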

In a recent NSF-funded project, we built a worm (C. elegans) concept space using the literature stored in the Worm Community System (WCS) (Chen, Schatz, Yim, and Fye, JASIS, 1994) [7]. Our algorithms were implemented in ANSI C and ran on both Sun SPARCstations and DECstations. It took about 4 hours of CPU time to analyze 5,000 worm abstracts, and the resulting worm thesaurus contained 798 gene names, 2,709 researchers' names, and 4,302 subject descriptors. We tested the worm thesaurus in an experiment with six worm biologists of varying degrees of expertise and background. The experiment showed that the worm thesaurus was an excellent "memory-jogging" device and that it supported learning and serendipitous browsing. The thesaurus was useful in suggesting relevant concepts for the searchers' queries, and it helped improve search recall. The worm thesaurus was later incorporated as a concept exploration and search aid for the WCS.

As an extension of the worm thesaurus project and in an attempt to examine the vocabulary problem across different biology domains, we recently generated a fly thesaurus using 6,000 abstracts extracted from Medline and Biosis and literature from FlyBase, a database currently in use by molecular biologists in the Drosophila melanogaster research community. The resulting fly thesaurus included about 18,000 terms (researchers' names, gene names, function names, and subject descriptors) and their weighted relationships. In a similar evaluation experiment involving six fly researchers at the University of Arizona, we confirmed the findings of the worm experiment. The fly thesaurus was found to be a useful tool for suggesting relevant concepts and for helping searchers articulate their queries.

Our initial comparison of the fly and worm thesauri revealed a significant overlap of common vocabularies across the two domains. However, each thesaurus maintains its unique organism-specific functions, structures, proteins, and so on. A manual tracing of fly-specific concepts and relevant links often leads to relevant, worm-specific concepts, and vice versa. We believe that by intersecting concepts derived from the two domain-specific concept spaces and by providing AI search methods, we will be able to bridge the vocabulary differences between a searcher's (e.g., a fly biologist's) domain and the target database's (e.g., the worm database's) subject area. We are in the process of testing and fine-tuning several search algorithms (Chen and Ng, JASIS, 1994) [6], and we also plan to expand our subject coverage to other model organisms, including E. coli, yeast, rat, and human, in the near future. (Readers are encouraged to access the URL listed at the end of the paper for more information.)
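The bridging idea can be pictured with a small sketch: start from a term in the searcher's concept space, follow weighted links to terms shared by both spaces, and then follow the target space's links to target-specific terms. This two-hop traversal is a deliberately simple stand-in, not one of the search algorithms of [6]; the function name `bridge_concepts`, the product scoring, the `min_weight` threshold, and the placeholder terms and weights are all assumptions.

```python
def bridge_concepts(query, source_space, target_space, min_weight=0.3):
    """Suggest target-specific terms reachable from a source-space query.

    Each space maps a term to {neighbor: weight}. A path is
    query -> shared term (source links) -> target-only term (target links).
    Paths are scored by the product of the two link weights (an assumption).
    """
    shared = set(source_space) & set(target_space)  # intersecting concepts
    scores = {}
    for mid, w1 in source_space.get(query, {}).items():
        if mid not in shared or w1 < min_weight:
            continue
        for term, w2 in target_space[mid].items():
            if term in source_space or w2 < min_weight:
                continue  # keep only terms specific to the target space
            scores[term] = max(scores.get(term, 0.0), w1 * w2)
    return sorted(scores.items(), key=lambda item: -item[1])


# Placeholder concept spaces; terms and weights are invented.
fly_space = {
    "fly_gene_X": {"cell signaling": 0.8, "fly_structure_W": 0.6},
    "cell signaling": {"fly_gene_X": 0.5},
    "fly_structure_W": {"fly_gene_X": 0.4},
}
worm_space = {
    "cell signaling": {"worm_gene_Y": 0.5, "worm_structure_Z": 0.4},
    "worm_gene_Y": {"cell signaling": 0.7},
    "worm_structure_Z": {"cell signaling": 0.3},
}
suggestions = bridge_concepts("fly_gene_X", fly_space, worm_space)
print([term for term, _ in suggestions])
```

Here the shared term "cell signaling" is the only bridge, so a fly-specific query surfaces the two worm-specific terms linked to it, ranked by path score.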

hchen@bpa.arizona.edu / bshatz@ncsa.uiuc.edu