Research Plan for the NCSA Mosaic Digital Library Project

In this section, we review the recently awarded Illinois digital library project and our research plan relating to semantic retrieval and user customization.

A. The Illinois Digital Library Project: An Overview

In the world of the near future, the Internet of today will evolve into the Interspace of tomorrow. The international network will evolve from distributed nodes supporting file transfer to distributed information sources supporting object interaction. Users will browse the Net by searching digital libraries and navigating relationship links, as well as share new information within the Net by composing new objects and links.

The Illinois digital library project (PI: B. Schatz) includes two concurrent and complementary activities that will accelerate progress towards building the Interspace. These together construct a model large-scale digital library and investigate how it can scale up to the National Information Infrastructure.

Construction of a digital library testbed for a major university engineering community, in which a large digital collection of interlinked documents and databases will be maintained, software to browse and share within this library developed, and usage patterns of thousands of users spread across the Net evaluated.
Investigation of fundamental research issues in information systems, information science, computer science, sociology and economics that will address the scalable organization of a large digital collection to provide transparent access for a broad spectrum of users across national networks.

The testbed centers around the new Grainger Engineering Library Information Center at the University of Illinois at Urbana-Champaign (UIUC). The $26M Center is intended as a showcase for state-of-the-art digital libraries and electronic information distribution. Construction of this national digital library testbed is possible through the active participation of two major institutions at UIUC, the University Library and the National Center for Supercomputing Applications (NCSA).

The digital library itself will be centered around a collection of engineering journals and magazines, obtained through collaboration with a range of major professional and commercial publishers. The intention is to attract a broad range of usage from a broad range of users. All documents will be structured and complete, that is, encoded in SGML and containing all pictorial material. The documents will include general engineering magazines (e.g., computer science from IEEE), specific engineering journals (e.g., aerospace engineering from AIAA), and specific scientific journals (e.g., physics from APS). Finally, articles from commercial engineering publishers (e.g., Wiley & Sons) will be collected for users in our economics (charging) study.

We plan to gather a significant new digital collection of structured documents in the engineering literature and combine this with existing sources available from our front end (Mosaic) and back end (BRS) software (discussed below). For example, these full-text materials will be integrated into an expanded on-line catalog including access to major periodical indexes in science and engineering (Current Contents, Engineering Compendex, INSPEC) which will be linked to SGML documents. Collections on the Internet will also be made transparently available, e.g., the physics preprints at Los Alamos, the Unified Computer Science Technical Reports at Indiana University, and the international collection of on-line library catalogs.

The testbed software will go through two primary phases within the proposal period (September, 1994-August, 1995). The goal of version 1 is to build on our substantial existing resources to produce a functional digital library with a large collection used by a substantial user population. Concurrently during this period, the technology research will be developing significant new functionality (semantic retrieval and customized retrieval, described below) and the sociology research will be observing the usage patterns of the existing functionality. Together, these efforts will enable us to develop and deploy scalable digital library technology on a national testbed. The goal of version 2 is to demonstrate the advanced technical feasibility of a fully functional Interspace system.

The version 1 software will evolve from two of our existing projects. The first is the existing information retrieval system in the current Grainger Library developed by co-PI Mischo. This is based on a PC front end to a full-text retrieval search engine from the major commercial vendor BRS. The second is the NCSA Mosaic software developed under the supervision of co-PI Hardin, which will become the front end to this search engine. In essence, version 1 will exhibit the browsing and searching capabilities currently available on several Mosaic-based servers, e.g., EiNet Galaxy, the World Wide Web Worm, the JumpStation, NorthStar, and so on. However, the Mosaic-BRS software will allow access to a large collection of well-formatted and recent engineering literature.

This paper focuses on the proposed Information Science research, which centers around semantic retrieval and user customization, supervised by co-PI Chen. Semantic retrieval supports a higher level of abstraction in user search, which can help overcome the vocabulary problem in information retrieval. Rather than searching for words within the object space, the search is for terms within a concept space. Co-occurrence graphs seem to provide good suggestive power in specialized domains, such as biology. The research questions revolve around their effectiveness in the more general engineering domains. Using the same sort of techniques, it is possible to infer terms of interest to the users from the objects that have been retrieved. These techniques will be used to provide a form of customized retrieval, where a user profile consisting of terms and demographics specified by the users orients the search matching towards more preferred objects. In this project, semantic retrieval and user customization will be used to supplement the full-text search and browsing in the testbed. Research plans for semantic retrieval and user customization are presented below.
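The idea of searching a concept space rather than the object space can be illustrated with a minimal sketch. It assumes a simple document-level co-occurrence count with an asymmetric weight (the fraction of documents containing one term that also contain the other); the weighting actually used in the project is not specified in this section, and the function names `build_concept_space` and `suggest_terms` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_concept_space(docs):
    """Build a weighted co-occurrence graph from documents given as term lists.

    Assumption for illustration: the weight from term a to term b is the
    fraction of documents containing a that also contain b.
    """
    doc_freq = defaultdict(int)
    co_freq = defaultdict(lambda: defaultdict(int))
    for terms in docs:
        unique = sorted(set(terms))
        for t in unique:
            doc_freq[t] += 1
        for a, b in combinations(unique, 2):
            co_freq[a][b] += 1
            co_freq[b][a] += 1
    space = {}
    for a, neighbors in co_freq.items():
        space[a] = {b: n / doc_freq[a] for b, n in neighbors.items()}
    return space


def suggest_terms(space, term, k=3):
    """Return up to k terms most strongly associated with `term`."""
    neighbors = space.get(term, {})
    # Sort by descending weight, breaking ties alphabetically.
    return sorted(neighbors, key=lambda b: (-neighbors[b], b))[:k]


docs = [
    ["gene", "protein", "sequence"],
    ["gene", "dna", "sequence"],
    ["protein", "enzyme"],
    ["gene", "dna"],
]
space = build_concept_space(docs)
print(suggest_terms(space, "gene", k=2))
```

A searcher who enters one term can then be shown its strongest neighbors as candidate search terms, which is one way such a graph could address the vocabulary problem.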

B. Semantic Retrieval

Based on our extensive experience in creating domain-specific concept spaces and supporting semantic, concept-based retrieval, we have found that the proposed techniques are robust and domain-independent and have shown great promise for supporting information retrieval in a large information space. We believe we are ready to employ the techniques experimentally on a larger and more general testbed collection and with a more diverse user population. Several of the proposed algorithms will be parallelized and implemented more efficiently on the NCSA machines (e.g., CM-5, SGI's Power Challenge, and Convex's Exemplar).

Our information science research plan will be based on an incremental and scale-up approach, starting from a few selected, focused scientific communities including molecular biology and physics and proceeding to testing in other general engineering and popular science domains. The research effort will coincide with the testbed collection process.

In the more restricted areas of molecular biology and selected physics domains, we will be able to evaluate the concept spaces generated in detailed, controlled experiments. The effects of including concept spaces and the semantic retrieval functionalities in the UIUC digital library environment will be studied through ethnography and user surveys of a larger user population.

Several crucial research questions in the context of large-scale digital libraries will be addressed in this project. First, we need to examine the feasibility of the proposed techniques in more general and diverse domains. The concept space approach has been shown to be useful in relatively restricted scientific domains and with a somewhat more uniform user population, i.e., research scientists. Will the concept spaces generated remain robust and useful for more general application domains, and can they be used by searchers of varying backgrounds (e.g., professors and school children)? Second, we need to address the scalability issue by testing our techniques' ability to support semantic retrieval in an even larger (terabyte) information space. We plan to parallelize selected algorithms with the assistance of NCSA and are already designing algorithms for incremental update and generation of concept spaces.

We believe that, with the realistic testbed collection and large user population proposed in this research, we will be able to examine critically the issues surrounding "intelligent," semantic retrieval in a genuine, large-scale digital library environment and develop scalable technology to help alleviate the information overload and vocabulary problems in information retrieval.

C. Customized Retrieval

In the digital library environment, there is a critical need to create "user models" which could aid in providing more customized information service and more focused and useful information sources and documents to individual users (e.g., a customized magazine for user X). There is also a pressing need to know the retrieval patterns of different groups of users (e.g., what types of magazines and what subject areas are of most interest to the group of double-income, professional, suburban users?). The conventional manual approach to generating a user modeling component is infeasible because of the difficulty of achieving a complete and up-to-date knowledge base. However, the availability of large amounts of regular usage information (in the search logs) and the power of selected statistical and machine learning algorithms to analyze usage patterns may be able to provide a more robust and algorithmic solution to creating user models for information retrieval (IR).

The availability of a large testbed collection and user population presents a unique opportunity to research a knowledge discovery approach to user modeling in digital libraries. After the completion of a significant portion of the testbed collection (i.e., molecular biology, physics, and engineering) and the selection and identification of the testbed users (e.g., engineering faculty and students at UIUC), we can proceed to collect (log) the usage data and statistics of a selected user group (200-500 target users) over a period of several months. Each logged search session will include information such as date searched, magazines browsed, articles retrieved, search terms used, search options selected, and so on. Each user will also be requested during their first log-in to provide detailed demographic information.
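A logged search session of the kind described above could be represented roughly as follows. This is only a sketch: the record layout, the field names, and the helper `term_usage_by_user` are assumptions made for illustration, not the project's actual log format.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import date


@dataclass
class SearchSession:
    """One logged search session (hypothetical field names)."""
    user_id: str
    searched_on: date
    magazines_browsed: list = field(default_factory=list)
    articles_retrieved: list = field(default_factory=list)
    search_terms: list = field(default_factory=list)
    search_options: dict = field(default_factory=dict)


def term_usage_by_user(sessions):
    """Count how often each user entered each search term across sessions."""
    usage = {}
    for session in sessions:
        usage.setdefault(session.user_id, Counter()).update(session.search_terms)
    return usage


sessions = [
    SearchSession("u1", date(1994, 10, 3), search_terms=["laser", "optics"]),
    SearchSession("u1", date(1994, 10, 9), search_terms=["laser"]),
    SearchSession("u2", date(1994, 10, 9), search_terms=["protein"]),
]
print(term_usage_by_user(sessions)["u1"].most_common(1))
```

Keeping each session as a separate record preserves the per-session detail needed for individual analysis while still allowing aggregation by user or by group later.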

Upon completion of usage data collection, we will proceed to analyze individual usage patterns. For example, by analyzing the articles retrieved during numerous IR sessions conducted by the same user using the concept space approach described earlier, we will be able to generate a smaller but more user-specific concept space that best represents that user's interests. Such a concept space could be invoked for future retrieval sessions (to match with the system's underlying, bigger concept space) or be stored as the user interest profile and used to selectively route relevant new information to the user. Other usage statistics such as types of magazines browsed, months/days of heaviest retrieval activity, etc. can also be used to provide more customized service to the individual searcher in the future.
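The profile-and-routing idea in this paragraph can be sketched as follows. The sketch simplifies the user-specific concept space to a weighted list of the terms that appear most often in a user's retrieved articles; the scoring rule, the threshold, and the names `build_profile` and `route` are all assumptions for illustration.

```python
from collections import Counter


def build_profile(retrieved_articles, top_n=5):
    """Derive a term-weight profile from the articles a user retrieved.

    Each article is a list of index terms. The weight of a term is the
    fraction of retrieved articles that contain it (an assumed weighting).
    """
    counts = Counter()
    for terms in retrieved_articles:
        counts.update(set(terms))
    total = len(retrieved_articles)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
    return {term: n / total for term, n in ranked}


def route(profile, new_articles, threshold=0.5):
    """Select new articles whose summed profile weight meets the threshold."""
    selected = []
    for article_id, terms in new_articles.items():
        score = sum(profile.get(t, 0.0) for t in set(terms))
        if score >= threshold:
            selected.append((article_id, score))
    # Highest-scoring articles first, ties broken by identifier.
    return sorted(selected, key=lambda pair: (-pair[1], pair[0]))


retrieved = [
    ["laser", "optics"],
    ["laser", "plasma"],
    ["laser", "optics", "fiber"],
]
profile = build_profile(retrieved, top_n=2)
incoming = {"a1": ["laser", "optics"], "a2": ["protein"], "a3": ["optics"]}
print([article_id for article_id, _ in route(profile, incoming)])
```

The same profile could equally be matched against the system's larger concept space at query time instead of being used for routing.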

Following the individual usage analysis, we will conduct a user group analysis to determine the critical information needs of different user groups. This analysis will draw on the information provided in the demographic surveys and the usage patterns shown by the entire group of users, and will apply statistical (e.g., regression and discriminant analysis) and/or machine learning (e.g., ID3) techniques. Results of such analysis may have a major impact on the practices of information providers (e.g., what types of advertisements should be placed in what kinds of magazines in order to attract the interest of targeted dual-career, suburban families?). A better profile and understanding of their main audiences could help individual information sources plan their marketing strategy.
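ID3 builds a decision tree by repeatedly splitting on the attribute with the highest information gain. The sketch below shows only that attribute-selection step, applied to made-up demographic records; the attribute names (`role`, `dept`, `prefers`) and the data are hypothetical and do not come from the study.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(records, attribute, target):
    """Reduction in entropy of `target` after splitting records on `attribute`."""
    base = entropy([r[target] for r in records])
    groups = {}
    for r in records:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return base - remainder


def best_attribute(records, attributes, target):
    """Pick the attribute ID3 would split on first."""
    return max(attributes, key=lambda a: information_gain(records, a, target))


# Hypothetical survey records joined with an observed usage preference.
records = [
    {"role": "faculty", "dept": "physics", "prefers": "journal"},
    {"role": "faculty", "dept": "ee", "prefers": "journal"},
    {"role": "student", "dept": "physics", "prefers": "magazine"},
    {"role": "student", "dept": "ee", "prefers": "magazine"},
]
print(best_attribute(records, ["role", "dept"], "prefers"))
```

In this toy data the user's role fully determines the preference while the department carries no information, so the role attribute is selected. A full ID3 implementation would recurse on each resulting subset.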

In summary, we believe the digital libraries testbed collection and users proposed to be incorporated in this research present both a unique challenge and opportunity to study semantic retrieval and user customization in digital libraries. The results will have a potential impact on the practices of electronic publishers (e.g., IEEE, McGraw-Hill) and information retrieval service providers (e.g., UIUC engineering libraries or internet resource discovery).


hchen@bpa.arizona.edu / bshatz@ncsa.uiuc.edu