Robert C. Stern,
This paper describes the OmniPort system, under development for the Defense Technical Information Center (DTIC) by the Advanced Decision Systems group of Booz·Allen & Hamilton. This first section discusses the linked problems of data access in a networked environment and information retrieval from the masses of data available. The second section describes the Minerva middleware. The third section describes the Mosaic environment being developed to provide consistent user access to the data that Minerva retrieves. Finally, a fourth section discusses future growth plans and opportunities.
1.1 Data Access
The emergence of the Internet as a common means of information exchange has led to the creation of increasingly powerful tools for locating and retrieving relevant data. These tools have included archie, veronica, gopher and the Wide Area Information Servers (WAIS) [1]. Each of these tools, however, requires the data to be organized (or re-organized, in the case of legacy data) into the format used by the access tool. This process can be expensive and time consuming, and it always carries the risk of data loss or corruption during conversion. Using WAIS makes sense when the data has not previously been made available through a searchable online system. However, when the data is already organized and accessible to local users via a modern database or textbase manager, it is wasteful to rehost the data in order to make it available on a wide area network (WAN).
While the amount of data accessible using gopher, WAIS or the World Wide Web (WWW) is increasing rapidly, the percentage of data accessible via a standard Internet access tool is small when compared to the amount of relevant data that exists in organized, searchable form on network-accessible computers. These existing legacy systems are the product of decades of information gathering and data analysis. The problem of accessing this vast store of legacy data has until recently resisted practical resolution: either users were forced to learn multiple search interfaces, or a costly, time-consuming data conversion effort was required.
Effective tools must be developed to help the end user extract meaningful information from the masses of data now or soon to be available on the networks. One such tool is Mosaic, a toolset that uses the WWW hypertext paradigm to browse the Internet. More importantly, it introduces the concept of linked analysis tools that can be used to view and, ultimately, to manipulate any data that is retrieved. The set of helper applications that can be defined for any Mosaic client forms the core of this set of linked tools. Each retrieved document is displayed in the manner that makes it most understandable and usable.
The hypertext browsing approach, however, does not solve the problem of selecting relevant documents from the many thousands of pages that can be accessed. This problem can only be addressed by a query capability designed to help the user at both ends of the query process: first by assisting in the formulation of effective queries, and then by ranking the returned documents by relevance. Such help is especially valuable to inexpert or infrequent users. Individual query systems sometimes provide these capabilities, but only on a limited scale and only for homogeneous information sources. The Internet environment requires that these capabilities operate across multiple dissimilar, geographically distributed data sources.
Figure 1 - OmniPort Communications Architecture
• Text Reference Language (TRL) is the language that encodes user queries in a uniform manner. The name is something of a misnomer, because the queries possible within the definition of TRL will retrieve any data, not just text documents. TRL comprises a superset of the search operators offered by existing text search engines. It allows users to generate queries of the maximum possible richness and power.
• Metalanguage adds a layer of support for the full range of commands and responses possible within OmniPort, of which the queries formed in TRL are just one example. Besides TRL queries, the metalanguage syntax supports the definition of requests for documents, for highlighting within a document and for drone initialization. Essentially, any command necessary for the various OmniPort processes to keep each other informed is definable in the metalanguage.
• Transport Language defines the low-level communications layer necessary for distributed processing. The transport language syntax permits the encapsulation of metalanguage commands with necessary packeting, status and routing information so that each OmniPort process can identify the messages on which it must act.
• Desktops that offer the user a GUI supporting the formation of queries, the display of results and the analysis of documents in the results set. (It should be noted that the desktop is the only part of this architecture the user actually 'sees'. The rest is effectively invisible.) The OmniPort desktop is the Mosaic interface described in Section 3.
Figure 2 - OmniPort Process Architecture from a User's Perspective
• Distributed Information Operating Environment (DIOE) that provides the transport backbone for the communication between the architectural components. The primary components of this structure are a network of dispatcher processes that manage the passing of messages (related to concepts, queries, document lists, documents, etc.) around the network of distributed resources.
• Query Augmentation Services that assist users by broadening queries. Query augmentation offers several methodologies for expanding a user's query to increase the number of relevant documents retrieved. Once the documents are retrieved, the query augmentation services provide relevance ranking to assist the user in identifying the most relevant documents.
• Drones that integrate native search capabilities by translating between OmniPort's TRL and the native search engine's query language, as well as providing other services, such as the highlighting of retrieval terms in the document.
This architecture provides a number of benefits. Among the most important is fault tolerance. The Internet (or any distributed environment) is in a constant state of flux. Sources come on-line and go away at irregular intervals. Minerva's DIOE is designed to respond appropriately to both the disappearance and the reappearance of sources. When a source goes off-line, its drone informs the associated dispatcher, which broadcasts that information to all other dispatchers. Similarly, the reappearance of a source is noted and broadcast, and that source is then included in all future transactions.
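The availability handling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Dispatcher` class, its method names and the source name are hypothetical, and the sketch assumes every dispatcher is directly linked to every other one, which the paper does not specify.

```python
class Dispatcher:
    """Hypothetical dispatcher that tracks which sources are currently on-line."""

    def __init__(self, name):
        self.name = name
        self.peers = []         # other dispatchers that receive our broadcasts
        self.available = set()  # sources this dispatcher believes are on-line

    def connect(self, other):
        """Link two dispatchers so that each receives the other's broadcasts."""
        self.peers.append(other)
        other.peers.append(self)

    def report(self, source, online):
        """Called by a drone when its source changes state; broadcast to peers."""
        self._apply(source, online)
        for peer in self.peers:
            peer._apply(source, online)

    def _apply(self, source, online):
        """Record the new state of a source locally."""
        if online:
            self.available.add(source)
        else:
            self.available.discard(source)

    def targets(self, requested):
        """Restrict a query to the requested sources that are on-line."""
        return sorted(s for s in requested if s in self.available)


a, b = Dispatcher("a"), Dispatcher("b")
a.connect(b)
a.report("wais-surviac", True)    # drone reports that the source is up
a.report("wais-surviac", False)   # source disappears; b stops routing to it
a.report("wais-surviac", True)    # source reappears and is used again
```

Keeping the availability set on every dispatcher means a query can be routed without first polling the sources, at the cost of one broadcast per state change.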
Equally important is the ease with which new sources can be included in the architecture. Drones are specific to a search engine, not to a source. If a new source with a search engine for which a drone already exists is added, it usually requires only the installation of the appropriate drone code. (This is true for text sources; incorporating a new structured database also requires the creation of a table that maps field names.)
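Because a drone is tied to a search engine rather than to a source, its core job is a syntax translation. The sketch below parses a prefix-form query of the kind shown in this paper, such as (AND "engine" "damage"), and renders it in an infix form. The infix target syntax, the function names and the tokenizer are assumptions made for illustration; the full TRL grammar is not given here.

```python
import re

# Tokens: parentheses, quoted terms, and bare operator names such as AND.
TOKEN = re.compile(r'\(|\)|"[^"]*"|[A-Za-z]+')


def parse_trl(text):
    """Parse a prefix-form query into nested lists: [operator, operand, ...]."""
    tokens = TOKEN.findall(text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            node = []
            while tokens[pos] != ")":
                node.append(read())
            pos += 1  # consume the closing parenthesis
            return node
        return tok

    return read()


def to_infix(node):
    """Render a parsed query in a hypothetical infix syntax of a native engine."""
    if isinstance(node, str):
        return node  # a quoted term is passed through unchanged
    operator, *operands = node
    return "(" + f" {operator} ".join(to_infix(o) for o in operands) + ")"


print(to_infix(parse_trl('(AND "engine" "damage")')))
# ("engine" AND "damage")
```

Supporting a second engine would mean writing another renderer in place of `to_infix`, while the parser stays the same.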
Another important benefit of Minerva is its query augmentation services. To understand the importance of a query augmentation service, it is necessary to define two measures of the quality of a retrieval: recall and precision. The recall measure compares the number of relevant documents retrieved by a search to the total number of available relevant documents. The result is expressed as a percentage. For example, if the total number of available documents that relate to a user's topic of interest is 100 and a search returns 25 of those documents, the recall is 25%. The more of the documents that should have been included actually are included, the higher the recall. The precision measure compares the number of relevant documents in a retrieved set to the total number of documents in that set. Precision is also given as a percentage. For example, if a search retrieves 100 documents and only 30 are actually relevant to the user's needs, the precision is 30%. The higher the precision, the fewer irrelevant documents are in the response set.
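The two measures reduce to simple ratios. The short sketch below reproduces the worked figures from the text; the function names are illustrative, and precision is taken as the share of the retrieved set that is relevant, which is what the 30% example implies.

```python
def recall(relevant_retrieved, relevant_available):
    """Percentage of all available relevant documents that the search returned."""
    return 100.0 * relevant_retrieved / relevant_available


def precision(relevant_retrieved, total_retrieved):
    """Percentage of the retrieved documents that are actually relevant."""
    return 100.0 * relevant_retrieved / total_retrieved


print(recall(25, 100))     # 25.0
print(precision(30, 100))  # 30.0
```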
How do these measures apply to an actual search? If a researcher is interested in engine damage, and submits a query in the form '(AND "engine" "damage")', the retrieval set will not include documents that mention "compressor stage damage" despite the fact that such documents are relevant. Thus, the search would be lower in recall to the extent that these relevant documents were missed. The search may include documents that contain the search words "engine" and "damage" but do not actually discuss engine damage. For example, a document containing a phrase such as "the engine performed well but the left wing sustained damage" would be included but not relevant. Thus, the precision would be lower.
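The effect can be reproduced on a toy collection. In the sketch below, `matches_and` is a hypothetical stand-in for a native engine's Boolean AND, the first two document texts are invented for illustration, and the third uses the phrase quoted above. Relevance is assigned by hand.

```python
import re


def matches_and(text, terms):
    """Return True when every term occurs as a whole word in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return all(term in words for term in terms)


# name -> (text, relevant to the topic "engine damage"?)
documents = {
    "doc1": ("Inspection revealed engine damage after the flight.", True),
    "doc2": ("The report describes compressor stage damage.", True),
    "doc3": ("The engine performed well but the left wing sustained damage.", False),
}

retrieved = [name for name, (text, _) in documents.items()
             if matches_and(text, ["engine", "damage"])]
relevant = [name for name, (_, is_relevant) in documents.items() if is_relevant]
relevant_retrieved = [name for name in retrieved if name in relevant]

recall = 100.0 * len(relevant_retrieved) / len(relevant)
precision = 100.0 * len(relevant_retrieved) / len(retrieved)
print(retrieved, recall, precision)
# ['doc1', 'doc3'] 50.0 50.0
```

The query misses doc2, which lowers recall, and admits doc3, which lowers precision.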
To get around this problem, OmniPort is designed to work with multiple query augmentation services. The query augmentation services expand a user's query into a broader set of related search patterns, thereby increasing the likelihood that the query will retrieve a broader set of relevant documents. The disadvantage of a broadened query is that, while it increases recall, it runs the risk of decreasing precision. There are two primary ways in which a query augmentation service can combat this potential loss of precision. First, it can broaden the query in a smart fashion, using domain knowledge to ensure that added search criteria are truly relevant to the user's goals. Second, it can rank the retrieved documents by relevance, so that users can easily identify the documents most likely to contain relevant data. Ideally, a query augmentation service should do both. Any time penalty associated with the added processing of a query augmentation service is more than made up for by the greater likelihood that the user will obtain the desired response set using fewer queries and with less time wasted browsing irrelevant documents.
A number of possible approaches can be taken to query augmentation and the subsequent relevance ranking. Most are based on a thesaurus, so that the individual words in the user's text pattern are each expanded by including synonyms for the words in the search set. Relevance ranking, in these cases, is done mainly by calculating the percentage of search terms that were found in any particular document. In developing OmniPort, Booz·Allen incorporated a sophisticated, knowledge-based query augmentation capability known as RUBRIC [3].
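For comparison with the knowledge-based approach, the thesaurus-based baseline can be sketched as follows. The thesaurus entries, function names and sample text are all invented for illustration; the ranking is the percentage of expanded search terms found in a document, as described above.

```python
import re

# Hypothetical thesaurus: each word maps to a list of synonyms.
THESAURUS = {
    "engine": ["motor", "powerplant"],
    "damage": ["failure", "harm"],
}


def expand(terms):
    """Add the synonyms of each term, keeping first-seen order without duplicates."""
    expanded = []
    for term in terms:
        for candidate in [term, *THESAURUS.get(term, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


def score(text, terms):
    """Relevance as the percentage of search terms found in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    found = sum(1 for term in terms if term in words)
    return 100.0 * found / len(terms)


terms = expand(["engine", "damage"])
print(terms)
# ['engine', 'motor', 'powerplant', 'damage', 'failure', 'harm']
print(score("The motor sustained damage and harm.", terms))
# 50.0
```

Every synonym counts equally in this scheme, which is the limitation that weighted, expert-curated evidence is meant to address.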
A domain knowledge base contains concepts organized in a semantic network. Each concept is defined by a set of attributes and their values. It may also include a set of subconcepts and optionally a set of evidence (text patterns that indicate the presence of the concept in a document). Each concept includes weightings that specify how it relates to adjacent concepts and how a set of evidence contributes to relevance assessment. The set of these concepts constitutes the knowledge base that is made available to users for the formulation of queries.
A query composed of high-level domain concepts is successively decomposed by RUBRIC into lower and lower level domain concepts using the linkages in the knowledge base. At each level, any evidence patterns that may be present are formed into individual queries, along with the relevance weighting determined by the domain experts. These queries are then broadcast to all selected information sources. Like a thesaurus, the evidence set in the knowledge base contains all the various ways that a concept is likely to be referenced in the media, such as all the different synonyms or alternate spellings of a word, or the values a particular field may contain. Unlike a thesaurus, the evidence set is highly focused by the domain experts to assure a precise retrieval.
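The decomposition step can be sketched as a walk over a small concept tree. Everything specific in this sketch is an assumption: the `Concept` structure, the sample subconcept and patterns, the weights, and in particular the rule that a pattern's weight is multiplied by the link weights on the path from the queried concept. The sketch also assumes the concepts form a tree, so it does not guard against cycles in a general semantic network.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """Hypothetical knowledge-base node with weighted evidence and subconcepts."""
    name: str
    evidence: dict = field(default_factory=dict)     # text pattern -> weight
    subconcepts: list = field(default_factory=list)  # (Concept, link weight) pairs


def decompose(concept, inherited=1.0):
    """Return (pattern, weight) queries for a concept and all of its subconcepts."""
    queries = [(pattern, inherited * weight)
               for pattern, weight in concept.evidence.items()]
    for child, link_weight in concept.subconcepts:
        queries.extend(decompose(child, inherited * link_weight))
    return queries


# Illustrative knowledge base; the subconcept and all weights are invented.
ceramic = Concept("Ceramic Armor", evidence={"ceramic armor": 0.5})
root = Concept("Composite Armor",
               evidence={"composite armor": 1.0},
               subconcepts=[(ceramic, 0.5)])

print(decompose(root))
# [('composite armor', 1.0), ('ceramic armor', 0.25)]
```

Each resulting pair corresponds to one individual query that would be broadcast to the selected information sources.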
Figure 3 - OmniPort Test Home Page
When each of the individual queries gets a response from the information sources, RUBRIC gathers each response, performing an ACCRUE function on the set of evidence weightings for each pattern that 'hits' on a particular document. An ACCRUE function calculates a relevance ranking by giving a higher score to documents with more, or better-quality, pattern matches. RUBRIC then reports the response set to the user with the appropriate relevance ranking associated with each document.
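The exact ACCRUE formula is not given here. One combination rule with the stated property, that more matches or higher-weighted matches yield a higher score, is the probabilistic sum, used below purely as an assumed stand-in. The function names, document titles and weights are hypothetical.

```python
def accrue(weights):
    """Combine evidence weights in [0, 1] with a probabilistic sum (an assumption)."""
    remaining = 1.0
    for weight in weights:
        remaining *= 1.0 - weight  # each match removes part of the remaining doubt
    return 1.0 - remaining


def rank(hits):
    """hits maps a document title to the weights of the patterns that matched it."""
    scored = [(title, accrue(weights)) for title, weights in hits.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


hits = {
    "Report A": [0.5],        # one pattern matched
    "Report B": [0.5, 0.5],   # two patterns matched
}
print(rank(hits))
# [('Report B', 0.75), ('Report A', 0.5)]
```

With this rule the score never exceeds 1, and adding a further match can only raise it.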
Figure 4 - OmniPort Concept Query Form
Selecting the "Open" button with the "Concept" radio button selected displays a form that allows the user to formulate a concept search. The user can select from a list of concepts generated in real time from the knowledge base. Once a concept has been selected, the user can choose from a list of information sources, similarly generated in real time. (All OmniPort displays except for the home page and information pages, such as the help screens, are generated from code running on the server.) Figure 4 shows the OmniPort Concept Query Form with the concept "Composite Armor" and a WAIS database of SURVIAC documents selected.
The results obtained by that query are shown in Figure 5. Twenty-two documents were found in the WAIS source that matched some or all of the search patterns associated with the concept "Composite Armor". The ACCRUE algorithm calculated a relevance ranking based on the number of patterns that matched each document and the relevance score associated with each pattern. The resulting score is displayed next to the title of each document. To retrieve a particular document, the user need only click on the title, which is a hyperlink to the document display page.
Figure 5 - OmniPort Query Results Page
[2] Directory of Department of Defense Information Analysis Centers, DTIC, Alexandria, Virginia, Aug. 1993.
[3] Richard A. Tong and Lee Appelbaum, "Conceptual Information Retrieval from Full-Text," in Proceedings RIAO-88 - User-Oriented Context-Based Text and Image Handling, MIT, Cambridge, Massachusetts, March 1988.
6.0 Authors' Biographies
Shelley G. Ford
sford@dgis.dtic.dla.mil
Ms. Ford has over 15 years of experience in both government and private industry developing information products and services for professionals. She is currently the Chief of the Information Analysis Branch, within the Research, Development, and Acquisition Support Directorate of the Defense Technical Information Center, where she serves as the project manager for OmniPort development. Ms. Ford has a degree in English and a master's degree in Library and Information Science from the University of Maryland. In addition to OmniPort, the Information Analysis Branch is creating Mosaic Home Pages to assist Department of Defense officials in locating information and to disseminate selected information to the DoD community.