A WWW Interface to the OMNIS/Myriad Literature Retrieval Engine

Alexander Clausnitzer, Fakultät für Informatik, Technische Universität München
email: clausnia@informatik.tu-muenchen.de
URL: http://sunbayer35.informatik.tu-muenchen.de/~clausnia


Pavel Vogel, Fakultät für Informatik, Technische Universität München
email: vogel@informatik.tu-muenchen.de
URL: http://www3.informatik.tu-muenchen.de/public/mitarbeiter/vogel.html


Stephan Wiesener, Bayerisches Forschungszentrum für Wissensbasierte Systeme (FORWISS), München
email: wiesener@forwiss.tu-muenchen.de
URL: http://www.forwiss.tu-muenchen.de/~wiesener/public/user.html

Abstract:
Cataloging and searching procedures in traditional library systems are expensive, time-consuming and often incomplete. OMNIS is a novel multimedia information retrieval system for the administration of documents in libraries and offices. Using the fulltext database system Myriad combined with scanning and OCR technologies, it offers indexing, archiving and searching functions at drastically reduced cost and with much higher precision. Documents may contain page images, full-length PostScript or other media data, which give the user a much better insight into the documents. At Technische Universität München a considerable number of computer science documents have been made searchable by a simple fulltext query language. To make this document retrieval system available to WWW clients, the OMNIS document access function was implemented as an OMNIS-WWW server, which is already in operation. This paper presents the essential features of OMNIS (especially its searching function) and discusses the concepts and the implementation of its WWW server.
The development of OMNIS is funded by DFG (Deutsche Forschungsgemeinschaft) and DFN-Verein (Deutsches Forschungsnetz) [6].
Keywords:
digital libraries, multimedia database systems, information retrieval, fulltext search, document archiving

1 Introduction

Human knowledge is still mostly preserved and distributed by printing it on paper. Libraries and other institutions buy books, journals, etc. and try, with moderate success, to make their contents accessible by manual indexing and classification. At about $40.00 per book, indexing and cataloging costs are high and consume the main part of the library budget. This is why most catalogs can offer only incomplete bibliographic information. Efficient use of these knowledge sources faces several problems: with manual catalogs, literature search is time-consuming and often inaccurate. Beyond this, the selected documents are not immediately available and have to be ordered and physically transported to the reader.

The library system OMNIS/Myriad [4], [7] is a tool for efficient document management. Its architecture consists of two layers: OMNIS as the library application and Myriad [20] as the underlying fulltext database system. OMNIS has been developed at the Computer Science Department of Technische Universität München and at FORWISS München (Bayerisches Forschungszentrum für Wissensbasierte Systeme).

The two basic ideas of OMNIS are:

  1. Replacing the manual indexing and classification procedure by storing those parts of a document as fulltext which contain characterizing terms and phrases (abstract, table of contents, etc.). The document retrieval can then be done by fulltext queries.
  2. Storing at least the characterizing pages as pixel images, or as PostScript if available, so that they can be offered to the user in a formatted way which may contain additional graphical information and is much more legible than plain ASCII text.
For archiving a document (or parts of it) it is scanned in as a pixel image. Then characterizing text and optional bibliographic data (title, authors, publisher, etc.) are converted by commercial OCR software into text format. The bibliographic information can easily be selected from the fulltext part via cut-and-paste and is stored into the OMNIS database together with picture and fulltext data. This way the archiving costs can be reduced to $1.50 per document.

After one year of operation at Technische Universität München (TUM) a considerable number of documents has been made searchable in OMNIS digital libraries. The following table lists some of them (as of Nov 94):

Table 1: OMNIS databases offered at TUM
-------------------------------------------------------------------------------
Library        Contents                                                # Docs
-------------------------------------------------------------------------------
ZentralBib     Central library catalog                                 195,000
InfoMathBib    Department of Computer Science library catalog,         40,800
               from Aug 93 on with abstracts, tables of contents,
               and scanned pixel images
TUBibMue       Books, journals, conference proceedings, and            125,000
               technical reports on computer science (1981-1992),
               bibliographic data with abstracts, no pixel images
TechRep        Generated from BibTeX files of computer science         209,000
               literature from over 100 universities
BayerBib       Literature on database management systems,              3,285
               knowledge-based systems, and information systems in
               general, all with abstracts, tables of contents, and
               pixel images
MayrAlphaBib   Literature on theoretical computer science, partly      ~ 30,000
               with abstracts, tables of contents, and pixel images
WiederholdBib  Literature on database management systems,              ~ 20,000
               bibliographical data partly with abstracts, tables of
               contents, and images
-------------------------------------------------------------------------------
This makes a total of more than 622,000 documents. At TUM, OMNIS has become an integral part of research activity. In November 1994, 337 different users carried out 2,156 literature search sessions. As far as we know, the OMNIS databases in Munich are the most comprehensive collection of computer science documents, and new publications are added daily.

In contrast to other literature retrieval systems, OMNIS documents may additionally contain BLOBs (Binary Large Objects) [17]. A BLOB may hold images, i.e., a few characteristic pages in some image format as described above, whole PostScript documents, or audio/video data.

To make the OMNIS literature search available to the whole science community, a WWW-based OMNIS server was developed. WWW was chosen because it is well known to the science community. It can mediate access to OMNIS without expensive distribution of special retrieval clients. WWW supports documents with images, audio and video data and offers the best user interfaces among the common wide area information retrieval systems. The URL of our OMNIS server is:

http://www3.informatik.tu-muenchen.de/cgi-bin/omnis

In chapter two an overview of the whole OMNIS library system and especially of its search facility is given. Afterwards the retrieval component is discussed in more detail. Chapter three describes the concepts of realization and the implementation of a WWW-based retrieval component. Chapter four is a short summary.

2 Structure of the OMNIS System

OMNIS has primarily been developed as an administration and retrieval tool for scientific libraries and other collections of scientific documents in paper form.

Figure 1: OMNIS system components and user groups

It consists of archiving, searching, and lending components. This partition is predefined by the three classes of system users: people archiving documents, users performing literature searches, and professional library personnel responsible for library operation. The system's workflow is shown in figure 1.

2.1 OMNIS/Myriad Architecture

OMNIS/Myriad is designed as a distributed system in a client/server architecture. Any number of OMNIS servers can be installed and each server may manage multiple document databases. A special OMNIS client offers retrieval and/or the archiving function to the user. A rough scheme of this architecture is shown in figure 2.

To realize a WWW-based literature search in OMNIS databases a special searching component has to communicate with the user's WWW clients and treat the OMNIS database servers as usual. Viewed from the side of the WWW system such a component can be called the OMNIS-WWW server. Viewed from the rest of the OMNIS system this component has to be located at the same level as the normal OMNIS searching component, i.e., as a client.

Figure 2: OMNIS client/server architecture

2.2 OMNIS Documents

An OMNIS document in the database always consists of some structure fields (i.e., bibliographic data like title, author, publisher, etc.) and a piece of text containing the characteristic expressions and phrases of the document (e.g., abstract, table of contents, etc.). The structure fields are subject to structure (relational) queries while the text is subject to (associative) fulltext queries. The contents of the structure fields are attached to the text automatically and so the words and phrases contained in the structure fields can be found by both structure and fulltext queries.

Additionally, a number of BLOBs containing media data can be attached to an OMNIS document. Frequently used are image BLOBs containing scanned-in pages. These pixel images are the basis for an OCR procedure which produces the document's fulltext mentioned above. Furthermore, PostScript BLOBs containing complete documents may be attached. The presence of at least a few pages in original format proved to be essential for the users, not only because of the graphical information they may contain but also because of their clear, convenient and familiar appearance. Audio and video BLOBs are possible too. For a library application, however, only a few original documents with audio and/or video data are available so far. The media data is stored as part of a document and can be presented to the user. BLOBs are not interpreted and thus not subject to querying.

2.3 Document Retrieval

For efficient literature search and high acceptance by end users a simple and intuitive query interface is necessary. Such a tool is the OMNIS fulltext query language. In the following some examples give a rough idea of this language:

1. gauss                                single word query
2. database machine                     phrase query
3. h_permedia%                          character wildcards
4. live cycle & (basili | cohen)        boolean operators
5. #2 & boral                           reference to a former query
6. fulltext .. database                 word wildcards
7. ?author = `Boral%'                   structure field query

The first query delivers all documents containing the word "gauss" (or "Gauss") somewhere in their fulltext. The second one similarly searches for the phrase "database machine". In the third, sixth and seventh query wildcards stand for any single characters, character sequences or single words. Fulltext queries are simple to formulate and proved to have a good precision. Query four shows the usage of boolean operators to formulate more complex needs, whereas successive query refinement is possible by referencing former requests as shown in query five. Query seven results in documents whose structure field "author" begins with "Boral".
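
The character wildcards of query three can be illustrated by a small Python sketch. It assumes that "_" stands for exactly one character and "%" for any character sequence within a word, as in SQL LIKE patterns; the exact semantics are not defined here, and the function names are hypothetical. The sketch only illustrates the matching behaviour and is not how Myriad evaluates queries.

```python
import re


def word_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a single-word pattern with character wildcards into a regex.

    Assumed semantics (not confirmed by the paper): '_' matches exactly one
    character and '%' matches any run of characters inside a word.
    """
    parts = []
    for ch in pattern:
        if ch == "_":
            parts.append(r"\w")
        elif ch == "%":
            parts.append(r"\w*")
        else:
            parts.append(re.escape(ch))
    # Matching ignores case, so "gauss" also finds "Gauss".
    return re.compile(r"\b" + "".join(parts) + r"\b", re.IGNORECASE)


def matches(pattern: str, text: str) -> bool:
    """Return True if some word of the text matches the pattern."""
    return word_pattern_to_regex(pattern).search(text) is not None


if __name__ == "__main__":
    print(matches("h_permedia%", "A Hypermedia system"))
    print(matches("gauss", "Gauss elimination"))
```

Under these assumptions the pattern of query three finds both "hypermedia" and "Hypermedial", while a plain word such as "gauss" matches only the complete word.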

3 The OMNIS-WWW Interface

In the last two years WWW (World-Wide Web) has become one of the fastest spreading Internet applications. This success may have two main reasons:

  1. The prevailing hypertext systems seem to be restricted to one local or distributed file system. Existing wide area information retrieval and access systems like FTP [19], Gopher [3], Telnet, WAIS [1], Prospero, and X.500 lack the easy navigation of hypertext. The basic WWW idea, the merging of hypertext and wide area networking, has closed a gap perceived by many users. Additionally, because WWW uses an addressing scheme that is compatible with the wide area systems above (with an additional link facility), their documents are available too.
  2. WWW clients are freely available for most existing platforms and terminal types. On terminals with a graphical user interface they provide a complete point-and-click interface of high user acceptance. The retrieved information pages may also contain images and other media data. A WWW page can also be displayed on a plain text terminal, with the only restriction that images are replaced with an alternative text.

3.1 Substantial WWW Features

WWW offers the retrieval of hypertext pages written in HTML (Hypertext Markup Language) [8] via the Hypertext Transfer Protocol (HTTP) [9]. An HTML hypertext page may contain links to any document in the WWW addressing space [10]. In addition to simple document retrieval, HTTP includes several not yet generally used features for automatic accounting, user identification, and encryption.

Figure 3: An open OMNIS document displayed by XMosaic

The WWW client sends a request to an HTTP server, which sends the requested data back to the client. Besides describing ordinary hypertext pages, HTML is able to define forms which can be filled out by the user. To use the entered data, the HTTP server has to be able to process it.

Instead of mapping a URL onto a file in the local file system of the HTTP daemon, servers like cern-http from CERN and httpd from NCSA [18] can execute applications when certain URL addresses are requested. These gateway applications are ordinary executable programs which can be written in any programming language (e.g., C, Perl, or shell scripts). The communication between HTTP server and gateway application is defined in the CGI (Common Gateway Interface) definition [18]. When it starts a gateway program, the HTTP server passes data about the client's request (e.g., origin, variables sent by the client, and requested address) in environment variables or on the gateway program's standard input. The gateway application reads and processes these data and writes its output in MIME format [11] to its standard output, from which the HTTP server reads it and returns it to the client.
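
The data flow between HTTP server and gateway program can be sketched as a minimal CGI program in Python. REQUEST_METHOD, QUERY_STRING and CONTENT_LENGTH are standard CGI environment variables; the form field name "query" and the function names are assumptions made for illustration and are not taken from the OMNIS gateway.

```python
import os
import sys
from html import escape
from urllib.parse import parse_qs


def read_request(environ, stdin):
    """Collect the form variables of one CGI request.

    GET data arrives in the QUERY_STRING environment variable; POST data
    arrives on standard input, with its length given in CONTENT_LENGTH.
    """
    if environ.get("REQUEST_METHOD", "GET").upper() == "POST":
        length = int(environ.get("CONTENT_LENGTH") or 0)
        raw = stdin.read(length)
    else:
        raw = environ.get("QUERY_STRING", "")
    # Keep only the first value of each field for simplicity.
    return {name: values[0] for name, values in parse_qs(raw).items()}


def build_response(fields):
    """Return the MIME header and HTML body written to standard output."""
    query = escape(fields.get("query", ""))  # "query" is a hypothetical field
    body = f"<html><body><p>Query received: {query}</p></body></html>\n"
    return "Content-Type: text/html\r\n\r\n" + body


if __name__ == "__main__":
    # The HTTP server reads this output and forwards it to the client.
    sys.stdout.write(build_response(read_request(os.environ, sys.stdin)))
```

The header line followed by an empty line is what allows the HTTP server to tell the MIME type of the body before passing it on.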

Improvements of HTTP and HTML for requests at library databases or for coding information like title, author, publisher, etc. are currently under discussion. A final standard is not to be expected in the near future.

3.2 Solution Alternatives

In contrast to client/server relationships in usual database systems HTTP only offers a stateless connection. No permanent connection between client and server during a complete session is established. This raises several problems for an OMNIS gateway to WWW.

With such a stateless protocol, every incoming HTTP request with database access would require a new database session to be opened and closed again after the data transfer is completed. For reasons internal to OMNIS, the login and logout procedures are time-consuming. Opening and closing a session for each single request would make efficient work with the WWW interface impossible. HTTP is expected to provide mechanisms to control virtual sessions in the future; these features, however, are not yet supported by most WWW clients.

A further problem of closing database sessions after every client request is the difficulty of re-using information which had already been accessed in a former request. The current session status must therefore be stored in hidden HTML form fields or within URL addresses.
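
How session status can travel with each page is sketched below in Python. The state keys ("db", "query_id") and the function names are hypothetical; the sketch only shows the general technique of hidden form fields and URL parameters, not the encoding actually used by the OMNIS gateway.

```python
from html import escape
from urllib.parse import parse_qs, urlencode


def hidden_fields(state: dict) -> str:
    """Render session state as hidden HTML form fields."""
    return "\n".join(
        f'<input type="hidden" name="{escape(k)}" value="{escape(str(v))}">'
        for k, v in state.items()
    )


def state_url(base_url: str, state: dict) -> str:
    """Carry the same state in the query part of a URL."""
    return base_url + "?" + urlencode(state)


def restore_state(query_string: str) -> dict:
    """Recover the state when the client sends it back with its next request."""
    return {k: v[0] for k, v in parse_qs(query_string).items()}


if __name__ == "__main__":
    state = {"db": "BayerBib", "query_id": "2"}  # hypothetical state keys
    print(hidden_fields(state))
    print(state_url("/cgi-bin/omnis", state))
```

Because the client returns the state with every request, the server side does not need to remember anything between two requests.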

To solve these problems two solutions were under discussion:

  1. Development of a new OMNIS-HTTP server based on a WWW library of common code developed by CERN. For reasons internal to OMNIS, a separate OMNIS-HTTP server with a different HTTP address has to be started for each OMNIS database. Each OMNIS-HTTP server is in permanent connection with its OMNIS database and translates incoming requests into database queries. The resulting data is then transformed back into HTML and returned to the WWW client (see Figure 4).

Figure 4: Independent OMNIS-HTTP servers

  2. Using an existing HTTP server which calls gateway applications (see section 3.1). Such an OMNIS-HTTP gateway application consists of a stand-alone OMNIS server for each database and a single CGI program. Each OMNIS server has a permanent connection to a single database within a permanent database session. The CGI program passes the WWW client's requests to one of the OMNIS servers, which translates them into OMNIS queries for the database. The query results are sent back to the CGI program, which transforms them into a format readable by the WWW client (see Figure 5).

Figure 5: OMNIS-HTTP gateway application

The advantage of the first solution is that all features of the HTTP protocol, like user authorization or multi-language support, can be utilized. Additionally, there is no need for a further OMNIS server. The disadvantage is the already mentioned need for a separate OMNIS-HTTP server with its own address for each database.

The second solution offers the capability to use one gateway program with the same base URL address for all databases. In the future an automatic load balancing of OMNIS servers for one database would be easy to implement. Existing HTTP servers already have built-in features for auditing and user authorization and it is not necessary for the application to deal with HTTP protocol details. Unfortunately the current CGI standard offers only limited information about the client's request to the gateway application. Within the next years the HTTP protocol will be improved. Since the OMNIS gateway program is a standard CGI application it will still be able to work with improved HTTP servers which can use new features like data encryption and automatic accounting.

Finally, the second solution was implemented [12].

3.3 Solution Concepts and Implementation

A major point in the realization of the OMNIS-HTTP gateway was to make it easy to customize the pages and forms for different types of OMNIS databases. Since many databases should be available via WWW the gateway application has to provide easy methods for the administrator to add and change database parameters. For these purposes a special Mask Definition Language was developed. The masks consist of the fixed HTML fragments which are filled by the gateway application and returned to the client as request results. These fragments also contain hidden queries for the OMNIS server and control commands for the CGI program.

Retrieval requests from WWW clients are handled as follows (see Figure 6):

  1. The HTTP daemon receives an HTTP request (e.g., a fulltext query) from a client. If the URL refers to a CGI program, it is executed (e.g., for query processing). The communication takes place as described in section 3.1.
  2. The CGI program analyzes the client's request and reads a global configuration file. This file contains OMNIS server addresses and defines how resulting HTML pages are to be generated.
  3. The result is produced in one of three different ways, depending on the client's type of request:
       a. The requested page is not generated by the gateway program itself. A line in the configuration file refers to an existing page in the WWW addressing space. The gateway program causes the HTTP server to send a command to the client to load this specific page.
       b. The requested page is generated by the gateway program. A line in the configuration file points to a mask which contains commands for the CGI program or queries for OMNIS servers. These masks have to be accessible in the file system and are parsed by the CGI program. Encoded queries and commands are executed. The resulting output is an HTML page which is transferred to the HTTP daemon and sent back to the client.
       c. The requested page delivers binary data like an image, PostScript, or plain text. These data are loaded from the OMNIS server and are transformed into common formats by the CGI program. The result in MIME format is given to the HTTP daemon and sent back to the client.
  4. The client presents the received data.

Figure 6: Interplay of HTTP server, CGI program, and OMNIS server
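
The three ways of producing a result can be sketched in Python as follows. The configuration entries, page names, file paths and the example URL are hypothetical, since the format of the configuration file is not described here; the callables render_mask and fetch_binary stand in for mask processing and for loading data from an OMNIS server.

```python
from dataclasses import dataclass


@dataclass
class PageRule:
    kind: str    # "redirect", "mask", or "binary"
    target: str  # URL, mask file path, or MIME type


# Hypothetical configuration: page name -> rule.
CONFIG = {
    "help": PageRule("redirect", "http://www.example.org/omnis-help.html"),
    "search": PageRule("mask", "masks/search.html"),
    "image": PageRule("binary", "image/gif"),
}


def respond(page, render_mask, fetch_binary):
    """Return (headers, body) for one request, following the three cases."""
    rule = CONFIG[page]
    if rule.kind == "redirect":
        # A CGI "Location" header makes the HTTP server tell the client
        # to load the referenced page instead.
        return [("Location", rule.target)], b""
    if rule.kind == "mask":
        # The mask is parsed and filled in, yielding an HTML page.
        html = render_mask(rule.target)
        return [("Content-Type", "text/html")], html.encode("utf-8")
    if rule.kind == "binary":
        # Binary data is passed on under its MIME type.
        return [("Content-Type", rule.target)], fetch_binary(page)
    raise ValueError(f"unknown rule kind: {rule.kind}")


if __name__ == "__main__":
    print(respond("help", lambda path: "", lambda page: b""))
```

Keeping the decision in a table that the administrator can edit matches the stated goal of adding databases without changing the gateway program.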

Demands on the mask definition language:

The concept of the mask definition language should allow reading and editing mask pages with ordinary HTML browsers or editors.

So as not to be limited by the maximum command length and syntax of SGML (Standard Generalized Markup Language) [15], on which HTML is based, it was necessary to hide commands for the OMNIS gateway application in SGML comments. For this reason these commands start with <!-- OMNIS and end with -->. To extract information from the client's request (e.g., query string, query ID, database name, etc.), all data delivered by HTTP GET or POST requests and parts of the URL address are stored in internal variables of the OMNIS gateway application. These are used or modified by commands and can be inserted at any place in the resulting HTML document.
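
The idea of commands hidden in SGML comments can be sketched in Python. Only the comment delimiters are taken from the text above; the command name INSERT and its syntax are invented for this illustration, because the actual command set of the mask definition language is not listed here.

```python
import re
from html import escape

# Commands are hidden in comments of the form <!-- OMNIS ... -->.
COMMAND = re.compile(r"<!--\s*OMNIS\s+(.*?)\s*-->", re.DOTALL)


def expand_mask(mask: str, variables: dict) -> str:
    """Replace each embedded command with its result.

    Only one hypothetical command is implemented: "INSERT name", which
    inserts the value of an internal variable. Unknown commands are
    removed from the output; ordinary comments are left untouched.
    """
    def run(match):
        words = match.group(1).split()
        if len(words) == 2 and words[0] == "INSERT":
            return escape(variables.get(words[1], ""))
        return ""

    return COMMAND.sub(run, mask)


if __name__ == "__main__":
    mask = "<h1>Results for <!-- OMNIS INSERT query --></h1>"
    print(expand_mask(mask, {"query": "database machine"}))
```

Since an HTML browser ignores comments, a mask written this way can still be opened and edited with ordinary HTML tools, which is the demand stated above.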

There are commands for the OMNIS server to start a query, get structure field information for one specified document or documents found by one specific query, get document-text, get a list of existing pictures and get other binary data.

In contrast to the CGI program, the OMNIS server is started only once for each OMNIS library and handles incoming requests sequentially. The gateway program is executed for each client request. When HTTP requests follow each other quickly, the HTTP server can start the CGI program several times to run in parallel. To relieve the OMNIS server, time-consuming functions like the transformation of images from an OMNIS internal format to GIF are done by the CGI program.
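
The interplay of one sequential server and several parallel gateway programs can be imitated in Python. In this sketch threads stand in for the separate processes of the real system, and the class and method names are hypothetical; it illustrates the request ordering only and says nothing about the actual communication between CGI program and OMNIS server.

```python
import queue
import threading


class SequentialServer:
    """Handles requests one at a time, like one OMNIS server per library."""

    def __init__(self, handler):
        self._handler = handler
        self._requests = queue.Queue()
        threading.Thread(target=self._serve, daemon=True).start()

    def _serve(self):
        # A single loop processes the queued requests in arrival order.
        while True:
            request, reply = self._requests.get()
            reply.put(self._handler(request))

    def call(self, request):
        """Called by each gateway instance; blocks until the answer arrives."""
        reply = queue.Queue(maxsize=1)
        self._requests.put((request, reply))
        return reply.get()


if __name__ == "__main__":
    from concurrent.futures import ThreadPoolExecutor

    server = SequentialServer(str.upper)  # stand-in for query processing
    with ThreadPoolExecutor(max_workers=3) as pool:
        print(list(pool.map(server.call, ["gauss", "boral", "basili"])))
```

Work done before or after the call, such as image conversion, runs in the callers and therefore in parallel, which is the reason for moving it out of the server.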

4 Summary

In this paper the problem of high indexing costs in traditional library systems was discussed. A short overview of the architecture of the OMNIS digital library system was given, and it was explained how the OMNIS document archiving and retrieval facilities solve this problem.

During the last years, many OMNIS databases have been filled with huge amounts of scientific literature. To make these information sources accessible to the growing community of World-Wide Web users, an OMNIS-WWW gateway was developed and is in operation today. The implementation is based on the CGI (Common Gateway Interface) definition for HTTP servers. A CGI program is connected to an HTTP server as well as to several OMNIS database servers. It distributes incoming client requests to OMNIS servers and returns their query results to the HTTP server.

5 References

[1]
B. Lincoln: WAIS Bibliography, Technical Report, Thinking Machines, ftp://quake.think.com/pub/wais/wais-discussion/bibliography.txt, August 1991
[2]
Adobe Systems Inc.: PostScript Language: Reference Manual. Adobe Systems Inc., 1991
[3]
Anklesaria F., et al.: The Internet Gopher Protocol, Internet RFC 1436, March 1993
[4]
Bayer, R.: OMNIS/Myriad: Electronic Administration and Publication of Multimedia Documents. In: Informatik, Wirtschaft und Gesellschaft. 23. GI-Jahrestagung, Springer Verlag, Dresden, 1993
[5]
R. Bayer, W. Kowarschick, Ch. Roth, P. Vogel, S. Wiesener: OMNIS/Myriad on its Way to a Full Hypermedia System, EITC 94 (European Information Technology Conference), 1st Workshop on Human Comfort and Security, June 1994, Brussels
[6]
Bayer, R., Vogel, P., Göttsch, H.: The Munich Metropolitan Area high Speed Network of digital Libraries, Internal Report, Technische Universität München, Munich, 1994
[7]
R. Bayer, P. Vogel, S. Wiesener: OMNIS/Myriad Document Retrieval and Its Database Requirements, DEXA 94 (Database and Expert Systems Applications), 5th International Conference, Proceedings, Springer Verlag, 1994
[8]
Berners-Lee T.: Hypertext Markup Language (HTML), http://info.cern.ch/hypertext/WWW/MarkUp/MarkUp.html, January 1993
[9]
Berners-Lee T.: Protocol for the Retrieval and Manipulation of Textual and Hypermedia Information, http://info.cern.ch/hypertext/WWW/Protocols/HTTP/HTTP2.html, 1993
[10]
Berners-Lee T.: Uniform Resource Locators, http://info.cern.ch/hypertext/WWW/Addressing/Addressing.html, January 1993
[11]
Borenstein N., Freed N.: MIME (Multipurpose Internet Mail Extensions): Mechanisms for Specifying and Describing the Format of Internet Message Bodies, Internet RFC 1341, June 1992
[12]
Clausnitzer, A.: Realization of WWW and ASCII Interfaces for the OMNIS Retrieval Component, Master Thesis, Technical University of Munich, November 1994
[13]
Connolly, D.: HyperText Markup Language, Internet Draft, July 1993
[14]
Frystyk H., Lie H. W.: Towards a Uniform Library of Common Code, The Second International WWW Conference, Chicago, 1994
[15]
International Organization for Standardization: Information Processing - Text and Office Systems, Standard Generalized Markup Language (SGML), ISO 8879 1986, Geneva, 1988
[16]
Kantor B., Lapsley P.: A proposed standard for the transmission of news, Internet RFC 977
[17]
Meyer-Wegener, K.: Multimedia Databases. B. G. Teubner Verlag, Stuttgart, 1991
[18]
NCSA: NCSA httpd, http://hoohoo.ncsa.uiuc.edu/docs/, 1995
[19]
Postel J., Reynolds J.: File Transfer Protocol, Internet RFC 959, October 1985
[20]
TransAction Software GmbH: Myriad System and Administration Guide, Munich, 1993