Searching heterogeneous multilingual bibliographic sources^*

Ling Cao, Mun-Kew Leong, Ying Lu and Hwee-Boon Low

Kent Ridge Digital Labs,
21 Heng Mui Keng Terrace, Singapore 119597, Singapore
caoling@krdl.org.sg, mkleong@krdl.org.sg and hweeboon@krdl.org.sg

Abstract: A crucial issue in the design of global Internet systems is the presence of many languages for data exchange, indexing, searching, and retrieval. It is especially significant in the library community where there is a trend towards incorporating multilingual bibliographic collections into regional and international databases for resource sharing through the WWW. Such systems are distributed and heterogeneous, with libraries using different proprietary systems and various flavours of standards such as MARC. This paper proposes an architecture for searching distributed heterogeneous multi-Asian language bibliographic sources, and describes a successful pilot implementation of the system.
Keywords: Distributed search; Multilingual retrieval; Bibliographic; Digital library; Z39.50

^*This project was driven by the National Computer Board, and delivered to the National Library Board with support from the National Science and Technology Board.

1. Introduction

There is increasing interest in establishing a service infrastructure for searching distributed heterogeneous multilingual (especially Asian language) bibliographic sources regionally and worldwide [1]. Users benefit from being able to issue a single query through a single interface and obtain appropriate information from multiple sources, instead of repeatedly searching library after library, each with its own idiosyncratic interface. Thus, issues in multilingual cataloguing, indexing and searching, interoperability and data exchange become more important. There are many character encoding schemes, especially for Asian character-based languages. For example, Chinese script is variously encoded by such schemes as GB, HZX, Big5, EACC and Unicode. Handling character sets for multiple languages is a pervasive problem [2].

Additionally, the diversity of library sources and the variation of data formats make resource sharing difficult. Moreover, multilingual, especially Asian language, searching capabilities vary from system to system. Figure 1 shows two quite different examples of support for Chinese language and searching capabilities in bibliographic records.

Fig. 1. MARC records with different Chinese language support.

2. The CLib (Chinese Library) system

To demonstrate the searching of distributed heterogeneous multi-Asian language bibliographic databases, a pilot project, the Chinese Library (CLib) system (http://mentor.krdl.org.sg:8080), was developed in Singapore. In the first phase, language support was limited to Chinese bibliographic information, and two university libraries and one public library were selected as test sites. These libraries each used different library systems, different language encoding schemes and different implementations of MARC, and supported different language searching capabilities (see Figs. 1 and 2).

Fig. 2. The CLib system overview.

3. System architecture

An extensible 3-tier client/server architecture was designed (Fig. 3) for searching heterogeneous multilingual bibliographic sources [3]. It comprises the following high-level modules: CGI program, language server, CLib server, Z39.50 client, CLib client and CLib multilingual search engine.

Fig. 3. The system architecture.

Accepting multilingual query

The multilingual Web interface of the CLib system allows users to select their preferred encoding scheme (see Fig. 4) and ensures that the selection is correct. The system provides four search fields: title, author, subject and keyword (see the query form in Fig. 5). Support for Unicode (UTF-8) at the interface level makes it possible to accept and display multiple scripts beyond English and Chinese.

Fig. 4. Ensuring a matching encoding scheme.

Fig. 5. The Chinese bibliography search form.

Besides receiving multilingual (including mixed-language) queries from the HTTP server and returning multilingual Web pages to it, the CGI program calls the language server to convert the query from the incoming encoding scheme to Unicode and to convert the dynamically generated Web pages back into the user-selected encoding scheme.
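The round trip performed by the language server can be illustrated with a minimal sketch. It is not the CLib implementation: the scheme names, the `CODECS` table and both function names are assumptions made for illustration, and only encodings that ship with Python's standard codecs are covered (EACC, for instance, is not).

```python
# Hypothetical mapping from user-facing scheme names to Python codec names.
CODECS = {"GB": "gb2312", "Big5": "big5", "HZ": "hz", "UTF-8": "utf-8"}


def to_unicode(raw: bytes, scheme: str) -> str:
    """Decode an incoming query from the user-selected scheme into Unicode."""
    return raw.decode(CODECS[scheme])


def from_unicode(text: str, scheme: str) -> bytes:
    """Encode a generated page back into the user-selected scheme.

    Characters the target scheme cannot represent are replaced with '?'
    rather than raising an error.
    """
    return text.encode(CODECS[scheme], errors="replace")


# A mixed-language query arriving in GB and a reply sent back in GB.
incoming = "Java 图书馆".encode("gb2312")
query = to_unicode(incoming, "GB")
page = from_unicode("Results for: " + query, "GB")
```

Working in Unicode between the two conversions means the rest of the pipeline never has to know which scheme the user chose.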

Meta-data repository

The meta-data repository holds the format of each bibliographic source available to the system, i.e., the library servers. This meta-information includes server type, language and encoding(s), searching capability, services supported, data format, MARC mapping table and any other attributes. It is used to distribute the user queries to the right server in the right language and format, and also to merge the retrieved results and to extract title information from appropriate fields in the respective MARC records.
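One way to picture such a repository is as a small table of per-source profiles. The sketch below is only an assumption about its shape: the source names, the `SourceProfile` fields and the choice of title fields (245 $a as in MARC 21, 200 $a as in UNIMARC) are illustrative and do not describe the three pilot libraries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceProfile:
    """Hypothetical meta-information kept for one library server."""
    name: str
    server_type: str          # "z3950" or "clib"
    encoding: str             # internal encoding of the source
    search_fields: frozenset  # fields the source can search
    title_field: tuple        # (MARC tag, subfield) holding the title


REPOSITORY = {
    "lib_a": SourceProfile("lib_a", "z3950", "big5",
                           frozenset({"title", "author"}), ("245", "a")),
    "lib_b": SourceProfile("lib_b", "z3950", "gb2312",
                           frozenset({"title", "author", "subject"}), ("200", "a")),
    "clib_db": SourceProfile("clib_db", "clib", "utf-8",
                             frozenset({"title", "author", "subject", "keyword"}),
                             ("245", "a")),
}


def sources_for(field: str) -> list:
    """Names of the sources able to search the given field, in sorted order."""
    return sorted(p.name for p in REPOSITORY.values() if field in p.search_fields)


def extract_title(source: str, record: dict) -> str:
    """Read the title from the field that this source's MARC flavour uses."""
    tag, subfield = REPOSITORY[source].title_field
    return record[tag][subfield]
```

Here a record is modelled as a nested dictionary of tags and subfields; a real system would parse MARC records instead.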

Distributing multilingual query

When the CLib server receives a query in any language supported by Unicode, or in a mixture of languages, it customizes the query for each target based on the meta-information and dispatches it concurrently to the selected library servers through the Z39.50 client or the CLib client. The CLib client communicates with the CLib multilingual bibliographic database, which runs on top of an in-house Unicode-based search engine. This database provides advanced search and retrieval features and full multilingual searching in all bibliographic fields, with the same functionality as the other Z39.50 search servers.
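The customize-and-dispatch step might look roughly like the following sketch. It assumes a thread pool for concurrency, which the paper does not specify, and `fake_search` merely stands in for a real Z39.50 or CLib client call; the source names and encodings are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-source settings, as they might be read from the repository.
SOURCES = {
    "lib_a": {"encoding": "big5", "client": "z3950"},
    "lib_b": {"encoding": "gb2312", "client": "z3950"},
    "clib_db": {"encoding": "utf-8", "client": "clib"},
}


def fake_search(name: str, field: str, payload: bytes) -> dict:
    """Stand-in for a client call; it only reports what would be sent."""
    return {"source": name, "field": field, "bytes_sent": len(payload)}


def dispatch(query: str, field: str) -> list:
    """Re-encode the Unicode query per source and search all sources concurrently."""
    def search_one(name: str) -> dict:
        encoding = SOURCES[name]["encoding"]
        payload = query.encode(encoding, errors="replace")
        return fake_search(name, field, payload)

    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        # map() returns results in the order of SOURCES, not completion order.
        return list(pool.map(search_one, SOURCES))


results = dispatch("中文", "title")
```

Threads suit this step because each search spends most of its time waiting on a remote server.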

Merging heterogeneous results

The CLib server receives the search results from the Z39.50 and CLib clients in each library's own flavour of MARC and internal encoding scheme. CLib uses Unicode internally and as an interlingua. The MARC records received from the various libraries are converted to Unicode, and the corresponding directory and leader information is rebuilt, since re-encoding changes the field lengths and offsets recorded there. The results are then collated, merged and, if required, sorted (by user-selectable fields and sort orders). Lastly, the results can be displayed at various levels of detail (e.g., Fig. 6).

Fig. 6. Collated and sorted titles.
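The unify, merge and sort steps can be sketched as follows. This is a simplification under stated assumptions: records are reduced to (title, author) byte pairs rather than full MARC records, the function names are hypothetical, and sorting uses plain Unicode code point order, which is not a linguistically meaningful collation for Chinese.

```python
def unify(raw_records: list, encoding: str, source: str) -> list:
    """Decode one library's (title, author) byte pairs into Unicode records."""
    return [
        {"title": title.decode(encoding),
         "author": author.decode(encoding),
         "source": source}
        for title, author in raw_records
    ]


def merge(result_sets: list, sort_field=None, descending: bool = False) -> list:
    """Concatenate per-library result lists and optionally sort them."""
    merged = [record for records in result_sets for record in records]
    if sort_field is not None:
        # Code point order only; a real system would apply a proper collation.
        merged.sort(key=lambda r: r[sort_field], reverse=descending)
    return merged


# Two hypothetical libraries returning records in different encodings.
from_a = unify([("紅樓夢".encode("big5"), "曹雪芹".encode("big5"))], "big5", "lib_a")
from_b = unify([("三国演义".encode("gb2312"), "罗贯中".encode("gb2312"))], "gb2312", "lib_b")
merged = merge([from_a, from_b], sort_field="title")
```

Once every record is in Unicode, merging reduces to ordinary list handling, which is the practical benefit of using Unicode as the interlingua.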

4. Conclusions

We have presented an approach to searching distributed heterogeneous multilingual bibliographic databases. It provides the framework to distribute the multilingual queries and merge the retrieved search results from heterogeneous multilingual bibliographic databases. This approach makes use of the maximal multilingual searching capabilities of each distributed server and provides alternative mechanisms if none are available.

References