Putting Legacy Data on the Web:
A Repository Definition Language

Leon Shklar, Kshitij Shah, and Chumki Basu

Bell Communications Research, 444 Hoes Lane, Piscataway, NJ 08854 and
Computer Science Department, Rutgers University, New Brunswick, NJ 08903
shklar@cs.rutgers.edu


Abstract:
The objective of InfoHarness [shk94] is to provide integrated and rapid access to huge amounts of heterogeneous legacy information through WWW browsers. This is achieved with the help of metadata that contains information about the type, representation, and location of physical data. The proposed InfoHarness Repository Definition Language (IRDL) aims to simplify the metadata generation process. It provides high flexibility in associating typed logical information units with portions of physical data and in defining relationships between these units. The proposed stable abstract class hierarchy provides support for statements of the language that introduce new data types, as well as new indexing technologies.

1.0 Introduction

The InfoHarness (TM) system [shk94] has been designed to provide rapid access to huge amounts of heterogeneous information without any relocation or restructuring of data. We have developed and synthesized the ideas of other researchers [and94, boh94, fis91, gar93, gro94, hsu91, jai94, kiy94, sho93] to use metadata for providing advanced search and browsing capabilities without imposing constraints on information suppliers and creators. We introduce the InfoHarness Repository Definition Language (IRDL) to simplify the generation of metadata used to encapsulate existing information. To support IRDL statements, we have proposed a stable abstract class hierarchy. This hierarchy need not be modified to define terminal classes that accommodate new types of information or utilize new indexing technologies.

IRDL's objective is to combine the flexibility of physical data object encapsulation with the convenience of a high-level language. Two main components of the language are responsible for introducing new data types and for defining structures of information repositories. The type definition component of the language is still under development and is not discussed in this paper. The structure definition component allows users to impose logical interpretations on physical data. In the current version of IRDL, the structure definition statements may either refer to types in the IRDL type library or to the MIME (Multipurpose Internet Mail Extensions) types.

The InfoHarness prototype is now operational and on trial at Bellcore for building software repositories, accessing geo-spatial data, and a variety of other applications. It provides access to the original information from Mosaic and other World-Wide Web (WWW) browsers through an HTTP gateway. The IRDL interpreter that converts statements of the language into metadata entities representing InfoHarness repositories is under construction.

The organization of the paper is as follows. In Section 2, we discuss the role metadata plays in the transparent access to new and existing information. In Section 2.1, we discuss the organization of metadata. In Section 2.2, we present the stable and extensible class hierarchy, which allows subclasses to inherit general methods from their superclasses and to customize them for their own use. In Section 3, we discuss the automatic generation of metadata entities that represent InfoHarness repositories. In Section 3.1, we define low-level metadata generation commands and argue that it is impractical to write these commands by hand for a real application. In Section 3.2, we introduce the specification of a high-level repository definition language. In Section 3.3, we present an example of applying this language to building repositories of C programs. Section 4 is devoted to related efforts. Conclusions and our plans for future work are presented in Section 5.

2.0 InfoHarness Repositories

The most important feature of InfoHarness is its ability to provide access to information without changes in the location and representation of data. This is achieved by creating metadata and associating it with physical information. Metadata for different media types is often defined as derived properties of the media, which are useful in information access or retrieval [che94]. An InfoHarness Repository does not contain the physical information itself but is composed of metadata entities that are described in Section 2.1. The generation of these entities is controlled by the structure definition statements of IRDL. In Section 2.2, we present the InfoHarness class hierarchy and its properties, which are crucial for the extensibility of the system and the future type definition component of the language.

2.1 Structure of Metadata

Metadata entities, which encapsulate units of physical information of interest to end-users, are called information units (IU). An IU may be associated with a file (e.g., a man page), a portion of a file (e.g., a C function or a section of a paper), a set of files (e.g., a collection of related bitmaps), or a request for the retrieval of data from an external source (e.g., a database query).

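An InfoHarness object (IHO) is defined recursively as one of the following (Figures 1 and 2):

  1. A single IU (a simple object).
  2. A set of references to other IHOs (a collection object).
  3. A single IU together with a set of references to other IHOs (a composite object).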

IHOs have unique object identifiers that are recognized and maintained by the system. Each IHO (with the exception of collection IHOs) stores the location of physical data and the data retrieval method, as well as additional parameters needed by the method to separate the relevant portion of information. For example, an IHO associated with a C function will contain the path information for the C file, the name and location of the program that knows how to separate out a function from a C file, and the name of the function to be passed to this program. In addition, each IHO may contain an arbitrary number of attribute-value pairs (e.g., owner, last update, security information, decompression method). An IHO that consists of a single IU is called a simple object.

Both collection IHOs and composite IHOs reference a set of other IHOs. Each such IHO contains the unique object identifiers of all set members (its children). An IHO that contains both an IU and a reference to a set of other IHOs is called a composite object. An example of a composite object is this paper's abstract combined with the set containing PostScript, HTML, and plain text versions of the full paper.

Collection IHOs may also point to independent indices that reference members of the encapsulated sets. We will further refer to such objects as indexed collections, and say that an IHO belongs to an indexed collection if it is a child of the collection object. Indexed collections store information about the location of both the index and the query method. Any collection IHO may make use of its own data retrieval method that is not part of InfoHarness. As a result, an InfoHarness Repository may easily be created from existing heterogeneous index structures.

An InfoHarness Repository (IHR) is a set of IHOs that are known to a single InfoHarness server. These IHOs are not necessarily pointed to from any single IHO. An IHO may be a child of any number of collection objects (its parents). Each IHO that has one or more parents always contains unique object identifiers of its parent objects. An IHO that does not have any parent is unreachable from any other IHO and may only be accessed if it is used as an initial starting point (or entry point) in the IHR traversal.
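The object model of this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `InformationUnit`, `IHO`, `Repository`, `link`, and `entry_points` are hypothetical and do not come from the InfoHarness implementation, and object identifiers are assumed to be plain strings.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class InformationUnit:
    """Hypothetical record: where the physical data lives and how to retrieve it."""
    location: str                       # e.g. the path of a C file
    retrieval_method: str               # e.g. a program that extracts one function
    parameters: List[str] = field(default_factory=list)


@dataclass
class IHO:
    """An InfoHarness object: an optional IU plus optional references to children."""
    oid: str
    unit: Optional[InformationUnit] = None
    children: List[str] = field(default_factory=list)   # identifiers of set members
    parents: List[str] = field(default_factory=list)    # identifiers of parent objects
    attributes: Dict[str, str] = field(default_factory=dict)

    def kind(self) -> str:
        # Classify the object by what it holds, following Section 2.1.
        if self.unit is not None and self.children:
            return "composite"
        if self.unit is not None:
            return "simple"
        return "collection"


class Repository:
    """A set of IHOs known to one server, keyed by object identifier."""

    def __init__(self) -> None:
        self.objects: Dict[str, IHO] = {}

    def add(self, iho: IHO) -> None:
        self.objects[iho.oid] = iho

    def link(self, parent_oid: str, child_oid: str) -> None:
        # Record the relationship in both directions.
        self.objects[parent_oid].children.append(child_oid)
        self.objects[child_oid].parents.append(parent_oid)

    def entry_points(self) -> List[str]:
        # Objects without parents can only be reached as traversal starting points.
        return sorted(oid for oid, obj in self.objects.items() if not obj.parents)
```

In this sketch, the paper's abstract linked to its PostScript and HTML versions would be reported as a composite object and as the only entry point, while each version would remain a simple object.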

2.2 InfoHarness Class Hierarchy

This section describes the InfoHarness class hierarchy (Figure 3). The Abstract Classes provide method sharing between groups of terminal classes. The stability and extensibility of the hierarchy are crucial for the type definition component of IRDL, which is used to define new terminal classes. The structure definition statements of the language refer to these classes as types. Interpreting the statements results in instantiating the terminal classes as InfoHarness objects. Examples of Terminal Classes include encapsulators of external viewers, as well as indexing technologies like Wide-Area Information Service (WAIS) [kah91] and Latent Semantic Indexing (LSI) [dee90].

The abstract class hierarchy shown in Figure 3 is constructed with a dual objective:

  1. To represent heterogeneous data and utilize various indexing technologies.
  2. To support server-side and client-side run-time processing of information.

The distinction between the Server Processing class and the Client Processing class is based on the differences in processing information, while the rationale for their subclasses is in the data representation. The abstract classes are discussed in subsections 2.2.1-2.2.3. The definition of new terminal classes is discussed in subsection 2.2.4.

2.2.1 Server Processing

For instances of terminal classes that are subclasses of the Server Processing abstract class, run-time access to the data is provided by running processes on the server. The subclasses of this class are the abstract classes Server Formatted Data and Executable. The subclasses of Server Formatted Data serve to access physical data by running external viewers. The abstract class Executable represents IUs that encapsulate application programs.

Terminal classes are not part of the InfoHarness implementation and are determined by the type-definition statements of IRDL that reflect design choices made by InfoHarness administrators. For example (Figure 3), a terminal class for man pages may be defined either as a subclass of Server Formatted Data or as a subclass of Client Formatted Data. The choice may depend on the availability of data and browsing tools, and on security considerations.

2.2.2 Client Processing

For instances of terminal classes that are subclasses of Client Processing, data is first transferred to the client and then processed. Most data types may be defined by instantiating either a subclass of Server Processing or a subclass of Client Processing. The exceptions are Audio and Text, which cannot be defined as subclasses of Server Processing.

The special treatment of audio files is determined by the need to play a recording on a client machine for it to be heard by an end-user. Data of this type is first transferred to a client machine and then directed to /dev/audio (for UNIX). For instances of Text, the information is passed on directly to HTTP clients. This class is a natural choice for representing plain text or HTML documents.

2.2.3 Collections

Instances of the Collection class are not directly associated with physical information. Instead, they represent relationships between IHOs. Instances of the Indexed Collection class are associated with an external index (Figure 2) that is used at run-time to select members of the collection. Instances of the Indexed Collection class are presented to an end-user as a query interface, while instances of Non-Indexed Collection are presented as a full list of members.

2.2.4 Type Extensibility

The InfoHarness class hierarchy is open in that new terminal classes may be defined to accommodate the vast variety of information. In the future, new terminal classes will be defined from abstract classes using the type definition component of IRDL. Currently, the new types are defined by customizing methods inherited from abstract classes. The abstract classes are permanent and form the core of our hierarchy, whereas the terminal classes are defined when necessary to meet specific requirements.

Similarly, new indexing strategies are supported by defining new terminal classes. The only additional complication is in maintaining the mapping between two member identifications: the one known to the indexing algorithm and the one known to InfoHarness (Figure 2). It is fairly easy to support WAIS, LSI, and similar technologies by defining appropriate terminal classes.
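The following Python sketch illustrates the idea of a fixed set of abstract classes extended only by terminal classes. It is an assumption-laden illustration, not the InfoHarness hierarchy itself: the method names `viewer`, `access`, `search`, and `query`, the placeholder program names, and the toy `KeywordCollection` (standing in for WAIS or LSI) are all hypothetical. The `index_to_iho` dictionary models the mapping between the identifier known to the indexing algorithm and the one known to InfoHarness.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class ServerProcessing(ABC):
    """Abstract class: data is processed by a program run on the server."""

    @abstractmethod
    def viewer(self) -> str:
        """Name of the external program; supplied by each terminal class."""

    def access(self, location: str) -> str:
        # Shared method, inherited unchanged by every terminal subclass.
        return f"server: {self.viewer()} {location}"


class ClientProcessing(ABC):
    """Abstract class: data is shipped to the client and processed there."""

    @abstractmethod
    def viewer(self) -> str:
        """Name of the client-side program; supplied by each terminal class."""

    def access(self, location: str) -> str:
        return f"client: fetch {location}, then {self.viewer()}"


class IndexedCollection(ABC):
    """Abstract class: members are selected at run time through an external index."""

    def __init__(self) -> None:
        # Identifier known to the index -> identifier known to InfoHarness.
        self.index_to_iho: Dict[str, str] = {}

    @abstractmethod
    def search(self, query: str) -> List[str]:
        """Return index-side identifiers of matching members."""

    def query(self, query: str) -> List[str]:
        return [self.index_to_iho[doc] for doc in self.search(query)]


# Terminal classes: added later without modifying the abstract classes above.
class ServerManPage(ServerProcessing):
    def viewer(self) -> str:
        return "man-formatter"          # placeholder program name


class ClientManPage(ClientProcessing):
    def viewer(self) -> str:
        return "man-viewer"             # placeholder program name


class KeywordCollection(IndexedCollection):
    """Toy stand-in for WAIS or LSI: the index is a dict of keyword -> doc ids."""

    def __init__(self, keywords: Dict[str, List[str]]) -> None:
        super().__init__()
        self.keywords = keywords

    def search(self, query: str) -> List[str]:
        return self.keywords.get(query, [])
```

The two man page classes show the administrator's choice between server-side and client-side processing for the same data, and the collection class shows that supporting a new index only requires a `search` method and a populated identifier mapping.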

3.0 Metadata Extraction

This section describes the creation of InfoHarness repositories by generating metadata entities based on both user requirements and physical data. As defined in Section 2.1, each InfoHarness Repository, or IHR, is a set of IHOs, some of which encapsulate physical data, and some of which represent sets of other IHOs. The creation of an IHR amounts to the generation of metadata entities that represent IHOs and to the indexing of physical information encapsulated by members of indexed collections.

The IHR creation is controlled by the IRDL Interpreter, which evaluates IRDL statements and issues commands for the Generator. The latter produces metadata entities that encapsulate and group together physical data.

In Section 3.1, we briefly discuss the metadata Generator and its commands. In Section 3.2, we introduce the specification of IRDL, which supports writing simple and concise definitions of information repositories. In Section 3.3, we discuss examples of applying IRDL to C programs.

3.1 InfoHarness Generator

In this section we discuss the Generator commands that control the creation of metadata entities, which either encapsulate physical data, or group other encapsulations into sets. In addition, the Generator is responsible for the creation of physical indices that reference members of indexed collections.

There are only three different Generator commands:

  1. The encapsulate command requires information about type and location of physical data. It returns a set of IHOs, each of which encapsulates a piece of data. Boundaries of these pieces are determined by the type. For example, an encapsulation command may refer to the type rmail and the location of an RMAIL file. The output in this example is a set of IHOs, each of which is associated with a separate mail message.
  2. The group command requires a set of pointers to individual IHOs and, optionally, the desired type of the index. The command generates an IHO associated with the collection, as well as parent-child and child-parent relationships between the collection IHO and each member of the input set. The optional type parameter determines the indexing technology to be used for indexing physical data associated with member IHOs. If the type is not specified, no index is created.
  3. The merge command requires an IHO and a set of references to additional IHOs. It produces a composite IHO that encapsulates the same physical data as the input IHO and contains the mentioned set of references.

The Generator does not have any flow control commands because flow control is determined by the Interpreter. It is possible to use the Generator on its own, but writing down all the commands by hand for even a simple application would be impractical. In practice, the Generator commands are produced by the IRDL Interpreter.
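The three commands can be sketched as methods of a small Python class. This is a minimal illustration under several assumptions: the class and field names are hypothetical, a raw string stands in for the location of physical data so that the example needs no files, each type is represented by a function that splits raw data into pieces, and the group command only records the requested index type instead of building an index.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Callable, Dict, List, Optional


@dataclass
class IHO:
    oid: int
    data: Optional[str] = None              # encapsulated piece of physical data
    children: List[int] = field(default_factory=list)
    parents: List[int] = field(default_factory=list)
    index_type: Optional[str] = None         # set only for indexed collections


class Generator:
    def __init__(self, separators: Dict[str, Callable[[str], List[str]]]) -> None:
        self.separators = separators         # type name -> data separation function
        self.objects: Dict[int, IHO] = {}
        self._ids = count(1)

    def _new(self, **kw) -> IHO:
        iho = IHO(oid=next(self._ids), **kw)
        self.objects[iho.oid] = iho
        return iho

    def encapsulate(self, type_name: str, raw: str) -> List[IHO]:
        # The type decides where one piece of data ends and the next begins.
        return [self._new(data=piece) for piece in self.separators[type_name](raw)]

    def group(self, members: List[IHO], index_type: Optional[str] = None) -> IHO:
        # Create the collection and both directions of the parent-child relation.
        coll = self._new(index_type=index_type)
        for member in members:
            coll.children.append(member.oid)
            member.parents.append(coll.oid)
        return coll

    def merge(self, iho: IHO, refs: List[IHO]) -> IHO:
        # Same physical data as the input IHO, plus the given set of references.
        composite = self._new(data=iho.data)
        composite.children.extend(ref.oid for ref in refs)
        return composite
```

For instance, `Generator({"lines": str.splitlines}).encapsulate("lines", "first\nsecond")` yields two objects, one per line, in the same way that the rmail type yields one IHO per mail message.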

3.2 IRDL Specification

The primary objective of IRDL is to provide help in building information repositories. It combines the usual features of a structured programming language with the non-procedural support for data encapsulation, set operations, and content-based indexing. In this section we describe the proposed language constructs and explain the rationale behind them. Section 3.3 will then provide some concrete examples of using IRDL to create repositories of C programs.

3.2.1 IRDL Programs

Any IRDL program is a sequence of declarations followed by a sequence of statements. A legal program must have at least one declaration and one statement.

<program> := begin; <body> end;
<body> := <declarations> <statements>
<declarations> := <declaration> | <declaration> <declarations>
<statements> := <statement> | <statement> <statements>

3.2.2 Declarations

A declaration may be a type declaration or a variable declaration.

<declaration> := <type_declaration> | <variable_declaration>

Type declarations are required for both collection types and data types. Collection types are associated with particular indexing technologies (WAIS, LSI, or user-defined). The declaration of data types (e.g., TXT, C) helps to utilize proper data encapsulation methods. Of course, all declared types must be available from the type library.

<type_declaration> := COLLTYPE <collection_types>; | DATATYPE <data_types>;
<collection_types> := <collection_type> | <collection_type>,<collection_types>
<collection_type> := SET | WAIS | LSI | <user_defined_type>
<data_types> := <data_type> | <data_type>,<data_types>

In variable declarations, the SET qualifier determines whether the interpreter returns a handle to a single element or to a set of elements. If the handle identifier represents a set, it may be iterated upon using iteration statements that are discussed later in this section. The language supports two built-in types: IHO and STRING. At this time, values of individual attributes are always treated as strings; therefore, we have not introduced any arithmetic types.

<variable_declaration> := VAR [SET] <built-in> <vars>;
<built-in> := IHO | STRING
<vars> := <var> | <var>,<vars>
<var> := <item> | <set>
<item> := <iho> | <string>
<set> := <iho_set> | <string_set>

3.2.3 Statements

A statement may be a variable assignment, a set iteration, or input/output. Assignments may be performed on individual elements and sets of elements. Selective access to individual set elements is provided by forall statements. Simple input/output directives promote the reuse of existing metadata entities.

<statement> := <assign> | <forall> | <input_output>

Assignment Statements
Assignment statements support the assignment of individual items and sets of items, where an item is either an IHO or a string. It is only possible to assign set expressions to set variables and individual items to item variables. IHO expressions may only be assigned to IHO variables and string expressions may only be assigned to string variables.

<assign> := <var> = <expression>;
<expression> := <item_expression> | <set_expression>
<item_expression> := <item> | <iho_expression> | <string_expression>
<set_expression> := <set> | <iho_set_expression> | <string_set_expression> | <set_union> | <set_intersection>

IHO Expressions
As defined in Section 2.1, the encapsulation of physical data is performed through information units, which are contained within simple and composite InfoHarness objects. The encapsulation process is controlled by type methods, where the types represent the desired interpretation of physical data. For example, a C file may be treated either as a text file or as a C program. The encapsulation is performed through the encapsulate statement of the language. It requires the desired interpretation, and either the location of physical data or a previously defined IHO (of different type), which encapsulates this physical data.

While simple objects may be created directly from information units, a collection object may only be created from a set of objects. Creating a composite object requires both an information unit and a set of other IHOs. Consider the example of representing a C program. One alternative is to create composite IHOs that encapsulate C files and reference simple IHOs, which encapsulate individual functions that occur in these files. The creation of composite objects is performed through the combine statement that takes an IHO and a set and returns a composite object.

<iho_set_expression> := <encapsulate>
<iho_expression> := <combine_expression> | <index_expression>
<encapsulate> := ENCAP[SULATE] <data_type> <location> | ENCAP[SULATE] <data_type> <iho>
<combine_expression> := COMBINE <iho> <iho_set_expression>

As defined in Section 2.1, a collection IHO does not directly encapsulate physical data, but instead contains references to other IHOs. If the collection IHO is indexed, it contains information about the index location and about the proper query method. The creation of collection IHOs is performed by the index operation that takes the collection type (LSI, WAIS, etc.), a set of IHOs, and the desired location.

<index_expression> := INDEX <collection_type> <iho_set_expression> <location>

String Expression
String expressions may be string constants, attribute expressions, or legal Perl [wal91] strings. Attribute expressions extract values of IHO attributes.

<string_expression> := <string_constant> | <attribute_expression> | <Perl_string_expression>
<string_set_expression> := <attribute_set_expression>
<attribute_expression> := ATTR <iho> <string>
<attribute_set_expression> := ATTR <iho_set_expression> <string>

Set Expression
Given the provision for sets in the language, there has to be a way to compute set unions and intersections. The union operation may be used to merge two sets, to add an item to a set, or to convert a single item into a one-item set. The intersection operation may be used to compute common members of two sets.
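The three uses of union and the single use of intersection described above can be modelled with Python frozen sets. This is only a sketch of the intended semantics: it assumes that sets are unordered and that items are identified by strings, and the function names `as_set`, `set_union`, and `set_intersection` are hypothetical.

```python
from typing import FrozenSet, Optional, Union

Item = str
ItemSet = FrozenSet[str]


def as_set(value: Union[Item, ItemSet]) -> ItemSet:
    """Promote a single item to a one-item set; leave sets unchanged."""
    return value if isinstance(value, frozenset) else frozenset([value])


def set_union(first: Union[Item, ItemSet], second: Optional[ItemSet] = None) -> ItemSet:
    """Model the union forms: one item, an item added to a set, or two sets merged."""
    result = as_set(first)
    return result if second is None else result | second


def set_intersection(first: ItemSet, second: ItemSet) -> ItemSet:
    """Model the intersection form: the members common to both operands."""
    return first & second
```

With `a = frozenset({"f1", "f2"})` and `b = frozenset({"f2", "f3"})`, `set_union("f9", a)` adds one item to a set, `set_union(a, b)` merges the two sets, and `set_intersection(a, b)` keeps only `"f2"`.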

<set_union> := { <item> [,<set_expression>] } | { <set_expression>,<set_expression> }
<set_intersection> := { <set_expression>&<set_expression> }

Iteration Statements
Selective access to set members is provided by the forall statement of the language. The such that clause of the statement supports selectivity by excluding members that do not meet the boolean combination of conditions. Each condition may be defined as either a logical comparison or a pattern matching operation. Regular expressions have their usual definition and are not further explained.
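The selection behaviour of the such that clause can be illustrated in Python as a filter over set members. The sketch rests on assumptions that are not part of the language definition: each IHO is reduced to a dictionary of attribute-value pairs, conditions compare attribute values, and the helper names (`equals`, `matches`, `negate`, `both`, `either`, `forall`) are hypothetical. Pattern matching uses Python's `re.search`, which finds a match anywhere in the string.

```python
import re
from typing import Callable, Dict, Iterable, Iterator

Iho = Dict[str, str]                 # an IHO reduced to its attribute-value pairs
Condition = Callable[[Iho], bool]


def equals(attr: str, value: str) -> Condition:
    """Logical comparison (==) on one attribute."""
    return lambda iho: iho.get(attr) == value


def matches(attr: str, pattern: str) -> Condition:
    """Pattern matching (=~) on one attribute."""
    regex = re.compile(pattern)
    return lambda iho: regex.search(iho.get(attr, "")) is not None


def negate(cond: Condition) -> Condition:
    """NOT; applied to equals or matches it also gives != and !~."""
    return lambda iho: not cond(iho)


def both(a: Condition, b: Condition) -> Condition:
    """AND of two conditions."""
    return lambda iho: a(iho) and b(iho)


def either(a: Condition, b: Condition) -> Condition:
    """OR of two conditions."""
    return lambda iho: a(iho) or b(iho)


def forall(items: Iterable[Iho], such_that: Condition = lambda iho: True) -> Iterator[Iho]:
    """Yield only the members that satisfy the condition; all members if none is given."""
    return (iho for iho in items if such_that(iho))
```

For example, `forall(files, both(matches("name", r"\.c$"), negate(equals("owner", "ls"))))` visits only those members whose name ends in `.c` and whose owner is not `ls`.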

<forall> := FORALL <item> IN <set> [SUCH THAT <conditions>] {<statements>}
<conditions> := <condition> | (<conditions>) | <condition> <bool_op> <conditions>
<condition> := <item> <comp_op> <item> | <string> <match_op> <regular_expr>
<bool_op> := AND | OR | NOT
<comp_op> := == | !=
<match_op> := =~ | !~

Input/Output Statements
The role of the input/output statements is to support the reading and writing of intermediate results and the writing of generated metadata entities.

<input_output> := <write> | <read>
<write> := WRITE <vars> [<location>];
<read> := READ <vars> [<location>];

3.3 Example

To better understand the language, consider the example of creating a repository of C programs. As noted in Section 2.1, the physical data associated with generated metadata entities is not part of the repository. As explained in Section 3.1, statements of IRDL that implement the repository get translated into Generator commands, which, in turn, generate metadata entities. In addition, the interpretation of index statements results in the generation of independent indices from the encapsulated physical data.

We discuss two alternative ways of representing C programs (Figures 4a and 4b). In both cases we encapsulate C files, as well as individual functions. We impose a specific interpretation on physical data by selecting a type, which in effect makes physical objects instances of particular terminal classes (Section 2.2).

Each class has a method responsible for separating portions of physical data that correspond to individual information units. In the case of the C class, this method separates individual functions and names them by their signatures.

To implement the repository whose structure is shown in Figure 4a, we need to perform the following steps:

  1. For each C file do the following:
    1. Create simple IHOs that encapsulate individual functions that occur in this file.
    2. Create a composite IHO that encapsulates the file and points to IHOs created in step 1.1.
  2. Create an indexed collection of the composite IHOs created in step 1, using LSI for indexing physical data.

The IRDL program that implements the structure in Figure 4a is shown in Figure 5.

BEGIN
  COLLTYPE LSI;
  DATATYPE TXT, C;
  VAR IHO: File_IHO, LSI_Collection;
  VAR SET IHO: File_IHO_SET, Function_IHO_SET;
  File_IHO_SET = ENCAP TXT "/u/kjshah/test/src";
  FORALL File_IHO IN File_IHO_SET {
    Function_IHO_SET = ENCAP C File_IHO;
    File_IHO = COMBINE File_IHO Function_IHO_SET;
    WRITE File_IHO, Function_IHO_SET;
  }
  LSI_Collection = INDEX LSI File_IHO_SET "/u/kjshah/db/c";
  WRITE LSI_Collection;
END

Figure 5. Sample IRDL program for the structure in Figure 4a

We begin by encapsulating C files as if they were just plain text. The Generator's encapsulate command invokes the TXT type's data separation method, which associates each file in the /u/kjshah/test/src directory with an information unit and creates a simple IHO for each unit. The result is assigned to the set variable File_IHO_SET and iterated over in the forall statement.

By abuse of notation, we will say that a function occurs in an IHO if it occurs in the file encapsulated by the IHO. For each IHO in the set, we first encapsulate individual functions that occur in this IHO, then convert the file IHO into a composite object by combining it with the set of newly created function IHOs, and use the write statement to output results.

Finally, we use the index statement that gets translated into the Generator group command. This command first creates a collection object that references composite IHOs that encapsulate individual files, and then builds a full-text index of these encapsulated files using the LSI technology. The write statement is once again used to output the results.
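To make the translation concrete, the following Python sketch lists the Generator commands that the program in Figure 5 would plausibly give rise to. It is an illustration only: the function name `commands_for_figure5` and the tuple encoding of commands are hypothetical, the mapping of COMBINE to the merge command is our reading of Section 3.1, and the dictionary `functions_by_file` stands in for the output of the C type's data separation method.

```python
from typing import Dict, List, Tuple

Command = Tuple[str, ...]


def commands_for_figure5(functions_by_file: Dict[str, List[str]],
                         source_dir: str,
                         index_type: str = "LSI") -> List[Command]:
    """List, in order, the Generator commands for the structure in Figure 4a."""
    # ENCAP TXT: one simple IHO per file in the source directory.
    commands: List[Command] = [("encapsulate", "TXT", source_dir)]
    composites: List[str] = []
    for path in sorted(functions_by_file):
        # ENCAP C: one simple IHO per function that occurs in the file.
        commands.append(("encapsulate", "C", path))
        # COMBINE: the file IHO becomes a composite that references its functions.
        commands.append(("merge", path, ",".join(functions_by_file[path])))
        composites.append(path)
    # INDEX: a collection of the composites, indexed with the chosen technology.
    commands.append(("group", ",".join(composites), index_type))
    return commands
```

For two files the sketch produces six commands: one encapsulation of the directory, an encapsulate and a merge command for each file, and a final group command. The count grows with the number of files, which is why writing such commands by hand quickly becomes impractical.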

To implement the repository whose structure is shown in Figure 4b, we need to perform steps that are similar to those for the structure in Figure 4a:

  1. For each C file do the following:
    1. Create simple IHOs that encapsulate individual functions that occur in this file.
    2. Convert function IHOs into composite objects by combining them with one-element sets that contain the file IHO.
  2. Create an indexed collection of the composite IHOs from step 1, using LSI for indexing physical data.

The IRDL program that implements the structure in Figure 4b is shown in Figure 6.

BEGIN
  COLLTYPE LSI;
  DATATYPE TXT, C;
  VAR IHO: File_IHO, Function_IHO, LSI_Collection;
  VAR SET IHO: File_IHO_SET, Function_IHO_SET, Join_IHO_SET;
  File_IHO_SET = ENCAP TXT "/u/kjshah/test/src";
  FORALL File_IHO IN File_IHO_SET {
    Function_IHO_SET = ENCAP C File_IHO;
    FORALL Function_IHO IN Function_IHO_SET {
      Function_IHO = COMBINE Function_IHO (File_IHO);
    }
    Join_IHO_SET = (Join_IHO_SET,Function_IHO_SET);
  }
  LSI_Collection = INDEX LSI Join_IHO_SET "/u/kjshah/db/c";
  WRITE LSI_Collection, Join_IHO_SET, File_IHO_SET;
END

Figure 6. Sample IRDL program for the structure in Figure 4b

This program is similar to the one in Figure 5. With only a few changes, we manage to represent the same physical data quite differently. We start off again by encapsulating C files as if they were just plain text and iterating over the individual IHOs. The difference is that for each file IHO, we first encapsulate the functions that occur in it, and then iterate over the function IHOs. Each function IHO is converted into a composite object by combining it with the one-element set that contains the file IHO. Composite function IHOs are accumulated in the Join_IHO_SET set variable and then indexed using the LSI technology.

4.0 Related Work

In this section, we present an overview of work by other authors that we consider relevant to our approach. In Section 4.1, we take a look at different strategies of modelling heterogeneous information. In Section 4.2, we briefly discuss some of the work related to metadata representation, extraction, and organization. In Section 4.3, we consider high-level modelling languages in various domains.

4.1 Modelling Strategies

There exist various methodologies for capturing the internal structure of heterogeneous data [con87, hal87, hal88, nie90, tom91]. Different kinds of modelling strategies have been tried to create, maintain and retrieve information from the structures used by methodologies that range from conventional node-link models to full-fledged object-oriented models [mau94]. The most important problems may be summarized as follows:

The conventional bare-bones models involve hand-crafting. They do not scale up and are prone to various inconsistencies. Also, the notion of a link as a physical entity forces users to interact with such systems at a very low, counter-intuitive level. In some cases, users may even have to be aware of the implementation details [dey90, van88].

At the same time, modelling strategies at the other end of the spectrum are quite heavyweight and unwieldy. Although these are more flexible and intuitive, they are too complicated for naive users. Instead of being aware of the implementation details, the user is forced to learn about the class hierarchy and its properties and structure the data accordingly. In addition, navigating such models is relatively complex.

Another approach takes the semantic database perspective [bey94, abr76, fro82]. Here, structures are stored as sequences of binary relations. The advantages, which include ease of navigation, the ability to incorporate large amounts of data seamlessly, and strict integrity checking, are offset by the fact that in most practical cases, the semantic net is difficult for anyone other than its creator to follow. Also, if more than one person is responsible for maintaining the information base, they all have to be semantically consistent.

The Manchester Multimedia Information System [odo91] employs object-oriented techniques to build and represent images and other media. Aggregation and generalization in building representations of information bases [har91] are aimed at discovering semantic relationships based on existing structures and not at employing these structures directly. Similarly, ACE (Aggregation Clustering with Exceptions) [har91] analyzes existing structures to derive higher level structures but has no provisions for their construction.

Graph grammars [ehr90] have also been used to build information structures. However, some serious problems have to be resolved before they can be employed as practical languages for large scale structure building. One of the problems is determining how subgraphs are to be "glued" into the original graph. Another is the non-determinism of grammar derivations. The GOOD visual transformation language [gys94] derives a lot from the graph grammar and avoids the unresolved problems, at the cost of limited expressive power. To make matters worse, implementing the GOOD language [gem93] involves incurring a great deal of overhead and requires a database management system itself.

4.2 Use of Metadata

Metadata is being used increasingly by researchers in multimedia and in text and structured databases as an aid in the quest for seamless interoperability. Kiyoki et al. [kiy94] implement a semantic associative search for images based on the keyword metadata representing the user's impression and the image's content. Anderson and Stonebraker [and94] have developed a metadata schema for satellite images. Jain and Hampapur [jai94] have proposed an intermediate representation for audio-visual information.

In InfoHarness, we have emphasized the extensive use of automatically generated metadata. Chen et al. [che94] define metadata as derived properties of the media which are useful for information access or retrieval. Bohm and Rakow [boh94] have classified metadata according to its nature and related it to its different intended purposes. They have also drawn a distinction between the metadata and its organization. The same perspective is reflected in the hierarchical organization of InfoHarness objects and classes as illustrated in Figure 3.

InfoHarness shares some of its objectives with the RUFUS system [sho93]. RUFUS has an extensible object-oriented data model, storage system, and associated search and display methods for a variety of predefined file types. The system automatically classifies data files and extracts type-specific attributes. RUFUS users can search, browse, filter and display previously analyzed data. However, there is no support for generating semantic graph structures as in Section 3.3. The corresponding metadata extraction process in InfoHarness does not deal with files but with information units, which may be associated with files, sets of files, or portions of these files. The latter approach is more flexible, providing finer control over interpreting the same data differently, depending on the local objectives.

Harvest [bow94, sch94] is a system that gathers and indexes information from multiple heterogeneous sources. Its main goal is to support the creation and use of topic-specific information providers. Harvest's Essence subsystem extracts metadata that contains content summaries and type-dependent attributes. Harvest focuses more on resource discovery and access than on heterogeneous data modelling. We feel that Harvest's information discovery techniques may be combined beneficially with InfoHarness' data encapsulation approach and methods of providing flexible control over the logical structure of information repositories.

4.3 Languages

Various modelling languages have been implemented for different domains. In particular, there has been a lot of work in network modelling, including AMPL [fou87], GAMS [ken87], LINGO [lin91], and LPL [hur89]. These languages and model generators take an algebraic specification as input and generate a model. Similarly, in InfoHarness a model is created from a specification (the IRDL script) and the existing data. The algebraic specification proved to be too rigorous for general use, so the GNGEN system [for94] provides a high-level language for model creation. Like IRDL, this language provides constructs for specifying nodes and relationships, but it involves much more detail, since arc weights, node costs, etc. must be specified.

SHSML [tay93] is a hybrid systems modelling language motivated by the development of embedded real-time software, which requires modelling and evaluation. This language provides specifications for different software modules, along with their input and output interfaces. Modules may be created out of sub-modules, making it possible to view the same software artifacts through different relationship models. However, unlike IRDL, no distinctions may be made between different modules, and only strict tree structures may be modelled.

CML [fal94] is a declarative, compositional modelling language for logically specifying the symbolic and mathematical properties of the structure and behavior of physical systems. This language has a LISP-style syntax and is used to define model fragments (transistors, resistors, etc.), which are used to build domain theories. For example, an amplifier may be modelled by composing these fragments into a graph structure. Expressive enough to denote fragments and their relationships, CML may also be used to denote causal relationships, initial values, qualitative differences, etc.

5.0 Conclusions and Future Work

As mentioned in the introduction, the primary objective of InfoHarness is to provide integrated, rapid, and transparent access to large amounts of heterogeneous information without it being relocated, restructured or reformatted in any way. We address this challenge by building meta-representations of the original information. To simplify this process, we have introduced IRDL, which is the primary focus of this paper. IRDL combines the flexibility of object encapsulation with the power and convenience of a simple high-level declarative language. Statements of IRDL, together with information contained in the physical data, determine the structure of InfoHarness repositories.

In this paper, we made a distinction between the structure definition and the type definition components of the language. We have concentrated on the structure definition component under the assumption that all type methods are available from an ad-hoc type library. Our future work is directed towards the type definition component of the language. We are also investigating ways to achieve better scalability of search by combining results of queries against independent indices. The initial public release of InfoHarness on the Internet is planned for the first quarter of 1995.
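As an illustration of what combining results from independent indices can look like, the sketch below merges ranked result lists with reciprocal rank fusion. This is an assumption made purely for illustration: the paper does not say which combination method is under investigation, and the function name `fuse`, the constant `k = 60`, and the unit identifiers are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def fuse(rankings: Iterable[List[str]], k: int = 60) -> List[str]:
    """Merge ranked lists of unit identifiers by reciprocal rank fusion.

    Each index contributes 1 / (k + rank) for every unit it returns, so a
    unit ranked well by several indices rises above one found by a single
    index. Ties are broken alphabetically to keep the output deterministic.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, unit_id in enumerate(ranking, start=1):
            scores[unit_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda u: (-scores[u], u))


# Hypothetical results of the same query against two independent indices.
full_text_hits = ["u3", "u1", "u7"]
keyword_hits = ["u1", "u9", "u3"]

merged = fuse([full_text_hits, keyword_hits])
```

Only ranks are used, so the indices do not need comparable scoring scales, which is convenient when each index is built with a different indexing technology.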

6.0 Acknowledgements

The authors want to thank Vipul Kashyap, Bob Mowry, Amit Sheth, Satish Thatte, Gomer Thomas, and Andrew Werth for comments and suggestions.

References:

[abr76]
J.R. Abrial, "Data semantics, database management", Proc. of IFIP TC2 Conf. (1976) Corgese, Corsica, North-Holland pp 1-59.

[and94]
J. Anderson and M. Stonebraker, "SEQUOIA 2000 Metadata schema for Satellite Images", (to appear) SIGMOD Record, special issue on Metadata for Digital Media, December 1994.
[boh94]
K. Bohm and T. Rakow, "Metadata for Multimedia Documents", (to appear) SIGMOD Record, special issue on Metadata for Digital Media, December 1994.
[bow94]
C.M. Bowman, P.B. Danzig, D.R. Hardy, U. Manber, and M.F. Schwartz, "Harvest: A scalable, customizable discovery and access system", TR CU-CS-732-94, Department of Computer Science, University of Colorado - Boulder.
[che90]
Y-F. Chen, M. Nishimoto and C. Ramamoorthy, "The C information abstraction system", IEEE Transactions on Software Engineering, March 1990.
[che94]
F. Chen, M. Hearst, J. Kupiec, J. Pederson and L. Wilcox, "Metadata for Mixed-Media Access", (to appear) SIGMOD Record, special issue on Metadata for Digital Media, December 1994.
[con87]
E. J. Conklin, "Hypertext: an introduction and survey", IEEE Computer Vol 20 No 9 (September 1987) pp 17-41.
[dee90]
S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer and R. Harshman, "Indexing by Latent Semantic Analysis", Journal of the American Society for Information Science, 41(6), 1990.
[dey90]
L. De Young, "Links considered harmful", Proc. ECHT'90, Versailles, Cambridge University Press (November 1990) pp 238-249.
[ehr90]
H. Ehrig, H.-J. Kreowski and G. Rozenberg, "Graph-Grammars and their applications to Computer Science", Lecture Notes in Computer Science 532, 1990.
[fal94]
B. Falkenhainer, A. Farquhar, D. Bobrow, R. Fikes, K. Forbus, T. Gruber, Y. Iwasaki and B. Kuipers, "CML: A Compositional Modeling Language", Draft.
[fis91]
G. Fischer and C. Stevens, "Information access in complex, poorly structured information spaces", Proceedings of the 1991 CHI Conference, 1991.
[for94]
M. Forster and P. Mevert, "A tool for network modeling", European Journal of Operational Research 72 (1994) 287-299.
[fou87]
R. Fourer, D.M. Gay and B.W. Kernighan, "AMPL: A Mathematical Programming Language", Technical Report 87-03 Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL, 1987/89.
[fro82]
R.A. Frost, "Binary-relational data structures", Computer Journal Vol 25 No 3 (1982) pp 358-367.
[gar93]
F. Garzotto, P. Paolini, and D. Schwabe, "HDM - A Model-Based Approach to Hypertext Application Design", ACM Transactions on Information Systems, 11(1), 1993.
[gem93]
M. Gemis, J. Paredaens and I. Thyssens, "A visual database management interface based on GOOD", Interfaces to Database Systems, Workshops in Computing, New York: Springer-Verlag, 1993, pp 155-175.
[gro94]
W. Grosky, F. Fotouhi and I. Sethi, "Content-Based Hypermedia - Intelligent Browsing of Structured Media Objects", (to appear) SIGMOD Record, special issue on Metadata for Digital Media, December 1994.
[gys94]
M. Gyssens, J. Paredaens, J. Van den Bussche and D. Van Gucht, "A graph oriented object database model", Knowledge and Data Engineering Vol 6 No 4 (1994) pp 572-579.
[hal87]
F. Halasz, T. Moran and R. Trigg, "Notecards in a nutshell", Proc. ACM CHI'87 (1987) pp 45-52.
[har91]
Y. Hara, A. Keller and G. Wiederhold, "Implementing hypertext relations through aggregations and relations", Proc. Hypertext 91, pp 75-90.
[hsu91]
C. Hsu, "The Meta-database Project at Rensselaer", SIGMOD Record, special issue on Semantic Issues in Multidatabases, 20(4), December 1991.
[hur89]
T. Hurlimann, "Reference manual for the LPL modeling language (Version 3.1)", Institute for Automation and Operations Research, University of Fribourg, 1989.
[jai94]
R. Jain and A. Hampapur, "Representations for Video Databases", (to appear) SIGMOD Record, special issue on Metadata for Digital Media, December 1994.
[kah91]
B. Kahle and A. Medlar, "An Information System for Corporate Users: Wide Area Information Service", Connexions - The Interoperability Report, 5(11), November 1991.
[ken87]
D. Kendrick and A. Meeraus, "GAMS. An introduction", Development Research Department. The World Bank, 1987.
[ker83]
L.D. Kerschberg, D. Marchand and A. Sen, "Information system integration: a metadata management approach", Proc. of the Fourth International Conference on Information Systems, Houston, TX, 1983, pp 223-239.
[kiy94]
Y. Kiyoki, T. Kitagawa and T. Hayama, "A meta-database System for Semantic Image Search by a Mathematical Model of Meaning", (to appear) SIGMOD Record, special issue on Metadata for Digital Media, December 1994.
[lin91]
LINDO systems, Inc., "LINGO optimization modeling language", Chicago, IL, 1991.
[mat93]
C.J. Matheus, P.K. Chan, and G. Piatetsky-Shapiro, "Systems for Knowledge Discovery in Databases", IEEE Transactions on Knowledge and Data Engineering, December 1993.
[mau94]
H. Maurer, N. Scherbakov, K. Andrews and P. Srinivasan, "Object-oriented modelling of hyperstructure: overcoming the static link deficiency", Information and Software Technologies, June 1994, pp 315-322.
[nie90]
J. Nielsen, "Hypertext and Hypermedia", Academic Press (1990).
[odo91]
M. O'Docherty and C. Daskalakis, "Multimedia information systems - the management and semantic retrieval of all semantic data types", Computer J. Vol 34 No 3 (1991) pp 225-238.
[seg87]
A. Segev and A. Shoshani, "Logical modeling of temporal data", SIGMOD quarterly (1987).
[shk94]
L. Shklar, S. Thatte, H. Marcus, and A. Sheth, "The InfoHarness Information Integration Platform", http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/shklar/shklar.html
[sho93]
K. Shoens, A. Luniewski, P. Schwarz, J. Stamos, and J. Thomas, "The Rufus System: Information Organization for Semi-Structured Data", Proceedings of the 19th VLDB Conference, Dublin, Ireland, 1993.
[str88]
L. A. Streeter and K. E. Lochbaum, "Who knows: a system based on automatic representation of semantic structure", Proceedings of RIAO 88: User-oriented context-based text and image handling, Massachusetts Institute of Technology, Cambridge, MA, 1988, pp.379-388.
[sch94]
M.F. Schwartz, C.M. Bowman, P.B. Danzig, D.R. Hardy, and U. Manber, "The Harvest Information Discovery and Access System", http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/schwartz.harvest/schwartz.harvest.html
[tay93]
J.H. Taylor, "Toward a Modelling Language Standard for Hybrid Dynamical Systems", Proc. of the 32nd Conference on Decision and Control, Texas, Dec 1993.
[tom91]
I. Tomek, S. Khan, T. Muldner, M. Nassar, G. Novak and P. Proszynski, "Hypermedia - Introduction and Survey", J. of MCA Vol 14 No 2 (April 1991) pp 63 - 103.
[van88]
A. van Dam, "Hypertext '87 Keynote Address", Comm. ACM Vol 31 No 7 (July 1988) pp 887-895.
[wal91]
L. Wall and R.L. Schwartz, "Programming Perl", O'Reilly and Associates (1991).
[yan88]
N. Yankelovich, B. Haan, N. Meyrowitz and S. Drucker, "Intermedia: The concept and construction of a seamless information environment", IEEE Computer, 21(1), January 1988.

Footnotes

(TM)
InfoHarness is a trademark of Bell Communications Research, Inc.