Putting Legacy Data on the Web:
A Repository Definition Language
Leon Shklar, Kshitij Shah, and Chumki Basu
Bell Communications Research,
444 Hoes Lane, Piscataway, NJ 08854 and
Computer Science Department,
Rutgers University,
New Brunswick, NJ 08903
shklar@cs.rutgers.edu
Abstract
The objective of InfoHarness [shk94] is to
provide integrated and rapid access to huge amounts of heterogeneous legacy
information through WWW browsers. This is achieved with the help of metadata
that contains information about the type, representation, and location of
physical data. The proposed InfoHarness Repository Definition Language (IRDL)
aims to simplify the metadata generation process. It provides high flexibility
in associating typed logical information units with portions of physical data
and in defining relationships between these units. The proposed stable abstract
class hierarchy provides support for statements of the language that introduce
new data types, as well as new indexing technologies.
The InfoHarness (TM) system
[shk94] has been designed to provide rapid access to
huge amounts of heterogeneous information without any relocation or
restructuring of data. We have developed and synthesized the ideas of other
researchers [and94,
boh94, fis91,
gar93, gro94,
hsu91, jai94,
kiy94, sho93]
to use metadata for providing advanced search and browsing capabilities
without imposing constraints on information suppliers and creators. We
introduce the InfoHarness Repository Definition Language (IRDL) to simplify
the generation of metadata used to encapsulate existing information.
To support IRDL statements, we have proposed a stable abstract class
hierarchy. This hierarchy need not be modified to define terminal classes that
accommodate new types of information or utilize new indexing technologies.
IRDL's objective is to combine the flexibility of physical data object
encapsulation with the convenience of a high-level language. Two main
components of the language are responsible for introducing new data types
and for defining structures of information repositories. The type
definition component of the language is still under development and is
not discussed in this paper. The structure definition component allows
users to impose logical interpretations on physical data. In the current
version of IRDL, the structure definition statements may either refer to
types in the IRDL type library or to the MIME (Multipurpose Internet Mail
Extensions) types.
The InfoHarness prototype is now operational and on trial at Bellcore
for building software repositories, accessing geo-spatial data, and a variety
of other applications. It provides access to the original information from
Mosaic and other World-Wide Web (WWW) browsers through an HTTP gateway. The
IRDL interpreter that converts statements of the language into metadata
entities representing InfoHarness repositories is under construction.
The organization of the paper is as follows. In Section 2,
we discuss the role metadata plays in the transparent access to new and
existing information. In Section 2.1, we discuss the
organization of metadata. In Section 2.2, we present the
stable and extensible class hierarchy, which allows subclasses to
inherit general methods from their superclasses and to customize them for
their own use. In Section 3, we discuss the automatic
generation of metadata entities that represent InfoHarness repositories. In
Section 3.1, we define low-level metadata generation
commands. We then argue that it is impractical to manually write down these
commands for a real application. In Section 3.2,
we introduce the specification of a high-level repository definition language.
In Section 3.3, we present an example of applying this
language to building repositories of C programs. Section 4
is devoted to related efforts. Conclusions and our plans for future work
are presented in Section 5.
The most important feature of InfoHarness is its ability to provide access to
information without changes in the location and representation of data. This
is achieved by creating metadata and associating it with physical information.
Metadata for different media types is often defined as derived properties of
the media, which are useful in information access or retrieval
[che94]. An InfoHarness Repository does not contain the
physical information itself but is composed of metadata entities that are
described in Section 2.1. The generation of these entities
is controlled by the structure definition statements of IRDL. In the following
subsection we present the InfoHarness class hierarchy and its properties,
which are crucial for the extensibility of the system and the future type
definition component of the language.
Metadata entities, which encapsulate units of physical information of interest
to end-users, are called information units (IU). An IU may be associated
with a file (e.g., a man page), a portion of a file (e.g., a C function or
a section of a paper), a set of files (e.g., a collection of related bitmaps),
or a request for the retrieval of data from an external source (e.g.,
a database query).
An InfoHarness object (IHO) is defined recursively as one of the
following (Figures 1 and 2).
- A simple InfoHarness object (Figure 1).
- A collection object that contains a set of pointers to other InfoHarness objects (Figure 2).
- A composite object that combines a simple IHO and a set of references to other IHOs (Figure 1).
IHOs have unique object identifiers that are recognized and maintained by the
system. Each IHO (with the exception of collection IHOs) stores the location of
physical data and the data retrieval method, as well as additional parameters
needed by the method to separate the relevant portion of information. For
example, an IHO associated with a C function will contain the path information
for the C file, the name and location of the program that knows how to
separate out a function from a C file, and the name of the function to be
passed to this program. In addition, each IHO may contain an arbitrary number
of attribute-value pairs (e.g., owner, last update, security information,
decompression method). An IHO that consists of a single IU is called a
simple object.
Both collection IHOs and composite IHOs reference a set of other IHOs. Each
such IHO contains the unique object identifiers of all set members
(its children). An IHO that contains both an IU and a reference to
a set of other IHOs is called a composite object. An example of the
composite object is this paper's abstract combined with the set containing
PostScript, HTML, and plain text versions of the full paper.
Collection IHOs may also point to independent indices that reference members
of the encapsulated sets. We will further refer to such objects as indexed
collections, and say that an IHO belongs to an indexed collection if it is
a child of the collection object. Indexed collections store information about
the location of both the index and the query method. Any collection IHO may
make use of its own data retrieval method that is not part of InfoHarness. As
a result, an InfoHarness Repository may easily be created from existing
heterogeneous index structures.
An InfoHarness Repository (IHR) is a set of IHOs that are known to a single
InfoHarness server. These IHOs are not necessarily pointed to from any single
IHO. An IHO may be a child of any number of collection objects
(its parents). Each IHO that has one or more parents always contains
unique object identifiers of its parent objects. An IHO that does not have any
parent is unreachable from any other IHO and may only be accessed if it is
used as an initial starting point (or entry point) in the IHR traversal.
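The definitions above can be illustrated with a small Python sketch. It is not part of InfoHarness: the class and field names (InformationUnit, IHO, oid, link, entry_points), the paths, and the retrieval method names are assumptions made for this illustration. It shows how the three kinds of IHO differ only in whether they carry an information unit and a set of children, and how entry points follow from the absence of parents.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class InformationUnit:
    """Encapsulation of one piece of physical data (never the data itself)."""
    location: str                     # where the physical data lives
    retrieval_method: str             # program that extracts the relevant portion
    parameters: List[str] = field(default_factory=list)  # arguments for that program


@dataclass
class IHO:
    oid: str                                           # unique object identifier
    unit: Optional[InformationUnit] = None             # absent for collection IHOs
    children: List[str] = field(default_factory=list)  # OIDs of referenced IHOs
    parents: List[str] = field(default_factory=list)   # OIDs of referencing IHOs
    attributes: Dict[str, str] = field(default_factory=dict)

    @property
    def kind(self) -> str:
        if self.unit is None:
            return "collection"
        return "composite" if self.children else "simple"


def link(repository: Dict[str, IHO], parent: str, child: str) -> None:
    """Record the parent-child and child-parent references between two IHOs."""
    repository[parent].children.append(child)
    repository[child].parents.append(parent)


def entry_points(repository: Dict[str, IHO]) -> List[str]:
    """IHOs without parents can only be reached as starting points of a traversal."""
    return sorted(oid for oid, iho in repository.items() if not iho.parents)


# Hypothetical repository: one C file, one of its functions, one collection.
repo = {
    "f1": IHO("f1", InformationUnit("/src/main.c", "extract_function", ["main"])),
    "c1": IHO("c1", InformationUnit("/src/main.c", "cat"), attributes={"owner": "kjshah"}),
    "s1": IHO("s1"),
}
link(repo, "c1", "f1")   # the file IHO becomes a composite object
link(repo, "s1", "c1")   # the collection references the file IHO
```

In this sketch the function IHO is simple, the file IHO is composite, and only the collection is an entry point, because it is the only object without a parent.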
This section describes the InfoHarness class hierarchy
(Figure 3). The Abstract Classes provide method
sharing between groups of terminal classes. The stability and extensibility of
the hierarchy are crucial for the type definition component of IRDL, which is
used to define new terminal classes. The structure definition statements of
the language would refer to these classes as types. Interpreting the
statements would result in instantiating the terminal classes as InfoHarness
objects. Examples of Terminal Classes include encapsulators of external
viewers, as well as indexing technologies like Wide-Area Information Service
(WAIS) [kah91] and Latent Semantic Indexing (LSI)
[dee90].
The abstract class hierarchy shown in Figure 3 is constructed with a dual
objective:
- To represent heterogeneous data and utilize various indexing technologies.
- To support server-side and client-side run-time processing of information.
The distinction between the Server Processing class and the Client
Processing class is based on the differences in processing information,
while the rationale for their subclasses is in the data representation. The
abstract classes are discussed in subsections 2.2.1-2.2.3.
The definition of new terminal classes is discussed in
subsection 2.2.4.
For instances of terminal classes that are subclasses of the Server
Processing abstract class, run-time access to the data is provided by
running processes on the server. The subclasses of this class are the abstract
classes Server Formatted Data and Executable. The subclasses of
Server Formatted Data serve to access physical data by running external
viewers. The abstract class Executable represents IUs that encapsulate
application programs.
Terminal classes are not part of the InfoHarness implementation and are
determined by the type-definition statements of IRDL that reflect design
choices made by InfoHarness administrators. For example (Figure 3), a terminal
class for man pages may be defined either as a subclass of Server
Formatted Data or as a subclass of Client Formatted Data. The
choice may depend on the availability of data and browsing tools, and on
security considerations.
For instances of terminal classes that are subclasses of Client
Processing, data is first transferred to the client and then processed.
Most of the data types may be defined by instantiating either a subclass of
Server Processing or a subclass of Client Processing. The
exceptions are Audio and Text, which cannot be defined as
subclasses of Server Processing.
The special treatment of audio files is determined by the need to play
a recording on a client machine for it to be heard by an end-user. Data of
this type is first transferred to a client machine and then directed to
/dev/audio (on UNIX). For instances of Text, the information is
passed on directly to HTTP clients, which makes this class a natural choice
for representing plain text or HTML documents.
Instances of the Collection class are not directly associated with
physical information. Instead, they represent relationships between IHOs.
Instances of the Indexed Collection class are associated with an
external index (Figure 2) that is used at run-time to
select members of the collection. Instances of the Indexed Collection
class are presented to an end-user as a query interface, while instances of
Non-Indexed Collection are presented as a full list of members.
The InfoHarness class hierarchy is open in that new terminal classes may be
defined to accommodate the vast variety of information. In the future, new
terminal classes will be defined from abstract classes using the type
definition component of IRDL. Currently, the new types are defined by
customizing methods inherited from abstract classes. The abstract classes are
permanent and form the core of our hierarchy, whereas the terminal classes are
defined when necessary to meet specific requirements.
Similarly, new indexing strategies are supported by defining new terminal
classes. The only additional complication is in maintaining the mapping
between two member identifications: the one known to the indexing algorithm
and the one known to InfoHarness (Figure 2). It is fairly
easy to support WAIS, LSI, and similar technologies by defining appropriate
terminal classes.
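As a rough illustration of defining terminal classes by customizing inherited methods, consider the following Python sketch. It is an assumption-laden analogy rather than the InfoHarness implementation: the method names (viewer_command, content_type, describe_access), the viewer invocation, the MIME type, and the man page path are all chosen for this example. It shows the man page case mentioned earlier, where the same data type can be defined under either Server Formatted Data or Client Formatted Data without touching the abstract classes.

```python
from abc import ABC, abstractmethod
from typing import List


class ServerFormattedData(ABC):
    """Abstract class: data is rendered by an external viewer on the server."""

    @abstractmethod
    def viewer_command(self, location: str) -> List[str]:
        """Return the external viewer invocation for the data at `location`."""

    def describe_access(self, location: str) -> str:
        # Shared method, inherited unchanged by every terminal class.
        return "server runs: " + " ".join(self.viewer_command(location))


class ClientFormattedData(ABC):
    """Abstract class: data is transferred first and processed on the client."""

    @abstractmethod
    def content_type(self) -> str:
        """Return the type label the client uses to pick a viewer."""

    def describe_access(self, location: str) -> str:
        return f"client receives {location} as {self.content_type()}"


class ServerManPage(ServerFormattedData):
    """Terminal class: man pages formatted on the server."""

    def viewer_command(self, location: str) -> List[str]:
        return ["nroff", "-man", location]


class ClientManPage(ClientFormattedData):
    """Terminal class: the same data type, handed to the client instead."""

    def content_type(self) -> str:
        return "application/x-troff-man"


page = "/usr/man/man1/ls.1"   # hypothetical location
server_side = ServerManPage().describe_access(page)
client_side = ClientManPage().describe_access(page)
```

Each terminal class overrides a single method, which is the sense in which the abstract classes can stay permanent while terminal classes are added as needed.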
This section describes the creation of InfoHarness repositories by generating
metadata entities based on both user requirements and physical data. As
defined in Section 2.1, each InfoHarness Repository, or
IHR, is a set of IHOs, some of which encapsulate physical data, and some of
which represent sets of other IHOs. The creation of an IHR amounts to the
generation of metadata entities that represent IHOs and to the indexing of
physical information encapsulated by members of indexed collections.
The IHR creation is controlled by the IRDL Interpreter, which evaluates
IRDL statements and issues commands for the Generator. The latter
produces metadata entities that encapsulate and group together physical data.
In Section 3.1, we briefly discuss the metadata Generator
and its commands. In Section 3.2, we introduce the
specification of IRDL, which supports writing simple and concise definitions
of information repositories. In Section 3.3, we discuss
examples of applying IRDL to C programs.
In this section we discuss the Generator commands that control the creation of
metadata entities, which either encapsulate physical data, or group other
encapsulations into sets. In addition, the Generator is responsible for the
creation of physical indices that reference members of indexed collections.
There are only three different Generator commands:
- The encapsulate command requires information about type
and location of physical data. It returns a set of IHOs, each of which
encapsulates a piece of data. Boundaries of these pieces are determined by
the type. For example, an encapsulation command may refer to the type
rmail and the location of an RMAIL file. The output in this
example is a set of IHOs, each of which is associated with a separate mail
message.
- The group command requires a set of pointers to individual IHOs and,
optionally, the desired type of the index. The command generates an IHO
associated with the collection, as well as parent-child and child-parent
relationships between the collection IHO and each member of the input set.
The optional type parameter determines the indexing technology to be used for
indexing physical data associated with member IHOs. If the type is not
specified, no index is created.
- The merge command requires an IHO and a set of references to
additional IHOs. It produces a composite IHO that encapsulates the same
physical data as the input IHO and contains the mentioned set of references.
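The three commands can be sketched in Python as follows. This is a minimal model under stated assumptions, not the Generator itself: IHOs are plain dictionaries, TYPE_LIBRARY and the field names are hypothetical, the rmail splitter is a stand-in that does not read any file, and no physical index is built (the index type is only recorded).

```python
import itertools
from typing import Any, Callable, Dict, List, Optional

IHO = Dict[str, Any]
_oids = itertools.count(1)

# Hypothetical type library: a type maps a location to the pieces it contains.
TYPE_LIBRARY: Dict[str, Callable[[str], List[str]]] = {}


def _new_iho(**fields: Any) -> IHO:
    iho: IHO = {"oid": f"iho{next(_oids)}", "children": [], "parents": []}
    iho.update(fields)
    return iho


def encapsulate(data_type: str, location: str) -> List[IHO]:
    """One IHO per piece; the type decides where the piece boundaries are."""
    pieces = TYPE_LIBRARY[data_type](location)
    return [_new_iho(type=data_type, location=location, piece=p) for p in pieces]


def group(members: List[IHO], index_type: Optional[str] = None) -> IHO:
    """Collection IHO plus parent-child and child-parent links; index is optional."""
    collection = _new_iho(type="collection", index=index_type)
    for member in members:
        collection["children"].append(member["oid"])
        member["parents"].append(collection["oid"])
    return collection


def merge(iho: IHO, references: List[IHO]) -> IHO:
    """Composite IHO: same physical data as `iho`, plus a set of references."""
    composite = dict(iho)
    composite["parents"] = list(iho["parents"])
    composite["children"] = [ref["oid"] for ref in references]
    return composite


def split_rmail(location: str) -> List[str]:
    # Stand-in: a real method would read the RMAIL file and find message boundaries.
    return ["message-1", "message-2", "message-3"]


TYPE_LIBRARY["rmail"] = split_rmail
messages = encapsulate("rmail", "/hypothetical/RMAIL")
inbox = group(messages, index_type="WAIS")
```

Here encapsulate returns one IHO per mail message, and group links every message to the new collection in both directions.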
The Generator does not have any flow control commands because control flow is
determined by the Interpreter. It is possible to use the Generator on its own,
but to manually write down all commands for even a simple application would be
impractical. In reality, the Generator commands are produced by the IRDL
Interpreter.
The primary objective of IRDL is to provide help in building information
repositories. It combines the usual features of a structured programming
language with the non-procedural support for data encapsulation, set
operations, and content-based indexing. In this section we describe the
proposed language constructs and explain the rationale behind them.
Section 3.3 will then provide some concrete examples
of using IRDL to create repositories of C programs.
Any IRDL program is a sequence of declarations followed by a sequence of
statements. A legal program must have at least one declaration and one
statement.
<program> := begin; <body> end;
<body> := <declarations> <statements>
<declarations> := <declaration> | <declaration> <declarations>
<statements> := <statement> | <statement> <statements>
A declaration may be a type declaration or a variable declaration.
<declaration> := <type_declaration> | <variable_declaration>
Type declarations are required for both collection types and data types.
Collection types are associated with particular indexing technologies (WAIS,
LSI, or user-defined). The declaration of data types (e.g., TXT, C) helps to
utilize proper data encapsulation methods. Of course, all declared types must
be available from the type library.
<type_declaration> := COLLTYPE <collection_types>; | DATATYPE <data_types>;
<collection_types> := <collection_type> | <collection_type>,<collection_types>
<collection_type> := SET | WAIS | LSI | <user_defined_type>
<data_types> := <data_type> | <data_type>,<data_types>
In variable declarations, the SET qualifier determines whether the
interpreter returns a handle to a single element or to a set of elements. If
the handle identifier represents a set, it may be iterated upon using
iteration statements that are discussed later in this section. The language
supports two built-in types: IHO and STRING.
At this time, values of individual attributes are always treated as strings,
therefore, we did not introduce any arithmetic types.
<variable_declaration> := VAR [SET] <built-in> <vars>;
<built-in> := IHO | STRING
<vars> := <var> | <var>,<vars>
<var> := <item> | <set>
<item> := <iho> | <string>
<set> := <iho_set> | <string_set>
A statement may be a variable assignment, a set iteration, or input/output.
Assignments may be performed on individual elements and sets of elements.
Selective access to individual set elements is provided by forall
statements. Simple input/output directives promote the reuse of existing
metadata entities.
<statement> := <assign> | <forall> | <input_output>
Assignment statements support the assignment of individual items and sets of
items, where an item is either an IHO or a string. It is only possible to
assign set expressions to set variables and individual items to item variables.
IHO expressions may only be assigned to IHO variables and string expressions
may only be assigned to string variables.
<assign> := <var> = <expression>;
<expression> := <item_expression> | <set_expression>
<item_expression> := <item> | <iho_expression> | <string_expression>
<set_expression> := <set>
| <iho_set_expression>
| <string_set_expression>
| <set_union>
| <set_intersection>
As defined in Section 2.1, the encapsulation of physical
data is performed through information units, which are contained within simple
and composite InfoHarness objects. The encapsulation process is controlled by
type methods, where the types represent the desired interpretation of physical
data. For example, a C file may be treated either as a text file or as a C
program. The encapsulation is performed through the encapsulate
statement of the language. It requires the desired interpretation, and either
the location of physical data or a previously defined IHO (of different type),
which encapsulates this physical data.
While simple objects may be created directly from information units,
a collection object may only be created from a set of objects. Creating
a composite object requires both an information unit and a set of other IHOs.
Consider the example of representing a C program. One alternative is to
create composite IHOs that encapsulate C files and reference simple IHOs,
which encapsulate individual functions that occur in these files.
The creation of composite objects is performed through the combine
statement that takes an IHO and a set and returns a composite object.
<iho_set_expression> := <encapsulate>
<iho_expression> := <combine_expression> | <index_expression>
<encapsulate> := ENCAP[SULATE] <data_type> <location>
| ENCAP[SULATE] <data_type> <iho>
<combine_expression> := COMBINE <iho> <iho_set_expression>
As defined in Section 2.1, a collection IHO does not
directly encapsulate physical data, but instead contains references to other
IHOs. If the collection IHO is indexed, it contains information about the index
location and about the proper query method. The creation of collection IHOs is
performed by the index operation that takes the collection type (LSI,
WAIS, etc.), a set of IHOs, and the desired location.
<index_expression> := INDEX <collection_type> <iho_set_expression> <location>
String expressions may be string constants, attribute expressions, or legal
Perl [wal91] strings. Attribute expressions
extract values of IHO attributes.
<string_expression> := <string_constant>
| <attribute_expression>
| <Perl_string_expression>
<string_set_expression> := <attribute_set_expression>
<attribute_expression> := ATTR <iho> <string>
<attribute_set_expression> := ATTR <iho_set_expression> <string>
Given the provision for sets in the language, there has to be a way to compute
set unions and intersections. The union operation may be used to merge two
sets, to add an item to a set, or to convert a single item into a one-item set.
The intersection operation may be used to compute common members of two sets.
<set_union> := { <item> [,<set_expression>] }
| { <set_expression>,<set_expression> }
<set_intersection> := { <set_expression>&<set_expression> }
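The three uses of the union operation and the intersection operation can be mirrored with Python frozensets. This is only an analogy under the assumption that items are represented by strings; the function names and the sample identifiers are made up for the example.

```python
from typing import FrozenSet, Union

Item = str
ItemSet = FrozenSet[Item]


def union(first: Union[Item, ItemSet], second: ItemSet = frozenset()) -> ItemSet:
    """{ item }, { item, set } and { set, set } all reduce to one union."""
    left = first if isinstance(first, frozenset) else frozenset([first])
    return left | second


def intersection(left: ItemSet, right: ItemSet) -> ItemSet:
    """{ set & set }: the members common to both sets."""
    return left & right


functions = frozenset({"parse", "emit"})
one_item = union("main")                                 # item -> one-item set
extended = union("main", functions)                      # add an item to a set
merged = union(functions, frozenset({"emit", "scan"}))   # merge two sets
common = intersection(extended, merged)                  # shared members
```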
Selective access to set members is provided by the forall statement of
the language. The such that clause of the statement supports selectivity
by excluding members that do not meet the boolean combination of conditions.
Each condition may be defined as either a logical comparison or a pattern
matching operation. Regular expressions have their usual definition and are
not further explained.
<forall> := FORALL <item> IN <set> [SUCH THAT <conditions>] {<statements>}
<conditions> := <condition> | (<conditions>) | <condition> <bool_op> <conditions>
<condition> := <item> <comp_op> <item> | <string> <match_op> <regular_expr>
<bool_op> := AND | OR | NOT
<comp_op> := == | !=
<match_op> := =~ | !~
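The selection performed by the such that clause can be approximated in Python as a filtered iteration. The sketch assumes that the match operator behaves like a Perl-style search anywhere in the string; the attribute names (name, owner) and the sample values are hypothetical.

```python
import re
from typing import Callable, Dict, Iterable, Iterator

IHO = Dict[str, str]


def matches(value: str, pattern: str) -> bool:
    """Counterpart of `=~`; assumes Perl-style search anywhere in the string."""
    return re.search(pattern, value) is not None


def forall(items: Iterable[IHO],
           such_that: Callable[[IHO], bool] = lambda item: True) -> Iterator[IHO]:
    """Yield only the members that satisfy the SUCH THAT condition."""
    for item in items:
        if such_that(item):
            yield item


function_ihos = [
    {"name": "parse_args", "owner": "kjshah"},
    {"name": "parse_file", "owner": "shklar"},
    {"name": "main", "owner": "kjshah"},
]

# FORALL f IN function_ihos SUCH THAT owner == "kjshah" AND name =~ "^parse_"
selected = [f["name"] for f in forall(
    function_ihos,
    lambda f: f["owner"] == "kjshah" and matches(f["name"], r"^parse_"),
)]
```

Omitting the condition iterates over every member, which corresponds to a forall statement without a such that clause.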
The role of the input/output statements is to support the reading and writing
of intermediate results and the writing of generated metadata entities.
<input_output> := <write> | <read>
<write> := WRITE <vars> [<location>];
<read> := READ <vars> [<location>];
To better understand the language, consider the example of creating
a repository of C programs. As noted in Section 2.1,
the physical data associated with generated metadata entities is not part
of the repository. As explained in Section 3.1, statements
of IRDL that implement the repository get translated into Generator
commands, which, in turn, generate metadata entities. In addition, the
interpretation of index statements results in the generation of
independent indices from the encapsulated physical data.
We discuss two alternative ways of representing C programs
(Figures 4a and 4b). In both cases we encapsulate C
files, as well as individual functions. We impose a specific interpretation
on physical data by selecting a type, which in effect makes
physical objects instances of particular terminal classes
(Section 2.2).
Each class has a method responsible for separating portions of physical data
that correspond to individual information units. In the case of the C class,
this method separates individual functions and names them by their signatures.
To implement the repository whose structure is shown in
Figure 4a, we need to perform the following steps:
1. For each C file do the following:
   1.1. Create simple IHOs that encapsulate individual functions that occur in
        this file.
   1.2. Create a composite IHO that encapsulates the file and points to IHOs
        created in step 1.1.
2. Create an indexed collection of the composite IHOs created in step 1, using
   LSI for indexing physical data.
The IRDL program that implements the structure in
Figure 4a is shown in Figure 5.
BEGIN
    COLLTYPE LSI;
    DATATYPE TXT, C;
    VAR IHO: File_IHO, LSI_Collection;
    VAR SET IHO: File_IHO_SET, Function_IHO_SET;
    File_IHO_SET = ENCAP TXT "/u/kjshah/test/src";
    FORALL File_IHO IN File_IHO_SET
    {
        Function_IHO_SET = ENCAP C File_IHO;
        File_IHO = COMBINE File_IHO Function_IHO_SET;
        WRITE File_IHO, Function_IHO_SET;
    }
    LSI_Collection = INDEX LSI File_IHO_SET "/u/kjshah/db/c";
    WRITE LSI_Collection;
END
Figure 5. Sample IRDL program for the structure in Figure 4a
We begin by encapsulating C files as if they were just plain text. The
generator's encapsulation command invokes the TXT type's data
separation method, which associates each file in the /u/kjshah/test/src
directory with an information unit and creates a simple IHO for each unit.
The result is assigned to the set variable File_IHO_SET and iterated
over in the forall statement.
By abuse of notation, we will say that a function occurs in an IHO if it
occurs in the file encapsulated by the IHO. For each IHO in the set, we first
encapsulate individual functions that occur in this IHO, then convert the
file IHO into a composite object by combining it with the set of newly
created function IHOs, and use the write statement to output results.
Finally, we use the index statement that gets translated into the
Generator group command. This command first creates a collection
object that references composite IHOs that encapsulate individual files,
and then builds a full-text index of these encapsulated files using the LSI
technology. The write statement is once again used to output the
results.
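To make the translation concrete, the following Python sketch lists the Generator commands that an interpreter might issue for the program in Figure 5. It is an illustration under assumptions: the mapping of combine to the merge command is inferred from Section 3.1, the command tuples are an invented notation, the write statements are ignored, and the file names main.c and util.c are hypothetical.

```python
from typing import Dict, List, Tuple

Command = Tuple[str, ...]


def commands_for_figure_4a(source_dir: str,
                           functions_by_file: Dict[str, List[str]],
                           index_type: str = "LSI") -> List[Command]:
    """List the Generator commands an interpreter might issue for Figure 5."""
    # File_IHO_SET = ENCAP TXT "<source_dir>";
    commands: List[Command] = [("encapsulate", "TXT", source_dir)]
    for path, functions in functions_by_file.items():
        # Function_IHO_SET = ENCAP C File_IHO;
        commands.append(("encapsulate", "C", path))
        # File_IHO = COMBINE File_IHO Function_IHO_SET;
        commands.append(("merge", path, *functions))
    # LSI_Collection = INDEX LSI File_IHO_SET "<index location>";
    commands.append(("group", index_type))
    return commands


# Hypothetical directory contents: file path -> functions defined in it.
listing = {
    "/u/kjshah/test/src/main.c": ["main"],
    "/u/kjshah/test/src/util.c": ["trim", "split"],
}
trace = commands_for_figure_4a("/u/kjshah/test/src", listing)
```

For n files the trace contains 2n + 2 commands, which suggests why writing Generator commands by hand quickly becomes impractical.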
To implement the repository whose structure is shown in
Figure 4b, we need to perform steps that are similar
to those for the structure in Figure 4a:
1. For each C file do the following:
   1.1. Create simple IHOs that encapsulate individual functions that occur in
        this file.
   1.2. Convert function IHOs into composite objects by combining them with
        one-element sets that contain the file IHO.
2. Create an indexed collection of the composite IHOs from step 1, using LSI
   for indexing physical data.
The IRDL program that implements the structure in
Figure 4b is shown in Figure 6.
BEGIN
    COLLTYPE LSI;
    DATATYPE TXT, C;
    VAR IHO: File_IHO, Function_IHO, LSI_Collection;
    VAR SET IHO: File_IHO_SET, Function_IHO_SET, Join_IHO_SET;
    File_IHO_SET = ENCAP TXT "/u/kjshah/test/src";
    FORALL File_IHO IN File_IHO_SET
    {
        Function_IHO_SET = ENCAP C File_IHO;
        FORALL Function_IHO IN Function_IHO_SET
        {
            Function_IHO = COMBINE Function_IHO (File_IHO);
        }
        Join_IHO_SET = (Join_IHO_SET,Function_IHO_SET);
    }
    LSI_Collection = INDEX LSI Join_IHO_SET "/u/kjshah/db/c";
    WRITE LSI_Collection, Join_IHO_SET, File_IHO_SET;
END
Figure 6. Sample IRDL program for the structure in Figure 4b
This program is similar to the one in Figure 5. With
only a few changes we manage to represent the same physical data quite
differently. We start off again by encapsulating C files as if they were
just plain text and iterating over the individual IHOs. The difference is
that for each file IHO, we first encapsulate the occurring functions, and
then iterate over the function IHOs. Each function IHO is converted into
a composite object by combining it with the one-element set that contains
the file IHO. Composite function IHOs are accumulated in the
Join_IHO_SET set variable and then indexed using the LSI technology.
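The difference between the two representations can be summarized by the references each one produces. The Python sketch below is a simplification under assumptions: IHOs are identified by file and function names, the key LSI_Collection stands for the indexed collection, the file names are hypothetical, and function names are assumed to be unique across files.

```python
from typing import Dict, List

FunctionsByFile = Dict[str, List[str]]


def structure_4a(functions_by_file: FunctionsByFile) -> Dict[str, List[str]]:
    """Composite file IHOs reference function IHOs; the collection holds files."""
    references = {path: list(funcs) for path, funcs in functions_by_file.items()}
    references["LSI_Collection"] = list(functions_by_file)
    return references


def structure_4b(functions_by_file: FunctionsByFile) -> Dict[str, List[str]]:
    """Composite function IHOs reference their file; the collection holds functions."""
    references: Dict[str, List[str]] = {}
    for path, funcs in functions_by_file.items():
        for func in funcs:
            references[func] = [path]   # COMBINE Function_IHO (File_IHO)
    # The right-hand side is evaluated first, so it lists only function IHOs,
    # which corresponds to the accumulated Join_IHO_SET.
    references["LSI_Collection"] = list(references)
    return references


source = {"main.c": ["main"], "util.c": ["trim", "split"]}
by_file = structure_4a(source)
by_function = structure_4b(source)
```

With the first structure a query against the index returns files, from which functions can be reached; with the second it returns functions, each of which leads back to its file.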
In this section, we present an overview of work by other authors that we
consider relevant to our approach. In Section 4.1, we
take a look at different strategies of modelling heterogeneous information.
In Section 4.2, we briefly discuss some of the work
related to metadata representation, extraction, and organization.
In Section 4.3, we consider high-level modelling
languages in various domains.
There exist various methodologies for capturing the internal structure of
heterogeneous data [con87, hal87,
hal88, nie90,
tom91]. Different kinds of modelling strategies have
been tried to create, maintain and retrieve information from the structures
used by methodologies that range from conventional node-link models to
full-fledged object-oriented models [mau94]. The most
important problems may be summarized as follows:
- Creation of a new structure is a time-consuming task, especially for
individuals and institutions who want to create such systems themselves.
- These approaches don't always allow the user to keep the information in
its native format.
- Maintenance of such fixed linked systems becomes non-trivial as the size
of the information base grows. New items become more difficult to add and
integrity cannot be automatically guaranteed on deletions.
The conventional bare-bones models involve hand-crafting. They do not
scale up and are prone to various inconsistencies. Also, the
notion of a link as a physical entity forces users to interact with such
systems at a very low, counter-intuitive level. In some cases, users may even
have to be aware of the implementation details [dey90,
van88].
At the same time, modelling strategies at the other end of the spectrum are
quite heavyweight and unwieldy. Although these are more flexible and
intuitive, they are too complicated for naive users. Instead of being aware
of the implementation details, the user is forced to learn about the class
hierarchy and its properties and structure the data accordingly. In addition,
navigating such models is relatively complex.
Another approach has the semantic database perspective
[bey94, abr76,
fro82]. Here, structures are stored as sequences of
binary relations. The advantages, which include ease of navigation, the
ability to incorporate seamlessly large amounts of data, and strict
integrity checking, are offset by the fact that in most practical cases,
the semantic net is difficult to follow by anyone other than the creator.
Also, if more than one person is responsible for maintaining the information
base, they all have to be semantically consistent.
The Manchester Multimedia Information System [odo91]
employs object-oriented techniques to build and represent images and other
media. Aggregation and generalization in building representations of
information bases [har91] are aimed at discovering semantic relationships
based on existing structures and not at employing these structures directly.
Similarly, ACE (Aggregation Clustering with
Exceptions) [har91] analyzes existing structures to
derive higher level structures but has no provisions for their construction.
Graph grammars [ehr90] have also been used to build
information structures. However, some serious problems have to be resolved
before they can be employed as practical languages for large scale structure
building. One of the problems is determining how subgraphs are to be
"glued" into the original graph. Another is the non-determinism of
grammar derivations. The GOOD visual transformation language
[gys94] borrows heavily from graph grammars and avoids
the unresolved problems, at the cost of limited expressive power. To make
matters worse, implementing the GOOD language
[gem93] incurs a great deal of overhead and
itself requires a database management system.
Metadata is being used increasingly by researchers in multimedia and in text
and structured databases as an aid in the quest for seamless interoperability.
Kiyoki et al. [kiy94] implement a semantic associative
search for images based on the keyword metadata representing the user's
impression and the image's content. Anderson and Stonebraker
[and94] have developed a metadata schema for satellite
images. Jain and Hampapur [jai94] have proposed an
intermediate representation for audio-visual information.
In InfoHarness, we have emphasized the extensive use of automatically
generated metadata. Chen et al. [che94] define metadata
as derived properties of the media that are useful for information access or
retrieval. Bohm and Rakow [boh94] have classified
metadata according to its nature and related each kind to its
intended purpose. They have also drawn a distinction between the metadata
and its organization. The same perspective is reflected in the hierarchical
organization of InfoHarness objects and classes as illustrated in
Figure 3.
InfoHarness shares some of its objectives with the RUFUS system
[sho93]. RUFUS has an extensible object-oriented data
model, storage system, and associated search and display methods for
a variety of predefined file types. The system automatically classifies data
files and extracts type-specific attributes. RUFUS users can search, browse,
filter and display previously analyzed data. However, there is no support
for generating semantic graph structures as in Section 3.3. The corresponding
metadata extraction process in InfoHarness does not deal with files but with
information units, which may be associated with files, sets of files, or
portions of these files. The latter approach is more flexible, providing
finer control over interpreting the same data differently, depending on the
local objectives.
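The distinction between files and information units can be illustrated with a
small sketch. It is not InfoHarness code and it is not IRDL: the class names
`Extent` and `InformationUnit`, their fields, and the byte-offset addressing
are hypothetical, chosen only to show how one logical unit might refer to a
whole file, to several files, or to a portion of a file without copying or
restructuring the underlying data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Extent:
    """A byte range within one file (hypothetical); end=None means 'to end of file'."""
    path: str
    start: int = 0
    end: Optional[int] = None


@dataclass
class InformationUnit:
    """Hypothetical metadata record: a typed logical unit over physical data."""
    unit_type: str
    extents: List[Extent]

    def read(self) -> bytes:
        """Concatenate the bytes of every extent; the files are only read."""
        chunks = []
        for ext in self.extents:
            with open(ext.path, "rb") as f:
                f.seek(ext.start)
                # A size of -1 tells read() to continue to the end of the file.
                size = -1 if ext.end is None else ext.end - ext.start
                chunks.append(f.read(size))
        return b"".join(chunks)
```

Under these assumptions, the same mailbox file could be described once as a
single "mailbox" unit covering the whole file and again as a series of
"message" units with different extents, with no change to the file itself.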
Harvest [bow94, sch94] is
a system that gathers and indexes information from multiple heterogeneous
sources. Its main goal is to support the creation and use of topic-specific
information providers. Harvest's Essence subsystem performs the
extraction of metadata that contains content summaries and type-dependent
attributes. Harvest focuses more on resource discovery and access than on
heterogeneous data modelling. We feel that Harvest's information discovery
techniques may be combined beneficially with InfoHarness' data encapsulation
approach and methods of providing flexible control over the logical structure
of information repositories.
Various modelling languages have been implemented for different domains. In
particular, there has been a lot of work in network modelling, including
AMPL
[fou87], GAMS
[ken87], LINGO [lin91], and
LPL [hur89]. These languages and model generators
take an algebraic specification as input and generate a model. As in
InfoHarness, a model is created based on the algebraic specification (IRDL
script) and the existing data. The algebraic specification proved to be too
rigorous for general use, so the GNGEN system
[for94] provides a high level language for model
creation. Like IRDL, this language provides constructs for specifying nodes
and relationships, but requires much more detail, since arc weights,
node costs, etc. must be specified.
SHSML [tay93] is a hybrid systems modelling
language necessitated by the development of embedded real-time software,
which required modelling and evaluation. This language provides specifications
for different software modules, along with their input and output interfaces.
Modules may be created out of sub-modules, making it possible to view the same
software artifacts through different relationship models. However, unlike
IRDL, no distinctions may be made between different modules and only strict
tree structures may be modelled.
CML [fal94] is a declarative, compositional
modelling language for logically specifying the symbolic and mathematical
properties of the structure and behavior of physical systems. This language
has a LISP-style syntax and is used to define model fragments (transistors,
resistors etc.), which are used to build domain theories. For example,
an amplifier may be modelled by composing these fragments into a graph
structure. Expressive enough to denote fragments and their relationships,
CML may also be used to denote causal relationships, initial values,
qualitative differences, etc.
As mentioned in the introduction, the primary objective of InfoHarness is to
provide integrated, rapid, and transparent access to large amounts of
heterogeneous information without it being relocated, restructured or
reformatted in any way. We address this challenge by building
meta-representations of the original information. To simplify this process,
we have introduced IRDL, which is the primary focus of this paper. IRDL
combines the flexibility of object encapsulation with the power and
convenience of a simple high-level declarative language. IRDL statements,
together with information contained in the physical data, determine
the structure of InfoHarness repositories.
In this paper, we have distinguished between the structure definition and the
type definition components of the language. We have concentrated on the
structure definition component under the assumption that all type methods
are available from an ad-hoc type library. Our future work is directed
towards the type definition component of the language. We are also
investigating ways to improve the scalability of search by combining
results of queries against independent indices. The initial public release of
InfoHarness on the Internet is planned for the first quarter of 1995.
The authors wish to thank Vipul Kashyap, Bob Mowry, Amit Sheth, Satish Thatte,
Gomer Thomas, and Andrew Werth for comments and suggestions.
- References:
- [abr76]
- J.R. Abrial, "Data semantics, database management", Proc. of
IFIP TC2 Conf. (1976) Cargese, Corsica, North-Holland pp 1-59.
- [and94]
- J. Anderson and M. Stonebraker, "SEQUOIA 2000 Metadata schema for
Satellite Images", (to appear) SIGMOD Record, special issue on
Metadata for Digital Media, December 1994.
- [boh94]
- K. Bohm and T. Rakow, "Metadata for Multimedia Documents",
(to appear) SIGMOD Record, special issue on Metadata for Digital Media,
December 1994.
- [bow94]
- C.M. Bowman, P.B. Danzig, D.R. Hardy, U. Manber, and M.F. Schwartz,
"Harvest: A scalable, customizable discovery and access system",
TR CU-CS-732-94, Department of Computer Science, University of
Colorado - Boulder.
- [che90]
- Y-F. Chen, M. Nishimoto and C. Ramamoorthy, "The C information
abstraction system", IEEE Transactions on Software Engineering,
March 1990.
- [che94]
- F. Chen, M. Hearst, J. Kupiec, J. Pedersen and L. Wilcox, "Metadata
for Mixed-Media Access", (to appear) SIGMOD Record, special issue on
Metadata for Digital Media, December 1994.
- [con87]
- E. J. Conklin, "Hypertext: an introduction and survey",
IEEE Computer Vol 20 No 9 (September 1987) pp 17-41.
- [dee90]
- S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer and R. Harshman,
"Indexing by Latent Semantic Analysis", Journal of the American
Society for Information Science, 41(6), 1990.
- [dey90]
- L. De Young, "Links considered harmful", Proc. ECHT'90,
Versailles, Cambridge University Press (November 1990) pp 238-249.
- [ehr90]
- H. Ehrig, H.-J. Kreowski and G. Rozenberg, "Graph-Grammars and their
applications to Computer Science", Lecture Notes in Computer Science
532, 1990.
- [fal94]
- B. Falkenhainer, A. Farquhar, D. Bobrow, R. Fikes, K. Forbus, T. Gruber,
Y. Iwasaki and B. Kuipers, "CML: A Compositional Modeling
Language", Draft.
- [fis91]
- G. Fischer and C. Stevens, "Information access in complex, poorly
structured information spaces", Proceedings of the 1991 CHI
Conference, 1991.
- [for94]
- M. Forster and P. Mevert, "A tool for network modeling",
European Journal of Operational Research 72 (1994) 287-299.
- [fou87]
- R. Fourer, D.M. Gay and B.W. Kernighan, "AMPL: A Mathematical
Programming Language", Technical Report 87-03 Department of
Industrial Engineering and Management Sciences, Northwestern University,
Evanston, IL, 1987/89.
- [fro82]
- R.A. Frost, "Binary-relational data structures", Computer
Journal Vol 25 No 3 (1982) pp 358-367.
- [gar93]
- F. Garzotto, P. Paolini, and D. Schwabe, "HDM - A Model-Based
Approach to Hypertext Application Design", ACM Transactions on
Information Systems, 11(1), 1993.
- [gem93]
- M. Gemis, J. Paredaens and I. Thyssens, "A visual database management
interface based on GOOD", Interfaces to Database Systems, Workshops
in Computing, New York: Springer-Verlag, 1993, pp 155-175.
- [gro94]
- W. Grosky, F. Fotouhi and I. Sethi, "Content-Based
Hypermedia - Intelligent Browsing of Structured Media Objects",
(to appear) SIGMOD Record, special issue on Metadata for Digital Media, December 1994.
- [gys94]
- M. Gyssens, J. Paredaens, J. Van den Bussche and D. Van Gucht,
"A graph oriented object database model", Knowledge and Data
Engineering Vol 6 No 4 (1994) pp 572-579.
- [hal87]
- F. Halasz, T. Moran and R. Trigg, "Notecards in a nutshell",
Proc. ACM CHI'87 (1987) pp 45-52.
- [har91]
- Y. Hara, A. Keller and G. Wiederhold, "Implementing hypertext
relations through aggregations and relations", Proc. Hypertext 91,
pp 75-90.
- [hsu91]
- C. Hsu, "The Meta-database Project at Rensselaer",
SIGMOD Record, special issue on Semantic Issues in Multidatabases,
20(4), December 1991.
- [hur89]
- T. Hurlimann, "Reference manual for the LPL modeling language
(Version 3.1)", Institute for Automation and Operations Research,
University of Fribourg, 1989.
- [jai94]
- R. Jain and A. Hampapur, "Representations for Video Databases",
(to appear) SIGMOD Record, special issue on Metadata for Digital Media,
December 1994.
- [kah91]
- B. Kahle and A. Medlar, "An Information System for Corporate Users:
Wide Area Information Service", Connexions -
The Interoperability Report, 5(11), November 1991.
- [ken87]
- D. Kendrick and A. Meeraus, "GAMS. An introduction",
Development Research Department. The World Bank, 1987.
- [ker83]
- L.D. Kerschberg, D. Marchand and A. Sen, "Information system
integration: a metadata management approach", Proc. of the Fourth
International Conference on Information Systems, Houston, TX, 1983,
pp 223-239.
- [kiy94]
- Y. Kiyoki, T. Kitagawa and T. Hayama, "A meta-database System for
Semantic Image Search by a Mathematical Model of Meaning", (to appear)
SIGMOD Record, special issue on Metadata for Digital Media,
December 1994.
- [lin91]
- LINDO systems, Inc., "LINGO optimization modeling language",
Chicago, IL, 1991.
- [mat93]
- C.J. Matheus, P.K. Chan, and G. Piatetsky-Shapiro, "Systems for
Knowledge Discovery in Databases", IEEE Transactions on Knowledge
and Data Engineering, December 1993.
- [mau94]
- H. Maurer, N. Scherbakov, K. Andrews and P. Srinivasan,
"Object-oriented modelling of hyperstructure: overcoming the static
link deficiency", Information and Software Technologies, June 1994,
pp 315-322.
- [nie90]
- J. Nielsen, "Hypertext and Hypermedia", Academic Press
(1990).
- [odo91]
- M. O'Docherty and C. Daskalakis, "Multimedia information systems
- the management and semantic retrieval of all semantic data types",
Computer J. Vol 34 No 3 (1991) pp 225-238.
- [seg87]
- A. Segev and A. Shoshani, "Logical modeling of temporal data",
SIGMOD quarterly (1987).
- [shk94]
- L. Shklar, S. Thatte, H. Marcus, and A. Sheth, "The InfoHarness
Information Integration Platform",
http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/shklar/shklar.html
- [sho93]
- K. Shoens, A. Luniewski, P. Schwarz, J. Stamos, and J. Thomas,
"The Rufus System: Information Organization for Semi-Structured Data",
Proceedings of the 19th VLDB Conference, Dublin, Ireland, 1993.
- [str88]
- L. A. Streeter and K. E. Lochbaum, "Who knows: a system based on
automatic representation of semantic structure", Proceedings of RIAO
88: User-oriented context-based text and image handling, Massachusetts
Institute of Technology, Cambridge, MA, 1988, pp.379-388.
- [sch94]
- M.F. Schwartz, C.M. Bowman, P.B. Danzig, D.R. Hardy, and U. Manber,
"The Harvest Information Discovery and Access System",
http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/schwartz.harvest/schwartz.harvest.html
- [tay93]
- J.H. Taylor, "Toward a Modelling Language Standard for Hybrid
Dynamical Systems", Proc. of the 32nd Conference on Decision and
Control, Texas, Dec 1993.
- [tom91]
- I. Tomek, S. Khan, T. Muldner, M. Nassar, G. Novak and P. Proszynski,
"Hypermedia - Introduction and Survey", J. of MCA Vol 14 No 2
(April 1991) pp 63 - 103.
- [van88]
- A. van Dam, "Hypertext '87 Keynote Address", Comm.
ACM Vol 31 No 7 (July 1988) pp 887-895.
- [wal91]
- L. Wall and R.L. Schwartz, "Programming Perl",
O'Reilly and Associates (1991).
- [yan88]
- N. Yankelovich, B. Haan, N. Meyrowitz and S. Drucker, "Intermedia:
The concept and construction of a seamless information environment",
IEEE Computer, 21(1), January 1988.
Footnotes
- (TM)
- InfoHarness is a trademark of Bell Communications Research, Inc.