Kinji Yamasaki
Benay Dara-Abrams
Sholink Corporation
{kinji,benay}@sholink.com
Now that we are in the midst of a transition from an industrial society to an information-driven society, it is necessary to understand that the importance of information lies not only in the information itself, but also in the processes that deal with it. Today's information frameworks focus strongly on the collection of data, but neglect the relevance of that data. This paper presents a framework for dealing with three problems: 1) knowledge domain integration: screening and transformation of data into information, and of information into knowledge; 2) knowledge domain interoperability: obtaining and sharing knowledge and knowledge-related tasks in a distributed environment; and 3) knowledge domain transaction: nested and parallel transactions of information between organizations. This framework places strong emphasis not only on automated knowledge processing, but also on the necessary human intervention in the processes themselves. We call this combination of automation and human intervention in knowledge processing "information mediation".
There is a major difference between raw data and information, however. In [1] and [2], information is defined in terms of bits:
Many of the commercial information management frameworks in existence today do not capture these differences between data and information, or between information and knowledge. Most of the frameworks that do are AI driven, and AI-driven Knowledge Based Systems (KBS) present another set of problems. Most AI systems are complex in nature, because a high degree of complexity is involved in building a system that exercises human-like judgment: deciding what is data, what is information, and what is knowledge. This complexity often results in occasional or frequent incorrect judgments: data will sometimes be presented as information, or even as knowledge.
The framework proposed in this paper addresses not only the differences between data, information and knowledge, but also other information management functionalities such as interoperability and transactions. Moreover, this model, called Information Mediation, avoids the complexity presented by AI models by introducing human elements into the information management process. It builds on top of the knowledge domain models presented in [3], but with a considerable amount of revision and extension. In section 2 we discuss the overall architecture of information mediation. In section 3 we review some issues in the implementation, and in section 4 we focus on current and future work.
The Knowledge Domain Integration component differentiates among data, information and knowledge. Traditional AI systems implement automated transaction systems to handle these differentiations, along with the necessary conversions and rejections. This approach works as long as the data to be examined follow static patterns, which is rarely the case today, when compound documents with widely varying layouts are commonplace. In the Information Mediation framework, we introduce an automated workflow system whose components include automated processes and a human interface. The human interface allows information administrators to guide the system in the selection of data, storage of information and application of knowledge. The automated processes, as part of a workflow engine, execute programs that handle input data. These programs range from database tools and search engine indexers to full-blown third-party applications. The workflow system is built upon the WFMC (Workflow Management Coalition) specifications [4], but differs in certain areas of interoperability support and protocol APIs.
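As a rough illustration of the component contract such a workflow system might expose (the interface and class names below are our own sketch in Java, not part of the WFMC specifications or of the actual implementation), an automated process and a human-interface step can share a single interface:

    // Illustrative sketch only: one component contract shared by automated
    // processes and human-interface steps. Names are hypothetical.
    import java.util.Map;

    interface WorkflowComponent {
        // Invoked by the workflow engine with the input data and the process state.
        ComponentResult execute(byte[] inputData, Map<String, String> processState);
    }

    // An automated process: hands the data to an external tool such as an indexer.
    class IndexerComponent implements WorkflowComponent {
        public ComponentResult execute(byte[] inputData, Map<String, String> processState) {
            // ... run a database tool or search engine indexer over inputData ...
            return ComponentResult.completed("indexed");
        }
    }

    // A human-interface step: suspends the process until an administrator responds.
    class AdministratorReviewComponent implements WorkflowComponent {
        public ComponentResult execute(byte[] inputData, Map<String, String> processState) {
            // Present the data through a GUI and wait for the administrator's verdict.
            return ComponentResult.suspended("awaiting-review");
        }
    }

    class ComponentResult {
        final String status;
        private ComponentResult(String status) { this.status = status; }
        static ComponentResult completed(String s) { return new ComponentResult(s); }
        static ComponentResult suspended(String s) { return new ComponentResult(s); }
    }

In this sketch the human-interface component computes nothing itself; it merely suspends the process until an administrator responds, which is where human intervention enters the otherwise automated flow.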
Knowledge Domain Interoperability allows distributed collaborative work and the sharing of information and knowledge. It is achieved through the workflow's distributed environment support, built on CORBA. Processes across organizations may participate as part of the workflow through a well-defined CORBA IDL interface to the workflow component API, and freely (or subject to access control) obtain information and knowledge owned by the central workflow system. This collaboration design differs from the traditional whiteboard collaboration model, in which active participation is required. The workflow engine in this case handles all collaborative activities and transactions, and no human intervention is required. This collaboration model not only promotes sharing of information and knowledge, but also supports distributed enrichment of data into information, or of information into knowledge. This process is examined in more detail in section 3.
The Knowledge Domain Transaction component allows for nested and interactive transactions between organizations' information services. This nested model creates parallelism in the data transaction process. It not only increases the speed of network transactions, but also allows more dynamic interactions between processes of the workflow system, such as information content negotiation. The component provides an abstraction over information warehouses across organizations, and creates a means to extract and store information.
In the above example we defined a static criterion, since it does not take user requirements into consideration. This type of criterion should be used only for first-level screening, not to decide what is useful to whom. Data collection systems such as AltaVista™ deal with this kind of data selection criterion. Smarter information systems incorporate dynamic criteria into their selection processes. An example of a more dynamic criterion could be: if the source is "ComputerWorld" and the subject is Java, it is useful; otherwise, it is not. Such a criterion filters out useless data, but still does not necessarily narrow the raw data down to information alone. Moreover, in a large organization this criterion would suit some people but not all. Having each member of the organization define their own information criteria would create chaos and unwanted complexity in our information integration system.
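As a minimal sketch of such a dynamic criterion (the class and field names are illustrative only, not part of any existing product), the ComputerWorld/Java rule above could be expressed as a simple predicate applied during first-level screening:

    // Illustrative sketch of a dynamic first-level selection criterion; names are ours.
    class DataItem {
        String source;   // e.g. "ComputerWorld"
        String subject;  // e.g. "Java"
        DataItem(String source, String subject) { this.source = source; this.subject = subject; }
    }

    class SelectionCriterion {
        // Useful only if the item comes from ComputerWorld and concerns Java.
        boolean isUseful(DataItem item) {
            return "ComputerWorld".equals(item.source) && "Java".equals(item.subject);
        }
    }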
This is one of the differences between the Information Mediation model and traditional AI systems or the information systems described in [3]. In the Information Mediation framework we introduce a human interface factor into the information and knowledge discovery and sharing process. Human intervention can be used to make the final decision on what is useful and what is not. This is especially true when applying information, thus treating it as knowledge. An example would be: if the information from the INS disproves a person's legal status, then reject that person's insurance application. Such decisions and applications of knowledge could probably be made with AI-driven knowledge-based systems, but such systems are complex in nature and at times inflexible, not to mention the potential lack of confidence in the decisions of a machine.
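A hypothetical sketch of this human-in-the-loop application of knowledge (the types and the review queue below are our own illustration, not the framework's actual interfaces): the system derives a recommendation from the INS lookup but leaves the final decision to an administrator.

    // Illustrative only: the system recommends an outcome; a human confirms it
    // before the knowledge is applied.
    class InsuranceApplication {
        String applicantId;
        boolean insStatusConfirmed;   // result of the INS information lookup
    }

    interface AdministratorQueue {
        void submitForReview(InsuranceApplication application, String recommendation);
    }

    class StatusRule {
        void apply(InsuranceApplication application, AdministratorQueue reviewQueue) {
            String recommendation = application.insStatusConfirmed ? "approve" : "reject";
            reviewQueue.submitForReview(application, recommendation);
        }
    }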
The Information Mediation framework uses a workflow engine to control data, information and knowledge related tasks. Business processes are managed by the engine, and each business process can, and often does, take on data from a particular source. A workflow engine consists of multiple business processes, defined through a common process definition language, as suggested in [4]. At the start of a business process, the process must register with the workflow engine. The workflow engine is responsible for keeping track of the processes, the process components, the states that must be transferred from one component to another, and the data themselves. This centralized control model enables inter-process communication between business processes.
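The following is a simplified sketch of this registration and state-tracking role (hypothetical method names, not the engine's actual API):

    // Simplified sketch of registration and state tracking; not the actual engine API.
    import java.util.HashMap;
    import java.util.Map;

    class WorkflowEngine {
        private final Map<String, Map<String, String>> processStates = new HashMap<>();

        // Called at the start of a business process; the engine begins tracking it.
        void registerProcess(String processId) {
            processStates.put(processId, new HashMap<>());
        }

        // The engine transfers state between components on the process's behalf.
        Map<String, String> stateFor(String processId) {
            return processStates.get(processId);
        }
    }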
The workflow process components connect to the workflow engine through well-defined component APIs (the subject of the next few subsections). These components range from filters and data format converters to database tools and third-party applications. Humans can take part in a business process by participating through a graphical user interface.
The workflow system in the Information Mediation framework is trigger driven. Processes and components are invoked by triggers. A trigger can be the completion of another process or component, an action by a knowledge administrator, or a timed event. Each trigger either starts a new process or invokes a subprocess within an established process. Most components that provide a human interface trigger subprocesses and/or other processes.
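A minimal sketch of the trigger model, assuming a simple dispatcher (the types below are illustrative, not the actual engine interfaces):

    // Illustrative trigger model; the three trigger kinds mirror the text above.
    enum TriggerKind { PROCESS_COMPLETED, ADMINISTRATOR_ACTION, TIMED_EVENT }

    class Trigger {
        final TriggerKind kind;
        final String target;            // the process or subprocess to invoke
        final boolean startsNewProcess;
        Trigger(TriggerKind kind, String target, boolean startsNewProcess) {
            this.kind = kind;
            this.target = target;
            this.startsNewProcess = startsNewProcess;
        }
    }

    class TriggerDispatcher {
        void fire(Trigger trigger) {
            if (trigger.startsNewProcess) {
                // start a new business process named by trigger.target
            } else {
                // invoke a subprocess within an already established process
            }
        }
    }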
Another important aspect of the workflow engine is the dynamic definition of business processes. As mentioned before, business processes are defined through the process definition language. One of the capabilities of the Information Mediation workflow engine is dynamic loading of such definitions. This capability allows knowledge administrators to change definitions of business processes in real time.
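A sketch of what such dynamic loading might look like, assuming definitions are kept in files and keyed by process name (the storage format and names are our own illustration):

    // Illustrative sketch: definitions kept in files, reloaded by process name.
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    class ProcessDefinitionRegistry {
        private final Map<String, String> definitions = new ConcurrentHashMap<>();

        // A knowledge administrator points the engine at a revised definition;
        // new process instances pick it up without restarting the engine.
        void reload(String processName, Path definitionFile) throws IOException {
            definitions.put(processName, Files.readString(definitionFile));
        }

        String definitionOf(String processName) {
            return definitions.get(processName);
        }
    }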
The Information Mediation framework uses distributed objects as the common protocol for sharing information. This choice is driven by the flexibility and cross-platform requirements of an information system. Distributed objects also offer plug-and-play interfaces and portability (see [5] for a fuller list of the advantages of using objects instead of a traditional client-server approach). A CORBA 2.0 [6] compliant ORB engine was used in the initial implementation of the Information Mediation framework.
To fully capture the current need for information sharing, and at the same time to allow flexibility and scalability, the Information Mediation framework builds its objects around a few invariants associated with an information system. These objects are placed inside the workflow system, thus allowing remote clients and systems to access information in the workflow engine. Invocations of these objects also act as triggers for business processes.
In the Information Mediation framework we identified four invariants of information systems: delivery of data, retrieval of information, processing of data, and management of information storage. Based on these invariants we defined four object interfaces using IDL, one for each invariant.
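As an illustration only, the four invariants might map onto interfaces along the following lines; the names and signatures below are hypothetical Java renderings, not the framework's actual IDL definitions:

    // Hypothetical Java rendering of the four invariants; not the actual IDL.
    interface DataDelivery {
        void deliver(byte[] rawData, String sourceDescription);   // push data into the workflow
    }

    interface InformationRetrieval {
        byte[] retrieve(String query);                             // pull information back out
    }

    interface DataProcessing {
        void process(String processName, byte[] rawData);          // trigger an enrichment process
    }

    interface StorageManagement {
        void store(String key, byte[] information);                // manage the information warehouse
        void remove(String key);
    }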
Interoperability is an important part of the Information Mediation framework because it is nearly impossible in today's world for one organization to own all the information mediation tools necessary to transform data into information and information into knowledge. Let us go back to our insurance application example. In order to receive approval of a healthcare insurance application, one must have legal status in the United States, and such status is officially recorded only at the INS. In this case interoperability between the healthcare company's information system and the INS's information system is highly desirable. With interoperability in place, the data enrichment processes and other authorities related to data, information and knowledge no longer need to reside in one location.
Unfortunately, defining a remote object as a component adds complexity to our system. Two particular complexities stand out: state control and exception handling.
As mentioned before, associated with each business process are states that the workflow engine maintains. The problem that interoperability introduces into business process definition concerns the ownership of these states. This is a tricky problem because we cannot assume that the remote information system is capable of keeping, and perhaps more importantly returning, states. In the Information Mediation workflow system we partially dealt with this problem by having the local workflow engine keep state by default; each time a business process is invoked on a remote system, the local state is copied and sent to the remote object. We have not yet found a practical way to ensure that we receive back an updated copy of the state.
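The default state-ownership rule can be sketched as follows (plain Java rather than a CORBA stub, with illustrative names): the local engine keeps the authoritative state and ships a copy with every remote invocation, merging a returned copy back only when the remote system cooperates.

    // Illustrative sketch of the default state-ownership rule.
    import java.util.HashMap;
    import java.util.Map;

    class ProcessState {
        final Map<String, String> values = new HashMap<>();

        ProcessState copy() {
            ProcessState copied = new ProcessState();
            copied.values.putAll(values);
            return copied;
        }
    }

    interface RemoteBusinessProcess {
        // In the framework this would be a CORBA stub; here it is a plain interface.
        // The remote system may or may not return an updated state.
        ProcessState invoke(String operation, ProcessState stateCopy);
    }

    class LocalEngine {
        ProcessState localState = new ProcessState();

        void invokeRemote(RemoteBusinessProcess remote, String operation) {
            ProcessState returned = remote.invoke(operation, localState.copy());
            if (returned != null) {
                localState = returned;   // best effort: only if the remote cooperates
            }
        }
    }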
Another problem introduced by interoperability and remote subprocesses is exception handling. Even though CORBA 2.0 supports some exception handling, it is not sufficient in this case, as components in remote subprocesses may become disconnected from the CORBA object implementation. Again, we have not yet come up with a suitable solution.
The transaction management component of the Information Mediation framework acts as a resource shared by all participants of the framework. It provides tools to store and retrieve data, information and knowledge into and from a database, as well as methods to perform content negotiation between multiple components, both locally and in a distributed setting. The ability to support content negotiation is important in an information-rich environment, where interactive exchange of information is often desired. Consider the following situation: two medical labs compare lab test results, and each continues its work based on the other's results. Traditional information systems let them transfer all the data at once, wait, then receive all the data back, and so on. The Information Mediation framework, with its transaction model, allows the two labs to exchange small pieces of information interactively and keep a persistent connection open until all the information exchange transactions are completed.
The Information Mediation framework's implementation of transaction processing is based on TCP transport for content negotiation and JDBC for database connectivity. We chose these two for different reasons: for content negotiation we wanted a fast and reliable communication method, while for database connectivity a platform-independent tool was required.
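A minimal sketch of this split, with a hypothetical peer host, JDBC URL, table and negotiation protocol, and with error handling omitted: information is negotiated piece by piece over one persistent TCP connection, and accepted pieces are persisted through JDBC.

    // Minimal sketch: hypothetical host, JDBC URL, table and protocol; no error handling.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;
    import java.net.Socket;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    class TransactionSketch {
        public static void main(String[] args) throws Exception {
            // Keep one connection open and exchange small pieces of information interactively.
            try (Socket peer = new Socket("lab-b.example.org", 9000);
                 PrintWriter out = new PrintWriter(peer.getOutputStream(), true);
                 BufferedReader in = new BufferedReader(new InputStreamReader(peer.getInputStream()));
                 Connection db = DriverManager.getConnection("jdbc:hypothetical:infostore")) {

                out.println("OFFER test-result 42");   // propose one piece of information
                String reply = in.readLine();          // the peer accepts, rejects or counters

                if ("ACCEPT".equals(reply)) {
                    // Persist the negotiated piece of information through JDBC.
                    try (PreparedStatement stmt = db.prepareStatement(
                            "INSERT INTO exchanged_results(id, status) VALUES (?, ?)")) {
                        stmt.setInt(1, 42);
                        stmt.setString(2, "shared");
                        stmt.executeUpdate();
                    }
                }
                // ... further OFFER/ACCEPT rounds continue over the same open connection ...
            }
        }
    }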
The ShoHealth product is used for information management control in the healthcare industry, especially for information transaction applications such as lab test transactions, patient record transactions, insurance processing, and disease tracking. For more information, contact info@sholink.com.
Many areas still need to be addressed, such as the problems created by interoperability and the use of remote process components. Other areas include security, database management and more elaborate transaction process support [7].