Constructing a Corporate Memory Infrastructure from Internet Discovery Technologies

presented by

Minh Huynh, Marble Associates, Inc.

Laird Popkin, Marble Associates, Inc.

Matthew Stecker, Marble Associates, Inc.

ABSTRACT

While the advent of powerful, sophisticated Internet discovery technologies has prompted many organizations to look to the Internet as a source of information and as a medium for publishing information, a particularly powerful application of these same Internet technologies has been largely overlooked. Many corporations today are in fact microcosms of the Internet in their use of heterogeneous, geographically disparate networks. Internet discovery technologies already provide the mechanisms necessary to construct an information infrastructure that allows the corporation to quantify, qualify, and leverage its existing information resources distributed across a currently unmanageable set of machines.

At Marble Associates, we have augmented and integrated existing Internet discovery technologies (Mosaic, Web, and WAIS) to form a Corporate Memory Infrastructure that (1) captures and leverages existing information assets, thereby promoting reuse, and (2) facilitates information dissemination across organizational boundaries. The framework, consisting of methodologies as well as enabling technologies, allows our employees to access information resources by content and attribute, to submit new or modified resources, and to relate interdependent resources. This same framework, when designed correctly, incurs little overhead in maintenance and use, further amplifying its benefits.

This paper examines the problems with leveraging information in the past and presents the need for and a description of a Corporate Memory Infrastructure (CMI). The business case for instituting a Corporate Memory Infrastructure is compelling. This need, coupled with the fact that many of the components are publicly available, presents tremendous opportunities for Mosaic, Web, and other Internet discovery technologies in the corporate arena.

1.0 The Business Imperative

As today's corporations move from hierarchical models of operation to flatter, leaner, and more flexible business models, their growing information needs magnify the need for sharing and disseminating information across organizational boundaries. Internal and external competitive pressures require new, time-constrained response and, as a result, more empowered knowledge workers. The emerging model is typically team-based (cross-functional rather than departmental) and employs distributed decision making as opposed to centralized control. In addition, the information pool for an organization can be boundless, often encompassing many sources well outside of the organization itself.

The emerging operating model demands a high degree of information reuse and sharing of "live" data. Individual workers must have access to a vast set of corporate resources, with appropriate filtration, flexible search mechanisms, and notification capabilities. These needs mandate a new technology: one that enables the sharing and the distribution of information (which is itself not so new) but with

access mechanisms that are flexible and semantically rich,
ingestion of resources into the corporate-wide pool that is dynamic, and
management of the resource set, because of its scale, that is automated.

A Corporate Memory Infrastructure addresses this problem, providing efficient, meaningful resource sharing while incurring minimal overhead.

2.0 Specifications for a Candidate Solution

Comprehensive resource sharing solutions must solve three aspects of the problem:

access (simply getting to the resource) - users must be able to locate resources of interest;
content (understanding and manipulating the resource) - users must be able to open a file in an appropriate application for viewing and modification; and
context (understanding various circumstances under which the resource was created) - users must have access to attribute information that is not embedded in the resource itself.

Networks have provided users with access; in the most literal sense of resource sharing, users can physically move files from place to place. Information dissemination and searching methods, under most networking schemes, are still primitive and unstructured, ranging from verbal communication to email and news services.

The issue of content is solved more by policy than technology; typically, an organization decides on a few cross-platform formats based on applications that are popular and common to its user community. Invariably, users will want to use applications and formats that fall outside of corporate standards. It is this data that is typically most critical, since the users who comprehend its content and context are in the minority. Hence, it is this data that demands context-rich presentation.

The problem of context is addressed by metadata. Metadata (or attribute data) is simply information about the resources. Examples include a binary file's purpose, version, authors, and associated project. Metadata provides the context for locating and understanding resources. Users must be shown this information on the retrieval of a resource and also be allowed to search according to attributes. Metadata is organization-specific; each user community will determine what attributes and values are most useful for its resource pool and culture. Furthermore, a candidate solution must have the flexibility to provide this context for a constantly changing resource set.
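As a minimal sketch of what such a metadata record might look like, the Python fragment below keeps attribute data beside a resource rather than inside it. The class name, the field names, and the `extra` dictionary for organization-specific keys are our assumptions for illustration; they are not taken from Marble's actual scheme.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional


@dataclass
class ResourceMetadata:
    """Attribute data kept alongside a resource, not embedded in it."""
    path: str                                          # where the resource lives
    purpose: str = ""
    version: str = ""
    authors: List[str] = field(default_factory=list)   # multi-valued attribute
    project: str = ""
    modified: Optional[date] = None
    # Hypothetical escape hatch for organization-specific keys.
    extra: Dict[str, str] = field(default_factory=dict)


def matches(record, **criteria):
    """Return True when the record satisfies every given attribute value."""
    for key, wanted in criteria.items():
        value = getattr(record, key, None)
        if value is None:
            # Fall back to the organization-specific keys.
            value = record.extra.get(key)
        if isinstance(value, list):
            if wanted not in value:
                return False
        elif value != wanted:
            return False
    return True
```

A record built this way can be shown on retrieval and also tested against search attributes, for example `matches(record, authors="marc")`.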

In addition to the problem of context, a candidate solution must provide flexible means of finding a user's resources of interest:

notification of relevant or time-sensitive information - this mechanism allows a passive user to be updated based on interest profiles (subscription services) as well as on a "what's new" basis;
traditional directory browsing - users must be allowed to revert to the traditional act of "walking the file system" to locate resources of interest;
content search - the system must allow content queries on textual resources; and
attribute search - the system must allow users to retrieve resources based on their attributes.
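The notification mechanism in the list above can be sketched as a match between interest profiles and newly ingested resources. This is only an illustration under stated assumptions: the profile format (a dictionary of attribute values, any one of which triggers a notification) and the function name `notify` are hypothetical, and the paper does not describe how subscriptions are actually stored.

```python
def notify(profiles, new_resources):
    """Map each user to the new resources matching any entry in their profile.

    profiles      -- {user: {attribute: wanted_value}} (hypothetical format)
    new_resources -- list of dicts, each with a "path" plus attribute values
    """
    inbox = {}
    for user, wanted in profiles.items():
        hits = [
            resource["path"]
            for resource in new_resources
            if any(resource.get(key) == value for key, value in wanted.items())
        ]
        if hits:
            # Users with no matching resources receive nothing.
            inbox[user] = hits
    return inbox
```

Running the same function over everything ingested since the last run, with an empty filter step, would give the "what's new" listing.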

In addition to these access mechanisms, the candidate solution should adhere to some practical guidelines. Obviously, any plausible system must work with the applications and work processes that already exist in the organization, complementing (rather than supplanting) the structures and mechanisms already in place by which a corporation captures and uses data. Typically, this means that the solution system must operate across a number of computer platforms and connectivity mechanisms.

Second, the creation of meaningful metadata should incur minimal overhead - the system should extract as much information about the resources as possible automatically, and users should be able to submit and modify metadata for their own resources conveniently.

Additionally, the metadata scheme must be extensible to accommodate future keys and values as the corporation grows to assimilate new types of resources and as its resource creation and management methodologies evolve. The scheme should also allow users to interrelate their resources to express inter-dependencies.

3.0 Marble's CMI

Marble's CMI is our implementation of a solution system that aspires to the specification listed above. Properly defined, our CMI is not a product so much as a combination of management discipline and enabling technologies that helps us foster a high degree of reuse and efficient communication. In its generic form, CMI is a framework that organizations will evolve to meet their specific information needs.

3.1 Architectural Overview

Figure 1 presents the abstracted, simplified architecture of Marble's CMI. In the ensuing discussion, we will examine each component and interaction in detail. Section 3.2 illustrates the access mechanisms and traces their individual flow through the architecture.

FIGURE 1. CMI Components

Users access the system through a uniform Web (HTML/HTTP) interface. The Web server employs WAIS to perform content searches. The addition of a metadata subsystem captures and retrieves attribute information, allowing the creation of HTML navigation and retrieval pages on-the-fly. Additionally, our CMI institutes a number of facilities for the generation, submission, and synchronization of metadata. The following sections will illustrate the added functionalities.

3.2 Access Mechanisms

As each user accesses CMI through the Web browser, a few top-level navigation and introductory pages inform him of the possible uses, searching avenues, and categories of information available. Figure 2 shows the home page for Marble's CMI (which we call KnowNet), where we have categorized a sample resource set into Softproduct (documents), Hardproduct (source code, binaries, etc.) and External Resources. Navigating down through the hierarchy of HTML pages, the user will encounter certain search pages, depending on his choices.

The WAIS search page (Figure 3) is well understood by the Web community. The user simply enters search terms and submits the query. Typically, the search returns a list of qualified resources and their numeric relevance ranking. The resource names are formatted as links to the actual files, facilitating retrieval.

In Marble's CMI, we take additional steps to display some relevant metadata and to present the relevance ranking in a graphical manner (Figure 4). On return, we display some metadata for the qualifying resources - authors, publication, date, and synopsis (abbreviated). The example in Figure 3 performs a WAIS search for the term "ISDN" and finds a staff member biography and a product information sheet as the first two qualifiers. Given the synopsis for a resource, the user can make an informed decision as to whether or not he should retrieve the resource. This informed decision is critical if the resources are large and the search terms general. Many members of our staff dial into the corporate LAN via SLIP lines, through which the retrieval of large, unwanted resources is prohibitively expensive and time-consuming.

The search launched by the user through Web triggers the WAIS subsystem to perform the search in the standard manner. On return, our modifications examine the list of qualified resources and relevance ranking and, for each resource, retrieve the metadata. In addition, we take the relevance ranking and produce a bar graph (preformatted GIF) that reflects the percentage to within 5%.
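The post-processing step described above can be sketched as follows. This is an illustration only: the image naming scheme (`bar_000.gif` through `bar_100.gif`), the metadata keys, and the 80-character synopsis limit are assumptions, and the maximum score is passed in rather than assumed from WAIS.

```python
from html import escape


def bar_image(score, max_score):
    """Pick a preformatted bar image for a relevance score, in 5% steps."""
    percent = 100.0 * score / max_score if max_score else 0.0
    step = int(percent / 5 + 0.5) * 5      # round to the nearest multiple of 5
    step = max(0, min(100, step))          # clamp to the images that exist
    return "bar_%03d.gif" % step


def render_hit(path, score, max_score, metadata, synopsis_limit=80):
    """Format one search hit as an HTML list item with abbreviated metadata."""
    synopsis = metadata.get("synopsis", "")
    if len(synopsis) > synopsis_limit:
        synopsis = synopsis[:synopsis_limit].rstrip() + "..."
    template = (
        '<li><img src="{img}" alt="relevance"> '
        '<a href="{href}">{title}</a> ({authors}, {date})<br>{synopsis}</li>'
    )
    return template.format(
        img=bar_image(score, max_score),
        href=escape(path, quote=True),
        title=escape(metadata.get("title", path)),
        authors=escape(", ".join(metadata.get("authors", []))),
        date=escape(metadata.get("date", "")),
        synopsis=escape(synopsis),
    )
```

Because the bar images are preformatted, only a fixed set of 21 files is needed and nothing has to be drawn at query time.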

FIGURE 2. The Home Page for Marble's CMI

FIGURE 3. Content Search via WAIS

FIGURE 4. Content Search Results

Alternately, the user may search for resources by attribute (Figure 5). Here, the Mosaic form takes search parameters for fields in the organization's metadata scheme and launches a metadata query. On return, Perl scripts format the results to display a context-rich page much like that returned by our content search.

In Figure 5, the user selects "Communications of the ACM" in the Publication field, leaves the "Author" and "Synopsis" fields blank, and hits the SUBMIT button. On return (Figure 6), we see a list of qualified resources displayed by title, replete with author names, descriptions (abbreviated), publication dates, etc. This search hits a set of magazine articles we ingested from a Computer Select CD-ROM. We have produced a set of utilities that automate the process of parsing articles and extracting metadata.
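A minimal sketch of how such a form might be turned into a metadata query is given below, using Python's standard `sqlite3` module in place of our relational database. The table name `resource`, its columns, and the use of substring matching are assumptions made for illustration; blank form fields simply place no constraint on the search.

```python
import sqlite3

# Hypothetical searchable fields; a real scheme is organization-specific.
SEARCHABLE = ("author", "publication", "synopsis")


def attribute_search(conn, form):
    """Run a metadata query built from the non-blank fields of a form."""
    clauses, params = [], []
    for name in SEARCHABLE:            # only known column names reach the SQL
        value = (form.get(name) or "").strip()
        if value:
            clauses.append(name + " LIKE ?")
            params.append("%" + value + "%")   # values are bound, not spliced
    sql = "SELECT title, author, publication FROM resource"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY title"
    return conn.execute(sql, params).fetchall()
```

The returned rows can then be formatted into a results page in the same way as the content search results.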

FIGURE 5. Metadata Search

FIGURE 6. Metadata Search Results

An additional access mechanism we provide is browsing by metadata. The user can browse resources much like file browsing, only we replace the Unix "ls" command that populates a node with a metadata query that lists available resources given the selected query constraints. Figure 7 shows this capability. The page lists the set of attributes. When the user selects "Author", the next page (Figure 8) lists the possible values for authors in the system (for the Softproduct resource set).

FIGURE 7. Metadata Browsing

FIGURE 8. Metadata Browsing (continued)

The user can then select an "Author" value such as "marc", and the attribute-listing page returns (Figure 9). Note that the first URL on the page reports the current query criteria (Author = "marc"). The user can view all resources that meet those criteria by selecting that URL. Alternatively, the user can specify additional search constraints by successively choosing attributes and values.

The system thus generates this portion of its navigation structure dynamically, eliminating any manually maintained list of references. The maintenance of HTML pages and their references, as we have found, imposes considerable overhead.

FIGURE 9. Metadata Browsing (continued)

While this mechanism is flexible, the organization can fix the order of keys, effectively providing a hierarchical browsing capability based on metadata. For instance, the user can select "Author", view a list of values for "Author", and select the value "marc" from that list. The next page would display values for "Publication" for which "Author = marc". At any point, the user may launch a search with the constructed criteria or continue to descend and constrain the search.
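The hierarchical browsing just described can be sketched as a function that, given a fixed key order and the values chosen so far, returns the next key and the values still available for it. The in-memory list of attribute dictionaries stands in for the metadata query, and the function names are hypothetical.

```python
def values_for(resources, key, criteria):
    """List the distinct values of `key` among resources meeting `criteria`."""
    found = set()
    for attrs in resources:
        # Every chosen value must appear among the resource's (multi-valued)
        # attributes for the resource to qualify.
        if all(value in attrs.get(k, ()) for k, value in criteria.items()):
            found.update(attrs.get(key, ()))
    return sorted(found)


def browse(resources, key_order, chosen):
    """Return the next key to offer and its values, given values chosen so far."""
    criteria = dict(zip(key_order, chosen))
    if len(chosen) >= len(key_order):
        return None, []                # hierarchy exhausted; launch the search
    next_key = key_order[len(chosen)]
    return next_key, values_for(resources, next_key, criteria)
```

Each call corresponds to one dynamically generated page: the values become links, and following a link appends that value to `chosen`.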

3.3 Additional Components

A number of daemons facilitate the management of CMI and its contents. Aside from the WAIS indexing run nightly, we have utilities that synchronize the file system and our metadata, which is implemented in a relational database. One such daemon checks for metadata blocks of files that have been deleted from the file system and removes such blocks. Conversely, another daemon detects new resources in the file system and generates "skeletal" metadata based on some simple rules. A third daemon detects metadata entries that are "skeletal" and sends a form to the files' owners requesting additional attribute information. A fourth daemon processes the replies from users and submits the users' attribute data into CMI.
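The first three daemons can be sketched as a single reconciliation pass. This is an illustration under stated assumptions: the metadata store is modelled as a dictionary keyed by path rather than a relational database, the only "simple rule" shown is deriving a title from the file name, and the function names are hypothetical.

```python
import os


def scan_files(root):
    """Collect the paths of all files under `root`."""
    paths = set()
    for directory, _subdirs, names in os.walk(root):
        for name in names:
            paths.add(os.path.join(directory, name))
    return paths


def reconcile(file_paths, metadata):
    """Bring `metadata` (a dict keyed by path) in line with the file system.

    Returns the lists of paths whose metadata was removed and added.
    """
    # Metadata blocks whose files have been deleted are dropped.
    removed = sorted(p for p in metadata if p not in file_paths)
    for path in removed:
        del metadata[path]
    # New files receive skeletal metadata derived from simple rules.
    added = sorted(p for p in file_paths if p not in metadata)
    for path in added:
        metadata[path] = {"title": os.path.basename(path), "skeletal": True}
    return removed, added


def needs_owner_input(metadata):
    """List the resources whose owners should be sent a metadata form."""
    return sorted(p for p, entry in metadata.items() if entry.get("skeletal"))
```

A nightly job might call `reconcile(scan_files(root), metadata)` and then mail a form for each path returned by `needs_owner_input`.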

The set of daemons, combined with dynamic formatting of HTML, makes the system self-maintaining. The only modifications we make are adding keys to the metadata scheme or rearranging the topology of the top-level, introductory pages.

3.4 Future Work

As the reader may have already noticed, our combination of metadata and the standard file system effectively synthesizes an N-dimensional, end-user customizable file system. The Unix file system already has metadata in its inodes (owner, date, size, etc.). A file system that allows the user to extend the metadata fields would replace our modelling of attribute information in a relational database (which incurs the overhead of keeping the file system and database synchronized). This file system would need to support the various uses of metadata by providing, at a minimum, the types allowed by our relational database: unlimited length text, fixed-length strings, integers, and dates.

Marble is also exploring the use of an object database to model our metadata scheme. The use of a relational database appeals to many corporations, since the technology is well-understood and vendors are readily available. However, modelling multi-valued attributes with an arbitrary number of values (Author = "marc" "kirk" "ray") is more natural in an object-oriented database. Additionally, a robust object-oriented database could be extended to contain the resources themselves as well as the metadata. This would, in effect, compose an extensible semantic file system, with the advantages listed above.
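The contrast can be illustrated with a small sketch. In a relational store, one common way to hold a multi-valued attribute is a separate table with one row per value, which must be joined and regrouped to recover what an object database would hold as a single list. The two-table layout and function names below are assumptions for illustration, written against Python's standard `sqlite3` module, and do not describe Marble's actual schema.

```python
import sqlite3


def make_store():
    """Create an in-memory store with one row per attribute value."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE resource (id INTEGER PRIMARY KEY, path TEXT UNIQUE);
        CREATE TABLE attribute (
            resource_id INTEGER REFERENCES resource(id),
            key TEXT,
            value TEXT
        );
    """)
    return conn


def add_resource(conn, path, attributes):
    """Insert a resource; each value of each key becomes its own row."""
    cursor = conn.execute("INSERT INTO resource (path) VALUES (?)", (path,))
    rows = [
        (cursor.lastrowid, key, value)
        for key, values in attributes.items()
        for value in values
    ]
    conn.executemany("INSERT INTO attribute VALUES (?, ?, ?)", rows)


def load_object(conn, path):
    """Regroup the rows into the shape an object database would store directly."""
    query = (
        "SELECT a.key, a.value FROM attribute a "
        "JOIN resource r ON r.id = a.resource_id "
        "WHERE r.path = ? ORDER BY a.rowid"
    )
    obj = {}
    for key, value in conn.execute(query, (path,)):
        obj.setdefault(key, []).append(value)
    return obj
```

Three authors therefore cost three rows and a join on every read, whereas the regrouped dictionary returned by `load_object` is the form in which an object database would store the record in the first place.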

Additionally, versioning is not a service provided by our CMI. At the time, we considered using RCS, SCCS, or PVCS as the underlying archival system. The original difficulty lay in Marble's REFERENCES field - a multi-valued metadata field that allows users to interrelate and package resources. While we can certainly offer the user a view of all versions of a resource, the overhead of keeping versioning information of its dependencies intact and attached to that resource was daunting. Fortunately, several high-end versioning systems (notably ClearCase) can solve many of those problems by versioning entire directories, retrieving a set of interrelated resources (each with specific versions), and providing limited metadata extensibility. This last capability of ClearCase, although limited, is the closest contender to an N-dimensional, user-extensible file system we have found to date.

Current work on CMI includes investigation of the above options as well as its integration with active information assets (live channels such as news feeds and email).

Concurrent with these efforts is the delivery of CMI to one of our clients. This particular client will use CMI to bridge geographically disparate development sites. Specifically, the client will use CMI to unite and normalize source code libraries and resultant binaries across constituent sites and publish documentation about the corporate-wide code set.

For additional information...

please contact

cmi-team@marble.com

or write to

Marble Associates, Inc.

2929 Campus Drive, Suite 245

San Mateo, CA 94403

Marble Associates, Inc.

950 Winter Street, Suite 1700

Waltham, MA 02154

About the authors

Minh Huynh, a member of Marble's Pacific Region, joined the consulting firm in 1992. His focus, echoing the primary business of Marble Associates, largely concerns reengineering business processes in large-scale corporations. As such, he is particularly interested in bringing leading-edge technologies to the corporate arena as they mature to solve business, process, and organizational problems.

Minh was the project lead on the development of CMI at Marble. His involvement focused on the design and implementation of the metadata subsystem and the extension of CMI from a simple publication tool towards a semantic filesystem.

Minh holds the B.S. degree in Electrical Engineering and the B.A. in Computer Science from Rice University.

Laird Popkin is a technology consultant and engineer for Marble Associates, Inc. His interests lie in designing and building the tools to integrate heterogeneous computer networks including personal computers, Unix workstations and massively parallel supercomputers, as well as database-, file- and print-servers. He was responsible for designing and implementing Macintosh SCSI mass storage device drivers, tools for device verification and manufacture, and user applications. He has coordinated production of linear and interactive (touchscreen) multimedia presentations incorporating data visualization and video production, using stand-alone Macintosh computers, video monitors and walls, and heterogeneous networks of Macintosh computers, Sun and SGI Unix workstations, and Unix supercomputers.

Laird is currently the development manager on Marble's delivery of CMI to a client organization. His involvement in the development of CMI includes the design and implementation of the Web and WAIS interfaces, the dynamically generated navigation structure, and the HTML pages that interface with the metadata subsystem.

Laird holds the B.S. degree in Mathematics with a concentration in Computer Science from the University of North Carolina at Greensboro.

Matthew Stecker came to Marble in 1993, with both consulting experience in an object-oriented environment and a diverse background in systems technology. Matthew has worked on several large-scale reengineering engagements, helping to match IS requirements to business value through the use of rapid application development and rapid prototyping. Matthew has consulted for a range of industries, including VMS programming for the United States State Department and IBM mainframe development for a large Philadelphia retail chain.

Matthew has provided many valuable insights that shaped the design and implementation of CMI, with particular emphasis on an organization's information needs, policies, standards, and methodologies.

Matthew holds the B.A. degree in Computer Science and Political Science from Duke University and holds the J.D. degree from The University of North Carolina at Chapel Hill.