Conceptual Analysis of Resource Meta-information

Christian Neuss
neuss@igd.fhg.de
http://www.igd.fhg.de/~neuss/
Robert E. Kent
rekent@logos.ualr.edu
http://logos.ualr.edu/rekent.html

Abstract:
With the continuously growing amount of Internet accessible information resources, locating relevant information in the World Wide Web becomes increasingly difficult. Recent developments provide scalable mechanisms for maintaining indexes of network-accessible information.

In order to implement sophisticated retrieval engines, a means of automatic analysis and classification of document meta-information has to be found. We propose the use of methods from the mathematical theory of concept analysis to analyze and interactively explore the information space defined by wide-area resource discovery services.

Keywords:
Resource Discovery, Retrieval

1 Introduction

The World Wide Web is an Internet based information system for access to distributed hyperlinked documents. Its ease of use and the ability of seamlessly integrating other information services have paved the way for the widespread popularity the Web currently enjoys. Finding specific information however is a difficult task that cannot be accomplished through browsing. To address this problem, various mechanisms for resource discovery have been developed.

A naming scheme, such as the IETF's (Internet Engineering Task Force) Uniform Resource Name URN supports the concept of a persistent location-independent resource identifier. Directory resolution services map resource names ``back'' to actual physical locations (resource instances). By encapsulating resource meta-information together with resource names, attributes, such as ``title'', ``author'' or ``topic'', etc. are available to search engines. In order to implement sophisticated retrieval engines, a means for the automatic interpretation and classification of meta-information must be found. Furthermore, an adequate metaphor for presenting and exploring information structures must be developed.

This paper discusses recent developments in the area of information resource discovery services, and proposes the use of methods from the mathematical theory of concept analysis to process and interactively browse a large information space. The remainder of this paper is organized as follows: After giving a short overview on formats for defining document related meta-information, we analyze state of the art systems for resource discovery in the Web. Then, we introduce how Concept Analysis can be used for automatically structuring and browsing resource meta-information.

2 Resource Meta-information

The section introduces various formats for resource meta-information that are in use on the Internet and in library science.

2.1 Uniform Resource Characteristics

We here want to apply ideas from Concept Analysis to the on-going discussions in the IETF-IIIR and IETF-URI working groups in general, and the specification by Michael Mealling [Me94a,Me94b] of URCs in particular. We use the following definitions.

2.2 IAFA Templates

The Internet Anonymous FTP Archives (IAFA) working group of the IETF has proposed a format for indexing information that can be used to describe various Internet resources. The IAFA template specification [DeEm94] encodes pieces of meta-information as a record of attribute-value pairs. Among them are

The IAFA templates are intended to be both human and machine readable. Archie servers support this format to provide information about items available for anonymous ftp. Work is currently underway for the construction of Uniform Resource Identifiers.

2.2 Harvest Summary Object Interchange Format

  Harvest (see section 3.3) is a set of tools to gather, extract, and search relevant information across the Internet. It provides methods for distributed indexing, building topic specific indices, flexible search strategies, and replica systems. Harvest generates a content summary for each information object it gathers. These records contain attribute-value pairs and are stored in a format called the Summary Object Interchange Format (SOIF).

SOIF is based on a combination of the IAFA templates (see previous section) and BibTeX (a program to automatically compile a bibliography file for use with the LaTeX document preparation system [La86]). Unlike IAFA templates, SOIF templates support attribute values with arbitrary content, allowing them to span multiple lines and contain non-ASCII characters.

Furthermore, SOIF permits streams of object summaries, while IAFA templates contain only individual items. This allows for a more efficient Gatherer/Broker communication.

  
Figure 1: Excerpt from sample SOIF file

Figure 1 shows an example of an SOIF template [BDHMS94b].

2.4 Bibliographic Records from Library Science

Below, classified according to the eight areas of description of ISBD in Library Science, we list some attributes which might be relevant for a particular purpose:

  1. title and statement of responsibility (Title, Author)
  2. edition (Version)
  3. material (or type of publication) specific details
  4. publication, distribution, etc.
  5. physical description (Content-Type, Content-Length, Size, Cost, etc.)
  6. series (Time To Live)
  7. notes (Abstract)
  8. standard number and terms of availability (Uniform Resource Names, Uniform Resource Locators)

3 Resource Discovery Services

Due to the rapid growth of the World Wide Web in 1994, resource discovery has become a serious problem. Because of its decentralized architecture, the user experiences the Web as a large information repository without an underlying structure. The process of ``surfing'' pages by repeatedly following links is the most popular use of the Web. It can however lead to the phenomenon of getting ``lost in hyperspace''.

3.1 Walking the Web

From the very beginning, approaches have been made to organize information about networked information resources into catalogs and indexes. Index files were originally maintained manually. However, the rapid growth of the Web soon made necessary automatic methods for generating resource directories. Automatic tools called ``robots'', ``Web wanderers'' or ``spiders'' soon evolved. These are programs which automatically connect to a remote server and recursively retrieve documents. Since spider programs often put heavy loads on Web servers, they have been controversial, and are sometimes disliked by server maintainers.

The main problem with ``spiders'' is that they are nor ``true Web wanderers'' --- the retrieval program does not transfer itself from the index site to the provider site, but instead transfers in the reverse direction over the network all the potentially indexable documents. Since document repositories may contain hundreds of megabytes, the bandwidth requirements are enormous. Exacerbating this problem is the fact that current indexing tools gather independently, without sharing information with other indexers.

3.2 Whois++-URC

The WNILS[1] working group in the Internet Engineering Task Force (IETF) is currently defining a standard for creation of a distributed directory service called WHOIS++. It defines the notion of ``centroid'' as a mechanism for passing index information between index servers. A centroid is a list of records, attributes, and a word list for each attribute.

The word list for a given attribute contains one occurrence of every word which appears at least once in any record in the database for the attribute. In order to optimize searching, this abstracted information is passed up a hierarchical tree to other servers.

Although WHOIS++ defines a general purpose directory service, it can be employed to provide a Web specific resource discovery service. This can be achieved through a URN to URL mapping directory. By gathering and distributing URC information through a hierarchy of WHOIS++ servers, resource directories based on attributes such as ``title'' and ``keywords'' become possible.

3.3 Harvest

  HARVEST [BDHMS94a,BDHMS94b] is a distributed system for resource discovery and indexing. It separates the task of obtaining and distributing data: A gatherer collects information from a provider, while a broker provides a query interface for index requests. This approach has a variety of advantages. First of all, being able to run the gatherer at the provider site reduces server load and network traffic. Secondly, since a gatherer can feed information to many brokers, some of the redundancy described in the previous section can be avoided. Figure 2 (from [BDHMS94b]) demonstrates this configuration. Not only can brokers access more than one gatherer, but they can also be used to cascade indexed views from other brokers, using their query interface.

  
Figure 2: HARVEST Architecture

Not unlike the WHOIS++ approach, HARVEST uses a record consisting of attribute/value pairs as a unit for information indexing. They are stored in a format called the Summary Object Interchange Format (SOIF) (see section 2.3). SOIF provides a means of bracketing collections of summary objects, allowing Harvest Brokers to retrieve SOIF content summaries for many objects in a single, efficient compressed stream. An example SOIF object can be found in figure 1.

Table 1 lists some analogies between various components of resource discovery services. It provides an orientation towards the specification of resource discovery services in a distributive fashion as a federation of resource discovery software agents.

  
Table 1: Analogies between Resource Discovery Services

4 Resources are Conceptual Classes

Using ideas from Library Science and Concept Analysis, we are currently developing tools for the conceptual analysis of networked information resources in general, and the World Wide Web in particular. Networked information resources [WeDe94] include (1) individual text files, (2) WAIS databases, and (3) starting points for hypertext webs. In this section we argue by example that resources are best thought of, not as objects, but as conceptual classes (concepts). We offer a concept-oriented approach for the description and organization of networked information resources, which will facilitate their subsequent discovery and access. This should not be thought of as yet another object-oriented approach. Although objects generate their own classes, classes are not only more general but also include intensional information. By identifying concepts with classes, this can be regarded as a class-oriented approach --- an approach that has been advocated recently by Terry Winograd in the IETF-URI working group discussion on library standards and URI, and supported by Ronald Daniel and Dirk Herr-Hoyman.

4.1 Concept Analysis

Concept Analysis [Wi82] is a relatively new discipline arising out of the mathematical theory of lattices and category theory. It is closely related to the areas of knowledge representation in Computer Science and Cognitive Psychology. Concept Analysis provides for the automatic classification of both knowledge and documents via representation of a users faculty for interpretation as encoded in conceptual scales. Such conceptual scales correspond to the facets of synthetic classification schemes, such as Ranganathan's COLON scheme, in Library Science.

Concept Analysis uses objects, attributes and conceptual classes as its basic constituents. A conceptual class consists of any group of entities or objects exhibiting one or more common characteristics, traits or attributes. A characteristic is a conceptualized attribute by which classes may be identified and separated into a conceptual hierarchy, and further subdivided (specialized) by the facets of topic, form, location, chronology, etc. The ``has'' relationship between objects and attributes is represented as a binary relation called a formal context.

A formal context is a triple (G,M,I) consisting of two sets G and M and a binary incidence relation I element GxM between G and M. Intuitively, the elements of G are thought of as entities or objects, the elements of M are thought of as properties, characteristics or attributes that the objects might have, and gIm asserts that ``object g has attribute m.'' In many contexts appropriate for Web documents, the objects are documents and the attributes are any interesting properties of those documents.

The definition of a conceptual class must involve: the common attributes, which are encoded in the superordinate (next higher and more general class), and the distinguishing attributes, which differentiate the defined concept from the superordinate. Conceptual classes are logically characterized by their extension and intension.

The intent should contain precisely those attributes shared by all objects in the extent, and vice-versa, the extent should contain precisely those objects sharing all attributes in the intent. Clearly the terms ``extension'' and ``intension'' are reciprocally dependent. They complement each other by reciprocally deliminating concepts and explicating definitions. A conceptual class will consist of such an extent/intent pair.

The process of subordination of conceptual classes and collocation of objects exhibits a natural order, proceeding top-down from the more general classes with larger extension and smaller intension to the more specialized classes with smaller extension and larger intension. This order is called generalization-specialization. One class is more specialized (and less general) than another class, when its intent contains the other's intent, or equivalently, when the opposite ordering on extents occurs. Conceptual classes with this generalization-specialization ordering form a class hierarchy for the formal context. Knowledge is here represented as the hierarchical structure, or complete lattice, known as the lattice of conceptual classes of the formal context. Such lattices of classes can be visualized by line diagrams, where nodes represent conceptual classes and edges represent the subclass (subtype) relationship.

The join of a collection of conceptual classes represents the common attributes or shared characteristics of the classes. The bottom of the conceptual hierarchy (the empty join) represents the most specific class whose intent consists of all attributes and whose extent is often empty. The meet of a collection of conceptual classes represents the conjunction of all the attributes of the classes. The top of the conceptual hierarchy (the empty meet) represents the universal class whose extent consists of all objects. The entire conceptual class hierarchy is implicitly specified by the ``has'' relationship of the formal context. However, part of the hierarchy of conceptual classes could also be explicitly specified via the following top-down process.

4.2 Specification of Resource Meta-information

Table 2 lists two generic interchange formats which can be used to specify faceted information in conceptually scaled networked information resources. Such faceted information can occur in various interfaces in a resource discovery system. From a mathematical viewpoint, these two representations are equivalent to each other. Software exists for converting between the two forms. The left side of table 2 displays the Formal Context Interchange Format (FCIF). FCIF is oriented towards the formal contexts of Concept Analysis. FCIF represents order-theoretic formal contexts of networked information resources, consisting of two partially ordered sets, a poset of objects and a poset of single-valued attributes, and an order-preserving incidence matrix which represents the has relationship between objects and attributes. The right side of table 2 displays the Concept Lattice Interchange Format (CLIF). CLIF is oriented towards the concept lattices of Concept Analysis. CLIF provides a storage-optimal representation of order-theoretic lattices of conceptual classes for networked information resource meta-information, consisting of (the inverse relationships for) two generator monotonic functions, from the posets of objects and attributes to the lattice of conceptual classes, and a successor matrix which represents the subtype relationship between conceptual classes.

In this section we demonstrate that FCIF and CLIF subsume both the Uniform Resource Characteristics of WHOIS++ and the Summary Object Interchange Format of HARVEST. These interchange formats are more general mechanisms than either URCs or SOIFs, and allow for the specification of more complex conceptually structured systems of resources. Actually, as Figure 3 points out, both FCIF and CLIF are better thought to occur after conceptual scaling, whereas both URC and SOIF specify ``raw meta-information'' which exists before conceptual scaling [GaWi89]. From the philosophical viewpoint of Concept Analysis, conceptual scaling is an act of interpretation, which maps raw uninterpreted data, such as occurs in URC or SOIF, to a user's view. URC and SOIF represent entity relations, whereas FCIF and CLIF represent incidence data between objects and attributes. These attributes are simple structured queries of the form tag#value, where # is any relational operator +, <=, etc. The equality operator represents nominal scaling, whereas the inequality operator <= represents ordinal scaling [GaWi89]. Through conceptual scaling, which often is just nominal or ordinal scaling, we can compare FCIF and CLIF with URC and SOIF.

The OBJECT section of the FCIF specifies, in a transposed fashion, a required ordering relationship between resources: Oi1 <= Oi2 iff Oi1 is listed in the row indexed by Oi2

This OBJECT section can specify both generalization-specialization and part-whole relationships between resources. When parts are typed, part-whole relationships can be embedded as generalization-specialization relationships --- the whole is a special case of an object which has that part. An important example of generalization-specialization relationships occurs in WHOIS++-URC. Here the instantiation relationship between an URN and its URL instances {URL1, URL2,...,URLu} is a generalization-specialization relationship. This instantiation relationship can be specified by adding a row indexed by the URN to the OBJECT section of the FCIF:

    OBJECT
      ...
      URN {URL1,URL2,...,URLu}
      ...
An important example of whole-part relationships occurs in HARVEST SOIF. Here the embedded relationship that occurs between an archived directory structure and one of its summarized files is a whole-part relationship. This directory-file whole-part relationship can be specified in an analogous way in the OBJECT section of the FCIF.

  
Figure 3: Conceptual Scaling with Various Interchange Formats

4.3 Examples

  
Table 2: Interchange Formats for Faceted Resource Meta-information

  
Table 3: Conceptual Scaling of a URC

  
Table 4: SOIF and the corresponding FCIF

  
Table 5: Specification of a URC (TEI-SGML)

  
Table 6: Specification of a URC (FCIF)

  
Table 7: formal context for BibTeX types and required tags

  
Figure 4: lattice of classes for BibTeX types and required tags

5 Conclusions

In this paper we have demonstrated that Concept Analysis provides a solid foundation for the development of a principled approach towards the specification of networked information resource meta-information. We have discussed in particular, how the meta-information in two well-known resource discovery services, WHOIS++ and HARVEST, can both be subsumed in a more general and well-structured approach which uses ideas from Concept Analysis. Future work will aim at developing a graphical user interface for a resource discovery vizualisation system built on top of these services.

References

BDHMS94a
Bowman, Mic; Danzig, Peter; Hardy, Darren; Manber, Udi; Schwartz, Michael: Harvest: A Scalable, Customizable Discovery and Access System, Technical Report CU-CS-732-94, Department of Computer Science, University of Colorado, Boulder, July 1994.
BDHMS94b
Bowman, Mic; Danzig, Peter; Hardy, Darren; Manber, Udi; Schwartz, Michael: The Harvest Information Discovery and Access System, Proceedings of the Second International WWW conference, pp. 763-771, Chicago, October 1994.
DeEm94
Deutsch, Peter; Emtage, Alan: Publishing Information on the Internet with Anonymous FTP, Bunyip Information Systems Inc., May 1994.
DeWe94
Deutsch, Peter; Schoultz, Rickard; Faltstrom, Patrik; Weider, Chris: Architecture of the Whois++ Service, Internet draft document draft-ietf-wnils-whois-arch-01.txt, July 25, 1994.
GaWi89
Ganter, B.; Wille, R.: Conceptual Scaling, Applications of Combinatorics and Graph Theory in the Biological and Social Sciences, Springer, New York, 1989.
Ke94
Kent, Robert E.: Enriched Interpretation, in the Proceedings of the Third International Workshop on Rough Sets and Soft Computing (RSSC'94), November 10--12, 1994, San Jose, California.
La86
Lamport, Leslie: LaTeX: A Document Preparation System, Addison Welsey, Reading, MA, 1986.
Le79
van Leunen, Mary-Claire: A Handbook for Scholars, Alfred A. Knopf, New York, 1979.
Ly93
Lynch, Clifford A.: A Framework for Identifying, Locating, and Describing Networked Information Resources, Internet Draft draft-ietf-lynch.overview.txt (March 24, 1993) of the Internet Engineering Task Force (IETF).
Me94a
Mealling, Michael: Specification of Uniform Resource Characteristics, Internet Draft draft-ietf-uri-urc-00.txt (April 5, 1994) of the Internet Engineering Task Force (IETF).
Me94b
Mealling, Michael: Encoding and Use of Uniform Resource Characteristics, Internet Draft draft-ietf-uri-urc-spec-00.txt (July 1, 1994) of the Internet Engineering Task Force (IETF).
Wy80
Wynar, B.S.: Introduction to Cataloging and Classification, 6th ed., Libraries Unlimited, Littleton, Colorado 1980.
WeDe94
Weider, Chris; Deutsch, Peter: Architecture of the Whois++ Index Service, Internet draft document draft-ietf-wnils-whois-03.txt, 1994.
Wi82
Wille, Rudolf: Restructuring Lattice Theory: An Approach Based on Hierarchies on Concepts, In I. Rival, editor, Ordered Sets, pp. 445--470. Reidel, Dordrecht-Boston, 1982.


Footnotes:
[1]
WHOIS and Network Information Lookup Service