neuss@igd.fhg.de
http://www.igd.fhg.de/~neuss/
Robert E. Kent
rekent@logos.ualr.edu
http://logos.ualr.edu/rekent.html
In order to implement sophisticated retrieval engines, a means of automatic analysis and classification of document meta-information has to be found. We propose the use of methods from the mathematical theory of concept analysis to analyze and interactively explore the information space defined by wide-area resource discovery services.
The World Wide Web is an Internet based information system for access to distributed hyperlinked documents. Its ease of use and the ability of seamlessly integrating other information services have paved the way for the widespread popularity the Web currently enjoys. Finding specific information however is a difficult task that cannot be accomplished through browsing. To address this problem, various mechanisms for resource discovery have been developed.
A naming scheme, such as the IETF's (Internet Engineering Task Force) Uniform Resource Name URN supports the concept of a persistent location-independent resource identifier. Directory resolution services map resource names ``back'' to actual physical locations (resource instances). By encapsulating resource meta-information together with resource names, attributes, such as ``title'', ``author'' or ``topic'', etc. are available to search engines. In order to implement sophisticated retrieval engines, a means for the automatic interpretation and classification of meta-information must be found. Furthermore, an adequate metaphor for presenting and exploring information structures must be developed.
This paper discusses recent developments in the area of information resource discovery services, and proposes the use of methods from the mathematical theory of concept analysis to process and interactively browse a large information space. The remainder of this paper is organized as follows: After giving a short overview on formats for defining document related meta-information, we analyze state of the art systems for resource discovery in the Web. Then, we introduce how Concept Analysis can be used for automatically structuring and browsing resource meta-information.
The section introduces various formats for resource meta-information that are in use on the Internet and in library science.
We here want to apply ideas from Concept Analysis to the on-going discussions in the IETF-IIIR and IETF-URI working groups in general, and the specification by Michael Mealling [Me94a,Me94b] of URCs in particular. We use the following definitions.
The Internet Anonymous FTP Archives (IAFA) working group of the IETF has proposed a format for indexing information that can be used to describe various Internet resources. The IAFA template specification [DeEm94] encodes pieces of meta-information as a record of attribute-value pairs. Among them are
The IAFA templates are intended to be both human and machine readable. Archie servers support this format to provide information about items available for anonymous ftp. Work is currently underway for the construction of Uniform Resource Identifiers.
Harvest (see section 3.3) is a set of tools to gather, extract, and search relevant information across the Internet. It provides methods for distributed indexing, building topic specific indices, flexible search strategies, and replica systems. Harvest generates a content summary for each information object it gathers. These records contain attribute-value pairs and are stored in a format called the Summary Object Interchange Format (SOIF).
SOIF is based on a combination of the IAFA templates (see previous section) and BibTeX (a program to automatically compile a bibliography file for use with the LaTeX document preparation system [La86]). Unlike IAFA templates, SOIF templates support attribute values with arbitrary content, allowing them to span multiple lines and contain non-ASCII characters.
Furthermore, SOIF permits streams of object summaries, while IAFA templates contain only individual items. This allows for a more efficient Gatherer/Broker communication.
Figure 1: Excerpt from sample SOIF file
Figure 1 shows an example of an SOIF template [BDHMS94b].
Below, classified according to the eight areas of description of ISBD in Library Science, we list some attributes which might be relevant for a particular purpose:
Due to the rapid growth of the World Wide Web in 1994, resource discovery has become a serious problem. Because of its decentralized architecture, the user experiences the Web as a large information repository without an underlying structure. The process of ``surfing'' pages by repeatedly following links is the most popular use of the Web. It can however lead to the phenomenon of getting ``lost in hyperspace''.
From the very beginning, approaches have been made to organize information about networked information resources into catalogs and indexes. Index files were originally maintained manually. However, the rapid growth of the Web soon made necessary automatic methods for generating resource directories. Automatic tools called ``robots'', ``Web wanderers'' or ``spiders'' soon evolved. These are programs which automatically connect to a remote server and recursively retrieve documents. Since spider programs often put heavy loads on Web servers, they have been controversial, and are sometimes disliked by server maintainers.
The main problem with ``spiders'' is that they are nor ``true Web wanderers'' --- the retrieval program does not transfer itself from the index site to the provider site, but instead transfers in the reverse direction over the network all the potentially indexable documents. Since document repositories may contain hundreds of megabytes, the bandwidth requirements are enormous. Exacerbating this problem is the fact that current indexing tools gather independently, without sharing information with other indexers.
The WNILS[1] working group in the Internet Engineering Task Force (IETF) is currently defining a standard for creation of a distributed directory service called WHOIS++. It defines the notion of ``centroid'' as a mechanism for passing index information between index servers. A centroid is a list of records, attributes, and a word list for each attribute.
The word list for a given attribute contains one occurrence of every word which appears at least once in any record in the database for the attribute. In order to optimize searching, this abstracted information is passed up a hierarchical tree to other servers.
Although WHOIS++ defines a general purpose directory service, it can be employed to provide a Web specific resource discovery service. This can be achieved through a URN to URL mapping directory. By gathering and distributing URC information through a hierarchy of WHOIS++ servers, resource directories based on attributes such as ``title'' and ``keywords'' become possible.
HARVEST [BDHMS94a,BDHMS94b] is a distributed system for resource discovery and indexing. It separates the task of obtaining and distributing data: A gatherer collects information from a provider, while a broker provides a query interface for index requests. This approach has a variety of advantages. First of all, being able to run the gatherer at the provider site reduces server load and network traffic. Secondly, since a gatherer can feed information to many brokers, some of the redundancy described in the previous section can be avoided. Figure 2 (from [BDHMS94b]) demonstrates this configuration. Not only can brokers access more than one gatherer, but they can also be used to cascade indexed views from other brokers, using their query interface.
Figure 2: HARVEST Architecture
Not unlike the WHOIS++ approach, HARVEST uses a record consisting of attribute/value pairs as a unit for information indexing. They are stored in a format called the Summary Object Interchange Format (SOIF) (see section 2.3). SOIF provides a means of bracketing collections of summary objects, allowing Harvest Brokers to retrieve SOIF content summaries for many objects in a single, efficient compressed stream. An example SOIF object can be found in figure 1.
Table 1 lists some analogies between various components of resource discovery services. It provides an orientation towards the specification of resource discovery services in a distributive fashion as a federation of resource discovery software agents.
Table 1: Analogies between Resource Discovery Services
Using ideas from Library Science and Concept Analysis, we are currently developing tools for the conceptual analysis of networked information resources in general, and the World Wide Web in particular. Networked information resources [WeDe94] include (1) individual text files, (2) WAIS databases, and (3) starting points for hypertext webs. In this section we argue by example that resources are best thought of, not as objects, but as conceptual classes (concepts). We offer a concept-oriented approach for the description and organization of networked information resources, which will facilitate their subsequent discovery and access. This should not be thought of as yet another object-oriented approach. Although objects generate their own classes, classes are not only more general but also include intensional information. By identifying concepts with classes, this can be regarded as a class-oriented approach --- an approach that has been advocated recently by Terry Winograd in the IETF-URI working group discussion on library standards and URI, and supported by Ronald Daniel and Dirk Herr-Hoyman.
Concept Analysis [Wi82] is a relatively new discipline arising out of the mathematical theory of lattices and category theory. It is closely related to the areas of knowledge representation in Computer Science and Cognitive Psychology. Concept Analysis provides for the automatic classification of both knowledge and documents via representation of a users faculty for interpretation as encoded in conceptual scales. Such conceptual scales correspond to the facets of synthetic classification schemes, such as Ranganathan's COLON scheme, in Library Science.
Concept Analysis uses objects, attributes and conceptual classes as its basic constituents. A conceptual class consists of any group of entities or objects exhibiting one or more common characteristics, traits or attributes. A characteristic is a conceptualized attribute by which classes may be identified and separated into a conceptual hierarchy, and further subdivided (specialized) by the facets of topic, form, location, chronology, etc. The ``has'' relationship between objects and attributes is represented as a binary relation called a formal context.
A formal context is a triple (G,M,I) consisting of two sets G and M and a binary incidence relation I element GxM between G and M. Intuitively, the elements of G are thought of as entities or objects, the elements of M are thought of as properties, characteristics or attributes that the objects might have, and gIm asserts that ``object g has attribute m.'' In many contexts appropriate for Web documents, the objects are documents and the attributes are any interesting properties of those documents.
The definition of a conceptual class must involve: the common attributes, which are encoded in the superordinate (next higher and more general class), and the distinguishing attributes, which differentiate the defined concept from the superordinate. Conceptual classes are logically characterized by their extension and intension.
The process of subordination of conceptual classes and collocation of objects exhibits a natural order, proceeding top-down from the more general classes with larger extension and smaller intension to the more specialized classes with smaller extension and larger intension. This order is called generalization-specialization. One class is more specialized (and less general) than another class, when its intent contains the other's intent, or equivalently, when the opposite ordering on extents occurs. Conceptual classes with this generalization-specialization ordering form a class hierarchy for the formal context. Knowledge is here represented as the hierarchical structure, or complete lattice, known as the lattice of conceptual classes of the formal context. Such lattices of classes can be visualized by line diagrams, where nodes represent conceptual classes and edges represent the subclass (subtype) relationship.
The join of a collection of conceptual classes represents the common attributes or shared characteristics of the classes. The bottom of the conceptual hierarchy (the empty join) represents the most specific class whose intent consists of all attributes and whose extent is often empty. The meet of a collection of conceptual classes represents the conjunction of all the attributes of the classes. The top of the conceptual hierarchy (the empty meet) represents the universal class whose extent consists of all objects. The entire conceptual class hierarchy is implicitly specified by the ``has'' relationship of the formal context. However, part of the hierarchy of conceptual classes could also be explicitly specified via the following top-down process.
Table 2 lists two generic interchange formats which can be used to specify faceted information in conceptually scaled networked information resources. Such faceted information can occur in various interfaces in a resource discovery system. From a mathematical viewpoint, these two representations are equivalent to each other. Software exists for converting between the two forms. The left side of table 2 displays the Formal Context Interchange Format (FCIF). FCIF is oriented towards the formal contexts of Concept Analysis. FCIF represents order-theoretic formal contexts of networked information resources, consisting of two partially ordered sets, a poset of objects and a poset of single-valued attributes, and an order-preserving incidence matrix which represents the has relationship between objects and attributes. The right side of table 2 displays the Concept Lattice Interchange Format (CLIF). CLIF is oriented towards the concept lattices of Concept Analysis. CLIF provides a storage-optimal representation of order-theoretic lattices of conceptual classes for networked information resource meta-information, consisting of (the inverse relationships for) two generator monotonic functions, from the posets of objects and attributes to the lattice of conceptual classes, and a successor matrix which represents the subtype relationship between conceptual classes.
In this section we demonstrate that FCIF and CLIF subsume both the Uniform Resource Characteristics of WHOIS++ and the Summary Object Interchange Format of HARVEST. These interchange formats are more general mechanisms than either URCs or SOIFs, and allow for the specification of more complex conceptually structured systems of resources. Actually, as Figure 3 points out, both FCIF and CLIF are better thought to occur after conceptual scaling, whereas both URC and SOIF specify ``raw meta-information'' which exists before conceptual scaling [GaWi89]. From the philosophical viewpoint of Concept Analysis, conceptual scaling is an act of interpretation, which maps raw uninterpreted data, such as occurs in URC or SOIF, to a user's view. URC and SOIF represent entity relations, whereas FCIF and CLIF represent incidence data between objects and attributes. These attributes are simple structured queries of the form tag#value, where # is any relational operator +, <=, etc. The equality operator represents nominal scaling, whereas the inequality operator <= represents ordinal scaling [GaWi89]. Through conceptual scaling, which often is just nominal or ordinal scaling, we can compare FCIF and CLIF with URC and SOIF.
The OBJECT section of the FCIF specifies, in a transposed fashion, a required ordering relationship between resources: Oi1 <= Oi2 iff Oi1 is listed in the row indexed by Oi2
This OBJECT section can specify both generalization-specialization and part-whole relationships between resources. When parts are typed, part-whole relationships can be embedded as generalization-specialization relationships --- the whole is a special case of an object which has that part. An important example of generalization-specialization relationships occurs in WHOIS++-URC. Here the instantiation relationship between an URN and its URL instances {URL1, URL2,...,URLu} is a generalization-specialization relationship. This instantiation relationship can be specified by adding a row indexed by the URN to the OBJECT section of the FCIF:
OBJECT ... URN {URL1,URL2,...,URLu} ...An important example of whole-part relationships occurs in HARVEST SOIF. Here the embedded relationship that occurs between an archived directory structure and one of its summarized files is a whole-part relationship. This directory-file whole-part relationship can be specified in an analogous way in the OBJECT section of the FCIF.
Figure 3: Conceptual Scaling with Various Interchange Formats
The example in the left side of table 3 of a URC is taken from the Internet Draft document [Me94b] of the IETF. This URC contains two instances of the resource whose title and author are given below the URN attribute. It describes the resource entitled ``Intro to FTP and Telnet'' by author Adam Arrowood, and is available via anonymous ftp in postscript form and via http as an HTML document. After suitable analysis and interpretation via conceptual scaling of this bibliographic record data, a conceptual structure such as in the right side of table 3 might be visualized. Here the lattice order from the URL object concepts to the URN object concept represents the precedence order in the URC. The conceptual class labeled by the author attribute is distinct from the URN object concept, since it represents all the resources which Adam Arrowood has created or authored. As is appropriate, the location is an attribute for the URLs, but not the URN.
From the viewpoint of Concept Analysis, both resources represented by URNs and instances of resources represented by URLs are objects, whereas meta-information in the form of <multi-valued, attribute> pairs are (single-valued) attributes. Using precedence rules URCs represent two kinds of relationships: the ``has'' or ``having'' relationship between a URI and its meta-information: and the ``instantiation'' relationship between a resource represented by an URN and one of its instances represented by a URL.
Since SOIF specifies only single objects, it is a special case of the FCIF format, and is not able to specify order information either between objects or between attributes. Table 4 demonstrates this fact. On the left side of table 4 is SOIF of some type TYPE. On the right side of table 4 is the corresponding FCIF format for this SOIF. The attributes here are the ones used to query the broker(s). The objects here are object summaries in the returned query results. We see that the SOIF specifies mainly the incidence matrix.
In the follow-up discussion of the IETF-URI working group Dirk Herr-Hoyman created the simple example in table 5 for the specification of a URC using TEI-like syntax. The nesting here verifies the simple fact that URLs are objects with their own attributes. Table 6 demonstrates how the URC data in table 5 might be represented in FCIF.
Stuart Weibel, in the charter for the OCLC-NCSA metadata workshop in March 1995, has discussed the desirability of a taxonomy of resource types. The complexity of resource meta-information depends upon its intended use. Too little information will fail to meet some purposes; too much information is a burden for systems and is costly to generate and maintain. This suggests the desirability, or even the necessity, for the development of a hierarchy of resource meta-information types. As Weibel suggests, resource meta-information might be promoted from a simple to a more complex level as the result of user demand or attention from a cataloging or archival agency. Levels of such a hierarchy might be defined by several criteria: purpose, cost, origin, etc.
BibTeX [Le79] provides a very simple example for a taxonomy of types of resource meta-information. Different types of publications require different information: journal article meta-information has a volume, but book meta-information does not. For each resource type in BibTeX, attribute tags are divided into required, optional, and ignored. Table 7 represents the BibTeX formal context, whose objects are the BibTeX entry types, and whose attributes are the required BibTeX fields (required BibTeX attribute tags). Figure 4 displays the line diagram of the lattice of classes for this BibTeX formal context.
Here, the ``misc'' BibTeX entry type labels the top lattice node, since it is the most general type in terms of required tags (fields) --- it has none. The BibTeX ``article'' entry type is more specialized than the BibTeX ``proceedings'' entry type, since in addition to ``year'' and ``title'' tags, it also requires ``journal'' and ``author'' tags. All absolutely non-required tags (tags not required by any type) label the bottom lattice node. This BibTeX concept lattice line diagram reveals an important idea: the discovery by conceptual analysis of new resource types --- the unlabelled lattice node in the center of the diagram represents an interesting but unspecified type, whose required tags are ``author'', ``title'', and ``year''.
Table 2: Interchange Formats for Faceted Resource Meta-information
Table 3: Conceptual Scaling of a URC
Table 4: SOIF and the corresponding
FCIF
Table 5: Specification of a
URC (TEI-SGML)
Table 6: Specification of a URC
(FCIF)
Table 7: formal context for BibTeX types and required tags
Figure 4: lattice of classes for BibTeX types and required tags
In this paper we have demonstrated that Concept Analysis provides a solid foundation for the development of a principled approach towards the specification of networked information resource meta-information. We have discussed in particular, how the meta-information in two well-known resource discovery services, WHOIS++ and HARVEST, can both be subsumed in a more general and well-structured approach which uses ideas from Concept Analysis. Future work will aim at developing a graphical user interface for a resource discovery vizualisation system built on top of these services.