Robert E. Kent
Assistant Professor
University of Arkansas
Little Rock, Arkansas 72204, USA
rekent@logos.ualr.edu
Christian Neuss
Researcher
Fraunhofer Institute for Computer Graphics,
Darmstadt, Germany
neuss@igd.fhg.de
The World-Wide Web has gained its amazing popularity through the availability of ``point and shoot'' browsing tools like Mosaic. They provide access to information sources all over the globe via a simple, easy-to-use graphical interface. The information space available to Web users is enormous, and now that editing tools offer word-processor-like ease of use, its growth rate will certainly accelerate. Thus, the user is in danger of drowning in an ocean of information - the experience of getting ``lost in hyperspace'' is probably already familiar to most Web citizens.
In order to provide a facility for resource discovery, so-called meta-index servers have been set up which maintain lists of references to other servers and resources. These references, however, become stale whenever documents move or are deleted. Thus, some servers have automated the process by retrieving and parsing documents at regular intervals. A popular example is the searchable CUI W3 catalog (at http://cui_www.unige.ch/w3catalog) maintained by the University of Geneva. Another approach is Martijn Koster's ALIWEB [Ko94], which creates an archie-like indexing facility: the ALIWEB server regularly retrieves index files from other servers and combines them into a searchable database. An interesting implementation of client-based search is the fish search [BrPo94] extension to Mosaic, developed at the Eindhoven University of Technology in the Netherlands: it enhances the Mosaic Web browser with a facility for robotic searches on a remote archive (see section 2). Use of a local cache avoids multiple accesses to the same document.
These tools allow for topic searches based upon keywords, and return a list of resource references meeting a given condition. However, since the result of a keyword search is often too large to be handled conveniently, there is a need for a more refined method of categorizing and managing information. By employing autonomous agents that retrieve and automatically analyze information from the Web, a sophisticated system for information retrieval and management can be created. This paper describes WAVE, a prototype implementation of a Web Analysis and Visualization Environment.
The process of automated document analysis can be divided into three phases: acquisition of raw data; analysis and classification; and visualization and interactive browsing.
The following sections will discuss in detail how these phases are implemented in WAVE.
An information analyzer needs to gather raw data from remote information repositories. We discuss here three approaches for this task: spiders, indexing tools, and software agents.
The most straightforward approach to gathering information is to use an automatic spider program to retrieve documents from a remote host. A Web spider program (also referred to as a Web wanderer or Web robot) is a routine that recursively retrieves documents. The term ``web wanderer'' is a bit misleading: these programs do not cross machine boundaries, but are executed on the local machine. In order to search a World-Wide Web archive, they have to transfer a whole document tree over the network. These programs are therefore very bandwidth-intensive and have an adverse impact on the server's processing resources. The bandwidth requirements can be reduced by having a robot program access a local proxy cache [LuAl94] instead of directly connecting to a remote site. Especially when re-analyzing a remote document repository, use of cache mechanisms significantly reduces the required bandwidth by limiting document transfers to those files that have actually changed in the meantime.
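A minimal sketch of such a spider, with a caller-supplied fetch function standing in for HTTP retrieval (or a proxy cache) and a seen-set playing the role of the local cache; the example ``site'' and its URLs are invented:

```python
from urllib.parse import urljoin
import re

def crawl(start_url, fetch, max_docs=100):
    """Recursively retrieve documents reachable from start_url.

    `fetch` maps a URL to document text; in a real spider it would
    issue HTTP requests, ideally through a proxy cache so unchanged
    documents are not re-transferred.  The `seen` dictionary ensures
    each URL is fetched at most once (a crude local cache)."""
    seen = {}
    queue = [start_url]
    while queue and len(seen) < max_docs:
        url = queue.pop(0)
        if url in seen:
            continue                      # cache hit: no second transfer
        html = fetch(url)
        seen[url] = html
        # Follow hypertext links found in the document (crude HREF scan).
        for href in re.findall(r'href="([^"]+)"', html, re.IGNORECASE):
            queue.append(urljoin(url, href))
    return seen

# Usage with a tiny in-memory "site" standing in for a remote archive:
site = {
    "http://a/":  '<a href="/b">b</a> <a href="/c">c</a>',
    "http://a/b": '<a href="/">home</a>',
    "http://a/c": 'leaf',
}
docs = crawl("http://a/", fetch=lambda u: site[u])
```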
A second possibility is the employment of automatically created indexes. Document indexing tools such as WAISINDEX [Kah91] or ICE [NeHo94] allow for keyword-based topic searches, but can also be used to provide a well-defined interface for operations such as returning the list of document URLs in the archive, together with attributes (attribute:value pairs) such as size, date of last modification, etc. The current implementation of WAVE is based upon an ICE index which contains many relevant attributes of a Web document. This approach gives maximum performance when accessing a local document repository, since it does not require an HTTP transaction. In order to be able to access other data repositories, interfaces to WAISINDEX and a simple robotic spider front-end have also been implemented. For example, some basic attributes of Web documents as provided by the ICE indexer are: the URL path component, the document title, the time of last modification (in seconds since Jan 1, 1970, the standard UNIX time format), and the file size in bytes. In addition, keywords matching a query are included, prefixed with their frequency. None of these attributes is specific to ICE: the same raw data could just as well have been retrieved through the WAISINDEX query interface or the spider front-end.
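Such attribute:value records can be turned into per-document dictionaries with a few lines of parsing; the field names below are illustrative, not the literal ICE output format:

```python
def parse_index_record(lines):
    """Parse one document record of attribute:value pairs, as an
    ICE- or WAISINDEX-style front-end might emit it."""
    record = {}
    for line in lines:
        attr, _, value = line.partition(":")
        record[attr.strip()] = value.strip()
    return record

record = parse_index_record([
    "url: /docs/wave.html",
    "title: WAVE overview",
    "mtime: 788918400",      # seconds since Jan 1, 1970
    "size: 10452",           # bytes
])
```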
Finally, there is a third possible mechanism for gathering information which seems a bit exotic at first glance: software agents capable of crossing the border between machines and executing their search task on the remote target. Such an agent could truly be called a ``web wanderer''. Since the search is executed on the target machine, only the results need to be transferred back, which greatly reduces the bandwidth requirements. Such agents can be implemented as scripts, interpreted by a ``script engine'' running in a safe environment on the remote host.
The next section will discuss how the raw data about Web documents, which has been gathered by devices such as described here, is preprocessed in order to generate conceptual scales to be used for the classification of Web documents.
Ideally, an information processing agent should be able to actually read and to a certain degree ``understand'' a document. Although recent developments by artificial intelligence researchers in the area of machine translation are promising, current technology is based upon the use of heuristic methods for the classification of documents.
In WAVE we combine concepts, techniques, and processes from both traditional Library Science (cataloging and classification) [Wy80] and the relatively new discipline of Concept Analysis [GaWi89]. Collections of Web documents should be arranged according to some system, and such an arrangement is referred to as a classification. The purpose of classification is to make each document readily available to the user. The goal of WAVE is the arrangement of Web documents into conceptual classes which exhibit all of the resemblances and differences essential to their full comprehension by the user.
A conceptual class consists of any group of entities or objects exhibiting one or more common characteristics, traits or attributes. A characteristic is a conceptualized attribute by which classes may be identified and separated into a conceptual hierarchy, and further subdivided (specialized) by the facets of topic, form, location, chronology, etc. The ``has'' relationship between objects and attributes is represented as a binary relation called a formal context. A formal context is a triple (G,M,I) consisting of two sets G and M and a binary incidence relation I between G and M. Intuitively, the elements of G are thought of as entities or objects, the elements of M are thought of as properties, characteristics or attributes that the objects might have, and gIm asserts that ``object g has attribute m.'' In many contexts appropriate for Web documents, the objects are documents and the attributes are any interesting properties of those documents.
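In code, a formal context is simply two sets together with a set of pairs; a minimal sketch, with invented document and attribute names:

```python
# A formal context (G, M, I): objects, attributes, incidence relation.
G = {"doc1", "doc2", "doc3"}
M = {"german", "english", "large"}
I = {("doc1", "german"), ("doc2", "english"),
     ("doc2", "large"), ("doc3", "english")}

def has(g, m):
    """gIm: object g has attribute m."""
    return (g, m) in I
```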
The definition of a conceptual class must involve: the common attributes, which are encoded in the superordinate (next higher and more general class), and the distinguishing attributes, which differentiate the defined concept from the superordinate. Conceptual classes are logically characterized by their extension and intension.
The process of subordination of conceptual classes and collocation of objects exhibits a natural order, proceeding top-down from the more general classes with larger extension and smaller intension to the more specialized classes with smaller extension and larger intension. This order is called generalization-specialization. One class is more specialized (and less general) than another class, when its intent contains the other's intent, or equivalently, when the opposite ordering on extents occurs. Conceptual classes with this generalization-specialization ordering form a class hierarchy for the formal context. Knowledge is here represented as the hierarchical structure known as a complete lattice, and called the concept lattice of the formal context.
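The derivation operators and the generalization-specialization ordering follow directly from these definitions; a sketch over a tiny invented context:

```python
G = {"a", "b", "c"}
M = {"x", "y", "z"}
I = {("a", "x"), ("b", "x"), ("b", "y"), ("c", "y"), ("c", "z")}

def intent(objs):
    """Attributes common to every object in objs."""
    return {m for m in M if all((g, m) in I for g in objs)}

def extent(attrs):
    """Objects possessing every attribute in attrs."""
    return {g for g in G if all((g, m) in I for m in attrs)}

def more_specialized(c1, c2):
    """Class c1 = (extent, intent) lies below c2 iff c1's intent
    contains c2's intent (equivalently, c1's extent is contained
    in c2's extent)."""
    return c1[1] >= c2[1]

c1 = (extent({"x", "y"}), {"x", "y"})   # smaller extent, larger intent
c2 = (extent({"x"}), {"x"})             # larger extent, smaller intent
```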
The join of a collection of conceptual classes represents the common attributes or shared characteristics of the classes. The bottom of the conceptual hierarchy (the empty join) represents the most specific class whose intent consists of all attributes and whose extent is often empty. The meet of a collection of conceptual classes represents the conjunction of all the attributes of the classes. The top of the conceptual hierarchy (the empty meet) represents the universal class whose extent consists of all objects. The entire conceptual class hierarchy is implicitly specified by the ``has'' relationship of the formal context. However, part of the hierarchy of conceptual classes could also be explicitly specified via the following top-down process [Wy80].
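For small contexts the entire conceptual class hierarchy can be enumerated by brute force, closing the intent of every object subset; a sketch with an invented context, exhibiting the top (universal) and bottom (most specific) classes:

```python
from itertools import combinations

G = ["d1", "d2", "d3"]
M = ["german", "english", "large"]
I = {("d1", "german"), ("d2", "english"), ("d2", "large"), ("d3", "english")}

def common_attrs(objs):
    """Attributes shared by every object in objs."""
    return frozenset(m for m in M if all((g, m) in I for g in objs))

def common_objs(attrs):
    """Objects having every attribute in attrs."""
    return frozenset(g for g in G if all((g, m) in I for m in attrs))

# Every concept arises as (common_objs(B), B) for a closed attribute
# set B; closing the intent of each object subset enumerates them all.
concepts = set()
for r in range(len(G) + 1):
    for objs in combinations(G, r):
        b = common_attrs(objs)
        concepts.add((common_objs(b), b))

top = (frozenset(G), common_attrs(G))    # universal class: all objects
bottom = (common_objs(M), frozenset(M))  # all attributes; extent empty here
```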
The general perspective of Concept Analysis fits closely with the ``information workspace'' paradigm, as described in [RCM93]. The interpretive act (called conceptual scaling in Concept Analysis) is all-important, since this is the way the user makes sense of his world of data. Interpretation can automatically and implicitly define the classification and categorization of data objects, and hence is more fundamental than classification. Interpretation is effected by conceptual scales. A natural approach toward the enrichment of interpretation and classification uses ideas from fuzzy sets and rough sets [Ke94].
A conceptual scale is a single isolated trait, property, or use, which is distinct from other characteristics. It is a kind of filter for a single conceptual dimension of data. We identify the notion of conceptual scale from Concept Analysis with the notion of facet from Library Science. The collection of conceptual scales in use (often chosen by the user) defines a certain view of the universe of Web documents. Here we list some examples of Web-related conceptual scales.
attribute                 domain
-----------------------------------------------------------
hemisphere                (eastern, western)
continent                 (..., europe, ..., america:north, ...)
country                   (..., britain, germany, united_states, ...)
state, province, land     (..., scotland, hessen, alaska, ...)
city                      (..., edinburgh, frankfurt, anchorage, ...)

Table 1: The multi-valued attributes for the Geographical Scale
...                  ...
continent:asia       hemisphere:eastern
...                  ...
country:britain      continent:europe
...                  ...
state:alaska         country:united_states
...                  ...
city:frankfurt       land:hessen

Table 2: The file "geographical.scale.tbl" for the Geographical Scale
city => state,
state => country,
country => continent,
continent => hemisphere.
These multi-valued attributes will each individually be nominally scaled, providing the notion of levels in the overall hierarchical scale. This hierarchical scale is assembled by instantiating the functional dependencies. The pairings of inclusion, which are placed in an incidence matrix file (*.tbl), are listed in Table 2. If these attributes (and objects) are sorted by level, from larger to smaller areas, the resulting incidence matrix will be block lower triangular. If we partition the attributes by level, resulting in a collection of conceptual scales, then by using nested line diagrams we can abstractly visualize the hierarchy. By dropping attributes right-to-left, we can effect a kind of abstraction-by-restriction.
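Instantiating the functional dependencies can be sketched as a walk up a parent table; pairing a value with each ancestor yields rows of the incidence matrix file. The place names follow Tables 1 and 2, but the exact *.tbl file format is not reproduced here:

```python
# Direct superordinate of each value, instantiating the dependencies
# city => state/land, state => country, country => continent,
# continent => hemisphere.
parent = {
    "city:frankfurt": "land:hessen",
    "land:hessen": "country:germany",
    "city:edinburgh": "state:scotland",
    "state:scotland": "country:britain",
    "country:germany": "continent:europe",
    "country:britain": "continent:europe",
    "continent:europe": "hemisphere:eastern",
    "state:alaska": "country:united_states",
    "country:united_states": "continent:america:north",
    "continent:america:north": "hemisphere:western",
}

def ancestors(value):
    """All more general values above `value`, one per level."""
    chain = []
    while value in parent:
        value = parent[value]
        chain.append(value)
    return chain
```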
Figure 1: Naming Scheme Scale
Conceptual scales can be combined in various ways. However, the most useful way for the classification of Web documents is by apposition of conceptual scales [GaWi89]. Apposition is a kind of product or conjunction of conceptual scales. It combines the various relevant and purposeful conceptual data dimensions into a clear, unambiguous aggregate which allows for several visual abstractions.
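Representing each scale as a formal context (objects, attributes, incidences), apposition over a common object set amounts to a union of the attribute sets and incidence relations; a minimal sketch with invented scales:

```python
def apposition(ctx1, ctx2):
    """Apposition of two conceptual scales over the same object set:
    the attribute sets are juxtaposed and the incidences merged, so
    every object carries both descriptions at once."""
    (G1, M1, I1), (G2, M2, I2) = ctx1, ctx2
    assert G1 == G2, "apposition requires a common object set"
    return (G1, M1 | M2, I1 | I2)

docs = {"d1", "d2"}
lang_scale = (docs, {"lang:en", "lang:de"},
              {("d1", "lang:de"), ("d2", "lang:en")})
size_scale = (docs, {"size:large"}, {("d1", "size:large")})
combined = apposition(lang_scale, size_scale)
```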
Figure 2: score & size scales: Ethnic Cooking
Table 3: Results of WAISINDEX search
For an example of apposition of conceptual scales, consider Table 3, which shows the results of a WAISINDEX search in the library usenet-cookbook.src using the keywords ``garlic'', ``fish'', ``rice'', and ``onion''. The documents are recipes, the score is the WAIS relative score, and the size is the number of lines in the recipe. This example was discussed by Ed Krol [Kr94] as an example of WAIS search. In Figure 2 the scores are scaled with an ordinal scale and the document size with an interordinal scale. Figure 3 combines the score and size scales of Figure 2 by apposition, using nesting to visualize in a line diagram (concept lattice) the conceptual scaling of the search results in Table 3.
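These two kinds of scaling can be sketched as follows: an ordinal scale records which thresholds a value reaches from below, while an interordinal scale bounds the value from both sides. The threshold values and recipe data below are invented for illustration, not Krol's actual figures:

```python
def ordinal_scale(name, value, thresholds):
    """Ordinal scaling: one attribute per threshold reached from below."""
    return {f"{name}>={t}" for t in thresholds if value >= t}

def interordinal_scale(name, value, thresholds):
    """Interordinal scaling: the value is bounded from both sides."""
    return ({f"{name}>={t}" for t in thresholds if value >= t}
            | {f"{name}<={t}" for t in thresholds if value <= t})

# Hypothetical (score, size-in-lines) pairs for two recipes:
recipes = {"fish-stew": (1000, 120), "garlic-rice": (640, 55)}
scaled = {r: ordinal_scale("score", sc, [250, 500, 750, 1000])
             | interordinal_scale("size", sz, [50, 100, 150])
          for r, (sc, sz) in recipes.items()}
```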
Figure 3: Conceptual Scaling of a search result: Ethnic Cooking
Based upon their structural properties, conceptual scales can be classified into several kinds. Conceptual scales (the mechanism for interpretation) can be implemented as autonomous interpretative agents. Conceptual scales can themselves be scaled by kind - interpretative meta-agents can control interpretative agents. From a utilitarian standpoint some of the more important kinds of scales are the following.
Referential classification [Wy80] is a pragmatic and empirical system in which objects are related with reference to a chosen collection of conceptual scales. In referential classification, various external relations, user preferences, and the environment, are all important to the act of interpretation and classification. Any Web document may be meaningful in any number of different relationships, depending upon the immediate purpose of the user. To support this flexible interpretation, classification in WAVE allows the user to define a new view which is based upon a different collection of conceptual scales. So WAVE is a referential classification system.
A facetted classification scheme tends to restrict explicit designations to single, unsubdivided classes. These are called meet-irreducible conceptual classes in the class hierarchy. This list of designations can be identified with attribute names, since all meet-irreducible classes are labelled by such names. Facetted classification schemes rely upon synthesis - the combining of various facets (conceptual scales) for the specification and construction of conceptual classes. The Colon classification scheme of Ranganathan is one example of facetted classification from Library Science. The WAVE system is another example.
The WAVE process of classifying a Web document with conceptual scales is an act of interpretation. This process, which constructs classes and embeds them in conceptual class hierarchies, involves the synthesis of conceptual classes using conceptual scales. It consists of the following steps.
Through the use of conceptual scaling the WAVE system builds up and synthesizes the features of Web documents - such as purpose, form, location, estimated access time, size, time of last revision, etc. - into a conceptual structure, a systematized and orderly whole.
The final step in automatic document analysis is the interactive presentation and exploration of results. A subset of facets can be chosen and starting from these, a local environment of related items can be explored. As an additional method for handling large amounts of data, a 3D browser can then be used to navigate in the information space.
When browsing a very large data repository, it can be desirable to select a set of starting objects and to interactively explore their neighborhood. This process consists of the following steps:
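The neighborhood expansion can be sketched as a bounded breadth-first traversal of the link graph; the graph below is invented for illustration:

```python
from collections import deque

def neighborhood(start_nodes, links, depth):
    """Expand a set of starting objects to all items reachable
    within `depth` link steps (breadth-first)."""
    seen = set(start_nodes)
    frontier = deque((n, 0) for n in start_nodes)
    while frontier:
        node, d = frontier.popleft()
        if d == depth:
            continue                      # do not expand past the bound
        for nxt in links.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return seen

links = {"a": ["b", "c"], "b": ["d"], "d": ["e"]}
```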
The increased processing speed of modern workstations and specialized graphics hardware allows for interactive 3D visualization and navigation, and for the creation of a 3-dimensional information workspace metaphor. Free navigation in the information space provides an intuitive means of centering in on interesting sub-areas (thus providing a fisheye view of the graph [Fu86]). At the same time, the use of 3D perspective allows more objects to be held on the screen than a 2-dimensional layout does. The interactive browsing component of WAVE uses a conventional graph layout algorithm to compute the coordinates of information nodes on a plane, which are then rendered as three-dimensional objects. Figure 4 shows a screenshot of the WAVE 3D browser. Size, shape, and color of these objects can be varied to reflect semantics, e.g. relevance or document size.
By selecting an object node, information associated with the object can be queried. Besides being able to freely move around in the 3D space, the user can also decide to let the system warp smoothly to a given information node. Future work on the WAVE navigator will be directed at including true three-dimensional layout of information; e.g. three-dimensional representation of hierarchies as ``information cone trees'' [RCM93].
In this paper we have introduced WAVE, a system for the conceptual analysis and automatic classification of documents on the World-Wide Web. We have described the process of constructing conceptual structures of Web documents in terms of three phases: acquisition of raw data; analysis and classification; and visualization and interactive browsing. By means of the theoretical underpinning provided by Concept Analysis, we have focussed on the close relationship between the WAVE system and the techniques of cataloging and classification in Library Science. More specifically, we have discussed the interpretative nature of the conceptual scaling process, called facetted classification in Library Science. Finally, we have given an overview of the user interface and visualization components of WAVE.
The most urgent task left unfinished at this time is work on the design and implementation of software for the creation of conceptual scales. Conceptual scales, which are known as facets in Library Science, provide the user with interpretations and views of data. As one might expect, the creation of conceptual scales is a social process. Users will often disagree about how to interpret data. There will often be strong contention between community standards and individual preferences. As a result, scale creation software must have a large user interface component.
Another very important task to be completed in the near future is the implementation of navigation over a true three-dimensional representation of conceptual hierarchies.