Robert E. Kent
Assistant Professor
University of Arkansas
Little Rock, Arkansas 72204, USA
rekent@logos.ualr.edu
Christian Neuss
Researcher
Fraunhofer Institute for Computer Graphics,
Darmstadt, Germany
neuss@igd.fhg.de
The World-Wide Web has gained its amazing popularity through the availability of ``point and shoot'' browsing tools like Mosaic. They provide access to information sources all over the globe via a simple, easy-to-use graphical interface. The information space available to Web users is enormous, and now that editing tools offer word-processor-like ease of use, its growth rate will certainly accelerate. Thus, the user is in danger of drowning in an ocean of information - the experience of getting ``lost in hyperspace'' is probably already familiar to most Web citizens.
In order to provide a facility for resource discovery, so-called meta-index servers have been set up which maintain lists of references to other servers and resources. These references, however, become stale whenever documents move or are deleted. Thus, some servers have automated the process by retrieving and parsing documents at regular intervals. A popular example is the searchable CUI W3 catalog (at http://cui_www.unige.ch/w3catalog) maintained by the University of Geneva. Another approach is Martijn Koster's ALIWEB [Ko94], which creates an archie-like indexing facility: the ALIWEB server regularly retrieves index files from other servers and combines them into a searchable database. An interesting implementation of client-based search is the fish search [BrPo94] extension to Mosaic, developed at the Eindhoven University of Technology in the Netherlands: it enhances the Mosaic Web browser with a facility for robotic searches on a remote archive (see section 2). Use of a local cache avoids multiple accesses to the same document.
These tools allow for topic searches based upon keywords, and return a list of resource references meeting a given condition. However, since the result of a keyword search is often too large to be handled conveniently, there is a need for a more refined method of categorizing and managing information. By employing autonomous agents that retrieve and automatically analyze information from the Web, a sophisticated system for information retrieval and management can be created. This paper describes WAVE, a prototype implementation of a Web Analysis and Visualization Environment.
The process of automated document analysis can be divided into three phases: acquisition of raw data; analysis and classification; and visualization and interactive browsing.
The following sections will discuss in detail how these phases are implemented in WAVE.
An information analyzer needs to gather raw data from remote information repositories. We discuss here three approaches for this task: spiders, indexing tools, and software agents.
The most straightforward approach to gathering information is to use an automatic spider program to retrieve documents from a remote host. A Web spider program (also referred to as a Web wanderer or Web robot) is a routine that recursively retrieves documents. The term ``web wanderer'' is a bit misleading: these programs do not cross machine boundaries, but are executed on the local machine. In order to search a World-Wide Web archive, they have to transfer a whole document tree over the network. These programs are therefore very bandwidth-intensive and have an adverse impact on the server's processing resources. The bandwidth requirements can be reduced by having a robot program access a local proxy cache [LuAl94] instead of directly connecting to a remote site. Especially when re-analyzing a remote document repository, use of cache mechanisms significantly reduces the required bandwidth by limiting document transfers to those files that have actually changed in the meantime.
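A minimal sketch of such a spider, with a caller-supplied fetch function standing in for HTTP retrieval (or a proxy cache) and a seen-set playing the role of the local cache; the example ``site'' and its URLs are invented:

```python
from urllib.parse import urljoin
import re

def crawl(start_url, fetch, max_docs=100):
    """Recursively retrieve documents reachable from start_url.

    `fetch` maps a URL to document text; in a real spider it would
    issue HTTP requests, ideally through a proxy cache so unchanged
    documents are not re-transferred.  The `seen` dictionary ensures
    each URL is fetched at most once (a crude local cache)."""
    seen = {}
    queue = [start_url]
    while queue and len(seen) < max_docs:
        url = queue.pop(0)
        if url in seen:
            continue                      # cache hit: no second transfer
        html = fetch(url)
        seen[url] = html
        # Follow hypertext links found in the document (crude HREF scan).
        for href in re.findall(r'href="([^"]+)"', html, re.IGNORECASE):
            queue.append(urljoin(url, href))
    return seen

# Usage with a tiny in-memory "site" standing in for a remote archive:
site = {
    "http://a/":  '<a href="/b">b</a> <a href="/c">c</a>',
    "http://a/b": '<a href="/">home</a>',
    "http://a/c": 'leaf',
}
docs = crawl("http://a/", fetch=lambda u: site[u])
```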
A second possibility is the employment of automatically created indexes. Document indexing tools such as WAISINDEX [Kah91] or ICE [NeHo94] allow for keyword-based topic searches, but can also be used to provide a well-defined interface for operations such as returning the list of document URLs in the archive, together with attributes (attribute:value pairs) such as size, date of last modification, etc. The current implementation of WAVE is based upon an ICE index which contains many relevant attributes of a Web document. This approach gives maximum performance when accessing a local document repository, since it does not require an HTTP transaction. In order to be able to access other data repositories, interfaces to WAISINDEX and a simple robotic spider front-end have also been implemented. For example, some basic attributes of Web documents as provided by the ICE indexer are: the URL path component, the document title, the time of last modification (in seconds since Jan 1, 1970, the standard UNIX time format), and the file size in bytes. In addition, keywords matching a query are included, prefixed with their frequency. None of these attributes is specific to ICE: the same raw data could just as well have been retrieved through the WAISINDEX query interface or the spider front-end.
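Such attribute:value records can be turned into per-document dictionaries with a few lines of parsing; the field names below are illustrative, not the literal ICE output format:

```python
def parse_index_record(lines):
    """Parse one document record of attribute:value pairs, as an
    ICE- or WAISINDEX-style front-end might emit it."""
    record = {}
    for line in lines:
        attr, _, value = line.partition(":")
        record[attr.strip()] = value.strip()
    return record

record = parse_index_record([
    "url: /docs/wave.html",
    "title: WAVE overview",
    "mtime: 788918400",      # seconds since Jan 1, 1970
    "size: 10452",           # bytes
])
```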
Finally, there is a third possible mechanism for gathering information which seems a bit exotic at first glance: software agents capable of crossing the border between machines and executing their search task on the remote target. Such an agent could truly be called a ``web wanderer''. Since the search is executed on the target machine, only the results need to be transferred back, which greatly reduces the bandwidth requirements. Such agents can be implemented as scripts, interpreted by a ``script engine'' running in a safe environment on the remote host.
The next section will discuss how the raw data about Web documents, which has been gathered by devices such as described here, is preprocessed in order to generate conceptual scales to be used for the classification of Web documents.
Ideally, an information processing agent should be able to actually read and to a certain degree ``understand'' a document. Although recent developments by artificial intelligence researchers in the area of machine translation are promising, current technology is based upon the use of heuristic methods for the classification of documents.
In WAVE we combine concepts, techniques, and processes from both traditional Library Science (cataloging and classification) [Wy80] and the relatively new discipline of Concept Analysis [GaWi89]. Collections of Web documents should be arranged according to some system, and such an arrangement is referred to as a classification. The purpose of classification is to make each document readily available to the user. The goal of WAVE is the arrangement of Web documents into conceptual classes which exhibit all of the resemblances and differences essential to their full comprehension by the user.
A conceptual class consists of any group of entities or objects exhibiting one or more common characteristics, traits or attributes. A characteristic is a conceptualized attribute by which classes may be identified and separated into a conceptual hierarchy, and further subdivided (specialized) by the facets of topic, form, location, chronology, etc. The ``has'' relationship between objects and attributes is represented as a binary relation called a formal context. A formal context is a triple (G,M,I) consisting of two sets G and M and a binary incidence relation I between G and M. Intuitively, the elements of G are thought of as entities or objects, the elements of M are thought of as properties, characteristics or attributes that the objects might have, and gIm asserts that ``object g has attribute m.'' In many contexts appropriate for Web documents, the objects are documents and the attributes are any interesting properties of those documents.
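In code, a formal context is simply two sets together with a set of pairs; a minimal sketch, with invented document and attribute names:

```python
# A formal context (G, M, I): objects, attributes, incidence relation.
G = {"doc1", "doc2", "doc3"}
M = {"german", "english", "large"}
I = {("doc1", "german"), ("doc2", "english"),
     ("doc2", "large"), ("doc3", "english")}

def has(g, m):
    """gIm: object g has attribute m."""
    return (g, m) in I
```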
The definition of a conceptual class must involve: the common attributes, which are encoded in the superordinate (next higher and more general class), and the distinguishing attributes, which differentiate the defined concept from the superordinate. Conceptual classes are logically characterized by their extension and intension.
The process of subordination of conceptual classes and collocation of objects exhibits a natural order, proceeding top-down from the more general classes with larger extension and smaller intension to the more specialized classes with smaller extension and larger intension. This order is called generalization-specialization. One class is more specialized (and less general) than another class, when its intent contains the other's intent, or equivalently, when the opposite ordering on extents occurs. Conceptual classes with this generalization-specialization ordering form a class hierarchy for the formal context. Knowledge is here represented as the hierarchical structure known as a complete lattice, and called the concept lattice of the formal context.
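The derivation operators and the generalization-specialization ordering follow directly from these definitions; a sketch over a tiny invented context:

```python
G = {"a", "b", "c"}
M = {"x", "y", "z"}
I = {("a", "x"), ("b", "x"), ("b", "y"), ("c", "y"), ("c", "z")}

def intent(objs):
    """Attributes common to every object in objs."""
    return {m for m in M if all((g, m) in I for g in objs)}

def extent(attrs):
    """Objects possessing every attribute in attrs."""
    return {g for g in G if all((g, m) in I for m in attrs)}

def more_specialized(c1, c2):
    """Class c1 = (extent, intent) lies below c2 iff c1's intent
    contains c2's intent (equivalently, c1's extent is contained
    in c2's extent)."""
    return c1[1] >= c2[1]

c1 = (extent({"x", "y"}), {"x", "y"})   # smaller extent, larger intent
c2 = (extent({"x"}), {"x"})             # larger extent, smaller intent
```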
The join of a collection of conceptual classes represents the common attributes or shared characteristics of the classes. The bottom of the conceptual hierarchy (the empty join) represents the most specific class whose intent consists of all attributes and whose extent is often empty. The meet of a collection of conceptual classes represents the conjunction of all the attributes of the classes. The top of the conceptual hierarchy (the empty meet) represents the universal class whose extent consists of all objects. The entire conceptual class hierarchy is implicitly specified by the ``has'' relationship of the formal context. However, part of the hierarchy of conceptual classes could also be explicitly specified via the following top-down process [Wy80].
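For small contexts the entire conceptual class hierarchy can be enumerated by brute force, closing the intent of every object subset; a sketch with an invented context, exhibiting the top (universal) and bottom (most specific) classes:

```python
from itertools import combinations

G = ["d1", "d2", "d3"]
M = ["german", "english", "large"]
I = {("d1", "german"), ("d2", "english"), ("d2", "large"), ("d3", "english")}

def common_attrs(objs):
    """Attributes shared by every object in objs."""
    return frozenset(m for m in M if all((g, m) in I for g in objs))

def common_objs(attrs):
    """Objects having every attribute in attrs."""
    return frozenset(g for g in G if all((g, m) in I for m in attrs))

# Every concept arises as (common_objs(B), B) for a closed attribute
# set B; closing the intent of each object subset enumerates them all.
concepts = set()
for r in range(len(G) + 1):
    for objs in combinations(G, r):
        b = common_attrs(objs)
        concepts.add((common_objs(b), b))

top = (frozenset(G), common_attrs(G))    # universal class: all objects
bottom = (common_objs(M), frozenset(M))  # all attributes; extent empty here
```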
The general perspective of Concept Analysis fits closely with the ``information workspace'' paradigm, as described in [RCM93]. The interpretive act (called conceptual scaling in Concept Analysis) is all-important, since this is the way the user makes sense of his world of data. Interpretation can automatically and implicitly define the classification and categorization of data objects, and hence is more fundamental than classification. Interpretation is effected by conceptual scales. A natural approach toward the enrichment of interpretation and classification uses ideas from fuzzy sets and rough sets [Ke94].
A conceptual scale is a single isolated trait, property, or use, which is distinct from other characteristics. It is a kind of filter for a single conceptual dimension of data. We identify the notion of conceptual scale from Concept Analysis with the notion of facet from Library Science. The collection of conceptual scales in use (often chosen by the user) defines a certain view of the universe of Web documents. Here we list some examples of Web-related conceptual scales.
attribute                 domain
-----------------------------------------------------------
hemisphere                (eastern, western)
continent                 (..., europe, ..., america:north, ...)
country                   (..., britain, germany, united_states, ...)
state, province, land     (..., scotland, hessen, alaska, ...)
city                      (..., edinburgh, frankfurt, anchorage, ...)

Table 1: The multi-valued attributes for the Geographical Scale
...                  ...
continent:asia       hemisphere:eastern
...                  ...
country:britain      continent:europe
...                  ...
state:alaska         country:united_states
...                  ...
city:frankfurt       land:hessen

Table 2: The file "geographical.scale.tbl" for the Geographical Scale
city => state,
state => country,
country => continent,
continent => hemisphere.
These multi-valued attributes will each individually be nominally scaled, providing the notion of levels in the overall hierarchical scale. This hierarchical scale is assembled by instantiating the functional dependencies. The pairings of inclusion, which are placed in an incidence matrix file (*.tbl), are listed in Table 2. If these attributes (and objects) are sorted by level, from larger to smaller areas, the resulting incidence matrix will be block lower triangular. If we partition the attributes by level, resulting in a collection of conceptual scales, then by using nested line diagrams we can abstractly visualize the hierarchy. By dropping attributes right-to-left, we can effect a kind of abstraction-by-restriction.
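Instantiating the functional dependencies can be sketched as a walk up a parent table; pairing a value with each ancestor yields rows of the incidence matrix file. The place names follow Tables 1 and 2, but the exact *.tbl file format is not reproduced here:

```python
# Direct superordinate of each value, instantiating the dependencies
# city => state/land, state => country, country => continent,
# continent => hemisphere.
parent = {
    "city:frankfurt": "land:hessen",
    "land:hessen": "country:germany",
    "city:edinburgh": "state:scotland",
    "state:scotland": "country:britain",
    "country:germany": "continent:europe",
    "country:britain": "continent:europe",
    "continent:europe": "hemisphere:eastern",
    "state:alaska": "country:united_states",
    "country:united_states": "continent:america:north",
    "continent:america:north": "hemisphere:western",
}

def ancestors(value):
    """All more general values above `value`, one per level."""
    chain = []
    while value in parent:
        value = parent[value]
        chain.append(value)
    return chain
```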
Figure 1: Naming Scheme Scale
Conceptual scales can be combined in various ways. However, the most useful way for the classification of Web documents is by apposition of conceptual scales [GaWi89]. Apposition is a kind of product or conjunction of conceptual scales. It combines the various relevant and purposeful conceptual data dimensions into a clear, unambiguous aggregate which allows for several visual abstractions.
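Representing each scale as a formal context (objects, attributes, incidences), apposition over a common object set amounts to a union of the attribute sets and incidence relations; a minimal sketch with invented scales:

```python
def apposition(ctx1, ctx2):
    """Apposition of two conceptual scales over the same object set:
    the attribute sets are juxtaposed and the incidences merged, so
    every object carries both descriptions at once."""
    (G1, M1, I1), (G2, M2, I2) = ctx1, ctx2
    assert G1 == G2, "apposition requires a common object set"
    return (G1, M1 | M2, I1 | I2)

docs = {"d1", "d2"}
lang_scale = (docs, {"lang:en", "lang:de"},
              {("d1", "lang:de"), ("d2", "lang:en")})
size_scale = (docs, {"size:large"}, {("d1", "size:large")})
combined = apposition(lang_scale, size_scale)
```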
Figure 2: score & size scales: Ethnic Cooking
Table 3: Results of WAISINDEX search
For an example of apposition of conceptual scales, consider Table 3, which shows the results of a WAISINDEX search in the library usenet-cookbook.src using the keywords ``garlic'', ``fish'', ``rice'', and ``onion''. The documents are recipes, the score is the WAIS relative score, and the size is the number of lines in the recipe. This example was discussed by Ed Krol [Kr94] as an example of WAIS search. In Figure 2 the scores are scaled with an ordinal scale and the document size with an interordinal scale. Figure 3 combines the score and size scales of Figure 2 by apposition, using nesting to visualize in a line diagram (concept lattice) the conceptual scaling of the search results in Table 3.
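These two kinds of scaling can be sketched as follows: an ordinal scale records which thresholds a value reaches from below, while an interordinal scale bounds the value from both sides. The threshold values and recipe data below are invented for illustration, not Krol's actual figures:

```python
def ordinal_scale(name, value, thresholds):
    """Ordinal scaling: one attribute per threshold reached from below."""
    return {f"{name}>={t}" for t in thresholds if value >= t}

def interordinal_scale(name, value, thresholds):
    """Interordinal scaling: the value is bounded from both sides."""
    return ({f"{name}>={t}" for t in thresholds if value >= t}
            | {f"{name}<={t}" for t in thresholds if value <= t})

# Hypothetical (score, size-in-lines) pairs for two recipes:
recipes = {"fish-stew": (1000, 120), "garlic-rice": (640, 55)}
scaled = {r: ordinal_scale("score", sc, [250, 500, 750, 1000])
             | interordinal_scale("size", sz, [50, 100, 150])
          for r, (sc, sz) in recipes.items()}
```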
Figure 3: Conceptual Scaling of a search result: Ethnic Cooking
Based upon their structural properties, conceptual scales can be classified into several kinds. Conceptual scales (the mechanism for interpretation) can be implemented as autonomous interpretative agents. Conceptual scales can themselves be scaled by kind - interpretative meta-agents can control interpretative agents. From a utilitarian standpoint some of the more important kinds of scales are the following.
Referential classification [Wy80] is a pragmatic and empirical system in which objects are related with reference to a chosen collection of conceptual scales. In referential classification, various external relations, user preferences, and the environment, are all important to the act of interpretation and classification. Any Web document may be meaningful in any number of different relationships, depending upon the immediate purpose of the user. To support this flexible interpretation, classification in WAVE allows the user to define a new view which is based upon a different collection of conceptual scales. So WAVE is a referential classification system.
A facetted classification scheme tends to restrict explicit designations to single, unsubdivided classes. These are called meet-irreducible conceptual classes in the class hierarchy. This list of designations can be identified with attribute names, since all meet-irreducible classes are labelled by such names. Facetted classification schemes rely upon synthesis - the combining of various facets (conceptual scales) for the specification and construction of conceptual classes. The Colon classification scheme of Ranganathan is one example of facetted classification from Library Science. The WAVE system is another example.
The WAVE process of classifying a Web document with conceptual scales is an act of interpretation. This process, which constructs classes and embeds them in conceptual class hierarchies, involves the synthesis of conceptual classes using conceptual scales. It consists of the following steps.
Through the use of conceptual scaling the WAVE system builds up and synthesizes the features of Web documents - such as purpose, form, location, estimated access time, size, time of last revision, etc. - into a conceptual structure, a systematized and orderly whole.
The final step in automatic document analysis is the interactive presentation and exploration of results. A subset of facets can be chosen and starting from these, a local environment of related items can be explored. As an additional method for handling large amounts of data, a 3D browser can then be used to navigate in the information space.
When browsing a very large data repository, it can be desirable to select a set of starting objects and to interactively explore their neighborhood. This process consists of the following steps:
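The neighborhood expansion can be sketched as a bounded breadth-first traversal of the link graph; the graph below is invented for illustration:

```python
from collections import deque

def neighborhood(start_nodes, links, depth):
    """Expand a set of starting objects to all items reachable
    within `depth` link steps (breadth-first)."""
    seen = set(start_nodes)
    frontier = deque((n, 0) for n in start_nodes)
    while frontier:
        node, d = frontier.popleft()
        if d == depth:
            continue                      # do not expand past the bound
        for nxt in links.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return seen

links = {"a": ["b", "c"], "b": ["d"], "d": ["e"]}
```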
The increased processing speed of modern workstations and specialized graphics hardware allows for interactive 3D visualization and navigation, and for the creation of a 3-dimensional information workspace metaphor. Free navigation in the information space provides an intuitive means of centering in on interesting sub-areas (thus providing a fisheye view of the graph [Fu86]). At the same time, the use of 3D perspective allows more objects to be held on the screen than a 2-dimensional layout does. The interactive browsing component of WAVE uses a conventional graph layout algorithm to compute the coordinates of information nodes on a plane, which are then rendered as three-dimensional objects. Figure 4 shows a screenshot of the WAVE 3D browser. Size, shape, and color of these objects can be varied to reflect semantics, e.g. relevance or document size.
By selecting an object node, information associated with the object can be queried. Besides being able to freely move around in the 3D space, the user can also decide to let the system warp smoothly to a given information node. Future work on the WAVE navigator will be directed at including true three-dimensional layout of information; e.g. three-dimensional representation of hierarchies as ``information cone trees'' [RCM93].
In this paper we have introduced WAVE, a system for the conceptual analysis and automatic classification of documents on the World-Wide Web. We have described the process of constructing conceptual structures of Web documents in terms of three phases: acquisition of raw data; analysis and classification; and visualization and interactive browsing. By means of the theoretical underpinning provided by Concept Analysis, we have focussed on the close relationship between the WAVE system and the techniques of cataloging and classification in Library Science. More specifically, we have discussed the interpretative nature of the conceptual scaling process, called facetted classification in Library Science. Finally, we have given an overview of the user interface and visualization components of WAVE.
The most urgent task left unfinished at this time is work on the design and implementation of software for the creation of conceptual scales. Conceptual scales, which are known as facets in Library Science, provide the user with interpretations and views of data. As one might expect, the creation of conceptual scales is a social process. Users will often disagree about how to interpret data. There will often be strong contention between community standards and individual preferences. As a result, scale creation software must have a large user interface component.
Another very important task to be completed in the near future is the implementation of navigation over a true three-dimensional representation of conceptual hierarchies.