Weblets: Fundamental Building Blocks for WWW Tools

Thomas J. Watt Jr. OSF/Research Institute 11 Cambridge Center Cambridge, MA 02142-1405 U.S.A.
watt@osf.org, URL: http://riwww.osf.org:8001/~watt
Abstract:

All distributed hypertext infostructures [Til94] have a connected structure. A weblet is a concept which captures this structure as a directed graph (DG) [Ber73]. This paper describes the design of a tool to capture the local structure and remote references of any Web delivered under an HTTP-compatible server; describes its applications; projects a future Web re-structuring tool; and mentions future advanced functionality. The Weblet tool itself operates without an HTTP [HTT94] server.

Keywords:

directed graph; distributed hypertext infostructure; web file system maintenance; weblet; web re-structuring tool.

1. Web Structure.

The structure of any Web is basically that of a DG and, in almost all instances, a cyclic graph. This cyclic nature helps Web browsing agents navigate among different traversal points of interest, perhaps aided by computable predicates. Any representation of Web structure must retain this cyclic nature. Computationally, the requirement is to avoid being trapped by the cycles.
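The cycle-avoidance requirement above can be sketched in a few lines. This is an illustrative Python sketch, not the tool's actual implementation; the adjacency-list form and the node names are hypothetical, mirroring a small cyclic Web:

```python
# Sketch: cycle-safe traversal of a Web modeled as a directed graph.
# A "visited" set is what keeps the traversal from being trapped by cycles.
def reachable(graph, start):
    """Return every node reachable from start, terminating despite cycles."""
    visited = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in visited:
            continue  # already handled: this check breaks the cyclic trap
        visited.add(node)
        stack.extend(graph.get(node, []))
    return visited

web = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}  # deliberately cyclic
print(sorted(reachable(web, "a")))  # → ['a', 'b', 'c']
```

Without the visited set, the a -> b -> a cycle would loop forever; with it, the traversal visits each document exactly once.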

2. Weblet Definition.

A weblet is the set of documents reachable from some starting set by hyperlinks satisfying some given criteria. A weblet expresses the notion of one or more hypertext-referenced objects, e.g. from a single HTML [HTM94] page to a complete local Web. A weblet is therefore equivalent to any connected subgraph of a Web, including the Web itself, since a Web is a connected subgraph of itself.
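The definition above is directly computable: a weblet is the closure of a starting set under hyperlinks that satisfy a criterion. The sketch below is hypothetical Python (the file names and the `only_html` predicate are illustrative assumptions, not part of the original tool):

```python
def weblet(graph, starts, criterion=lambda src, dst: True):
    """Documents reachable from a starting set via links meeting a criterion."""
    seen = set(starts)
    frontier = list(starts)
    while frontier:
        src = frontier.pop()
        for dst in graph.get(src, []):
            if criterion(src, dst) and dst not in seen:
                seen.add(dst)
                frontier.append(dst)
    return seen

# Hypothetical two-page Web: one link points at an image, not a document.
web = {"index.html": ["a.html", "pic.gif"], "a.html": ["index.html"]}
only_html = lambda src, dst: dst.endswith(".html")
print(sorted(weblet(web, {"index.html"}, only_html)))  # → ['a.html', 'index.html']
```

With the trivial criterion the result is the full reachable subgraph; a stricter predicate (here, HTML files only) carves out a smaller weblet.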

2.1 Abstract Weblet.

Figure 1 illustrates the notion of an abstract weblet. The flow is generally from the top down; the curved lines represent backward references, and the straight lines to the side represent remote references. Back references to the root node "a" exist on the paths between "e" and "a" and between "g" and "a". The relations g->f and f->e imply the transitive relation g=>e, and the relations e->f and f->g imply e=>g.

2.2 Concrete Weblet.

Figure 2 depicts a concrete representation of an abstract weblet. As shown in the figure, HTML files are represented as chunks and chained together in a doubly linked chunk list. All chunks refer back to the Root chunk. Hyperlinks are chained off the chunks; each hyperlink is doubly linked, and all hyperlinks refer back to the chunk that contains them.
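The chunk-and-hyperlink layout described above can be sketched with back pointers made explicit. This is a minimal Python sketch under the assumptions stated in the text (doubly linked chunk list, per-chunk hyperlink chain, back references to the Root chunk); the class and field names are hypothetical:

```python
class Hyperlink:
    """One hyperlink: doubly linked to its siblings, back-pointer to its chunk."""
    def __init__(self, target, chunk):
        self.target = target
        self.chunk = chunk               # back reference to the containing chunk
        self.prev = self.next = None

class Chunk:
    """One HTML file: doubly linked into the chunk list, back-pointer to Root."""
    def __init__(self, path, root=None):
        self.path = path
        self.root = root or self         # every chunk refers back to the Root chunk
        self.prev = self.next = None
        self.links = []                  # hyperlinks chained off this chunk

    def add_link(self, target):
        link = Hyperlink(target, self)
        if self.links:                   # maintain the doubly linked sibling chain
            link.prev = self.links[-1]
            self.links[-1].next = link
        self.links.append(link)
        return link

root = Chunk("index.html")
child = Chunk("a.html", root=root)
root.next, child.prev = child, root      # chain the chunks together
link = root.add_link("a.html")
print(link.chunk is root, child.root is root)  # → True True
```

The back pointers are what make later operations cheap: given any hyperlink, the containing chunk (and from it the Root) is one dereference away.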

3. The Weblet Tool Design.

3.1 Basic Design.

The design of the Weblet tool is straightforward. All locally known servers are encoded in a global table which is searched at the start of the program. Starting from any accessible HTML document in the Web, the tool creates a chunk: it reads the file into a buffer, scans it for hyperlinks, and creates and links the hyperlink data structures to the chunk. After creating the hyperlink list hanging off the chunk, the tool rescans it, normalizing relative names, testing and marking each hyperlink by type, and ignoring duplicates.
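The scan-normalize-deduplicate step might look as follows. This is an illustrative Python sketch, not the tool's high-performance scanner; the regular expression and the `base_dir` convention are assumptions:

```python
import posixpath
import re

# Simplified href matcher; the real scanner handles more HTML anchor forms.
HREF = re.compile(r'<a\s+[^>]*href\s*=\s*"([^"]+)"', re.IGNORECASE)

def scan_hyperlinks(buffer, base_dir):
    """Extract hrefs from an HTML buffer, normalize relative names, drop dupes."""
    seen, links = set(), []
    for target in HREF.findall(buffer):
        if "://" in target:
            norm = target                # remote reference: leave untouched
        else:
            # Resolve the relative name against the document's directory.
            norm = posixpath.normpath(posixpath.join(base_dir, target))
        if norm not in seen:             # ignore duplicates after normalization
            seen.add(norm)
            links.append(norm)
    return links

page = '<a href="sub/../a.html">A</a> <a href="a.html">dup</a>'
print(scan_hyperlinks(page, "/web"))     # → ['/web/a.html']
```

Note that deduplication happens after normalization, so two spellings of the same relative path collapse to one hyperlink record.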

During the hyperlink scan of the chunk file, after resolving an appropriate path name for a hyperlinked file, we test for the file's existence with a stat call. Any file that does not exist, or has been renamed or moved to another directory, falls out of this test as a broken hyperlink. Remote hyperlinks are ignored in the initial version of the code.
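The stat test translates directly; in Python the equivalent of a failed stat call is an `OSError` from `os.stat` (a sketch, with a hypothetical path):

```python
import os

def is_broken(path):
    """A hyperlinked local file that fails the stat test is a broken hyperlink."""
    try:
        os.stat(path)
        return False
    except OSError:                      # ENOENT: file missing, renamed, or moved
        return True

print(is_broken("/no/such/file.html"))   # → True
```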

When the hyperlink rescan on the first chunk is complete, and if any hyperlinks were found, a main loop is entered. For each hyperlink that has not been marked to be ignored (as is the case for, e.g., .gif, .ps, and other non-HTML file types), that refers to a file present in the file system, and whose filename has a suffix of either ".htm" or ".html", we create another chunk, link it to the previous chunk, and repeat all of the above processing.
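The main loop can be sketched end to end. This is a hypothetical Python condensation (chunk creation is reduced to a path list, and `read_links` stands in for the scanner), not the original implementation:

```python
import os
import tempfile

IGNORE = (".gif", ".ps")                 # hyperlink types marked to be ignored

def build_weblet(start, read_links):
    """Main loop sketch: follow .htm/.html links, collecting broken ones."""
    chunks, broken, pending = [], [], [start]
    seen = {start}
    while pending:
        path = pending.pop()
        if not os.path.exists(path):     # the stat test from the previous step
            broken.append(path)
            continue
        chunks.append(path)              # stands in for chunk creation/linking
        for target in read_links(path):
            if target.endswith(IGNORE) or target in seen:
                continue
            if target.endswith((".htm", ".html")):
                seen.add(target)
                pending.append(target)
    return chunks, broken

# Usage on a tiny throwaway Web: two real pages plus one dangling link.
with tempfile.TemporaryDirectory() as d:
    a, b = os.path.join(d, "a.html"), os.path.join(d, "b.html")
    open(a, "w").close(); open(b, "w").close()
    missing = os.path.join(d, "gone.html")
    links = {a: [b, missing], b: [a]}
    chunks, broken = build_weblet(a, lambda p: links.get(p, []))
    print(len(chunks), broken == [missing])  # → 2 True
```

The `seen` set plays the same role here as in the traversal sketch earlier: it is what lets the loop terminate on a cyclic Web while still emitting both outputs (the broken list and the connected list).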

When we have discovered the last item in our Web, two outputs are emitted:
  1. a list of broken hyperlinks, and
  2. a list of connected files in the Web.
The list of connected files may be overridden to supply other information via command line options, not described here.

3.2 Weblet Tool Components and Data Structures.

The Weblet tool is designed with abstract data types around the following data structures:
  1. A domain_port for server associated data,
  2. A hyperlink data structure, and
  3. A chunk data structure for HTML documents.
The operational components consist of several routines, including a high-performance hyperlink scanner; two hash table routines, applied separately to hyperlinks and to HTML chunks; and a set of routines built around the three main data structures.

4. Current Status of the Weblet Tool.

The Weblet tool has been under part-time development since mid-September 1994. It is an experimental research prototype. We have implemented several enhancements suggested for the Weblet Tool in a previous draft of this paper. They were:
  1. Capture bracketed hyperlinks
  2. Command line options
  3. Capture every unique hyperlink reference chain
  4. Remote hyperlink liveness detection and validation
  5. Flatten the DG structure and provide memory-to/from-disk read/write routines
The first three are complete; we are currently working on the fourth, with some routines and algorithms ready for the fifth.
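The fifth enhancement, flattening the DG for disk I/O, amounts to replacing in-memory pointers (which cannot be written out directly, especially in a cyclic graph) with indices into a node table. A hypothetical Python sketch of one such flattening, using JSON purely as a stand-in disk format:

```python
import json
import os
import tempfile

def flatten(graph):
    """Flatten a cyclic DG into a pointer-free, disk-writable form:
    a node table plus an explicit edge list of index pairs."""
    nodes = sorted(graph)
    index = {n: i for i, n in enumerate(nodes)}
    edges = [(index[s], index[d]) for s in nodes for d in graph[s]]
    return {"nodes": nodes, "edges": edges}

def unflatten(flat):
    """Rebuild the in-memory graph from the flat on-disk form."""
    graph = {n: [] for n in flat["nodes"]}
    for s, d in flat["edges"]:
        graph[flat["nodes"][s]].append(flat["nodes"][d])
    return graph

web = {"a": ["b"], "b": ["a"]}           # cyclic: raw pointer dumps would loop
path = os.path.join(tempfile.gettempdir(), "weblet.json")
with open(path, "w") as f:
    json.dump(flatten(web), f)           # memory-to-disk
with open(path) as f:
    print(unflatten(json.load(f)) == web)  # disk-to-memory → True
```

The index-pair representation is what makes the round trip safe for cyclic structure: cycles become ordinary repeated indices rather than infinite pointer chases.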

There is a high level WWW FORMS interface for users in our group.

5. Weblet Futures.

5.1. A Web Re-Structuring Tool.

Once the foundational changes above are in place, we intend to develop a Web re-structuring tool.

The basic idea is to provide a graphical user interface (GUI) for authors which not only depicts a graphical structure map of an existing Web, but also provides browsing and editing operations such as cut-and-paste and drag-and-drop. When a move widget selects a hyperlinked object from somewhere in the Web and its movement to somewhere else is previewed graphically, the hyperlink reference chain information collected for that object should be able to denote the location of every broken hyperlink the move would produce.

We believe the tool should allow multiple concurrent authoring sessions on the same local Web, with appropriate locking functionality.

5.2. Advanced Functionality.

A structure map of a Web can serve a multitude of purposes:
  1. It determines the reachability of Web hyperlink references and precisely delineates Web components. This information is useful for web file system maintenance.
  2. It can support the constant authoring process by providing a basis for a Web re-structuring tool.
  3. It can act as a gathering point for information in support of agent-oriented servers.
  4. It can provide the synergy of an extensible framework to support groupware collaboration.
  5. It can support the evolution of WWW Tool capabilities overall by providing an information abundant foundation for a repository with rapid query response.

A server-oriented version of the tool should cooperate with agents and other servers to provide some of these capabilities.


References.

[Ber73]
C. Berge. Graphs and Hypergraphs. Published 1973, North-Holland Publishing Co, Inc. Amsterdam.
[HTM94]
Tim Berners-Lee, CERN and Daniel Connolly, HAL. Hypertext Markup Language Specification - 2.0. Internet draft, October 13, 1994. Published at Second International WWW Conference 1994.
[HTT94]
T. Berners-Lee, R.T. Fielding, H. Frystyk Nielsen. Hypertext Transfer Protocol - HTTP/1.0. Internet draft, Expires May 28, 1995.
[Til94]
James "Eric" Tilton. What is an Infostructure? Published on the WWW at http://www.willamett.edu/~jtilton/info-p.html, dated October 23, 1994.

Figures.