Weblets: Fundamental Building Blocks for WWW Tools
Thomas J. Watt Jr.
OSF/Research Institute
11 Cambridge Center
Cambridge, MA 02142-1405
U.S.A.
watt@osf.org, URL: http://riwww.osf.org:8001/~watt
- Abstract:
- All distributed hypertext infostructures [Til94] have a connected
structure. A weblet is a concept which captures this structure in a
directed graph (DG) [Ber73]. This paper describes the design of a tool
to capture the local structure and remote references of any Web
delivered under an HTTP-compatible server; describes its applications;
projects a future Web re-structuring tool; and mentions future advanced
functionality. The Weblet tool operates without an HTTP [HTT94] server.
- Keywords:
- directed graph; distributed hypertext infostructure; web file
system maintenance; weblet; web re-structuring tool.
1. Web Structure.
The structure of any Web is that of a DG; in almost all instances it is a
cyclic graph. This cyclic nature helps Web browsing agents navigate among
different traversal points of interest, perhaps aided by computable
predicates. Any representation of Web structure must retain this cyclic
nature. Computationally, the requirement is to avoid being trapped by the
cycles.
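That requirement can be met with a visited-set walk. The sketch below (Python, with illustrative names) marks each node before following its out-edges, so a cyclic Web is traversed exactly once; the same computation also yields a weblet as a reachable set.

```python
def reachable(start, out_edges):
    """Return every node reachable from `start` in a possibly cyclic
    directed graph, without being trapped by the cycles."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # already visited: this check is the cycle guard
        seen.add(node)
        stack.extend(out_edges.get(node, ()))
    return seen

# A small cyclic graph, in the spirit of Figure 1's back references:
edges = {"a": ["b"], "b": ["c"], "c": ["a"]}
```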
2. Weblet Definition.
A weblet is the set of documents reachable from some starting set by
hyperlinks satisfying some given criteria. A weblet expresses the notion of
one or more hypertext-referenced objects, from a single HTML [HTM94] page
to a complete local Web. A weblet is thus equivalent to any connected
subgraph. Any subgraph of a Web is a weblet, including the Web itself,
since a Web is a connected subgraph of itself.
2.1 Abstract Weblet.
Figure 1 illustrates the notion of an abstract weblet. The flow is generally
from the top down, and the curved lines represent backward references. The
straight lines to the side represent remote references. Back references to the
root node "a" exist on the paths between "e" and "a", and between "g" and "a".
The relations g->f and f->e, and the relations e->f and f->g, respectively
imply the transitive relations g=>e and e=>g.
2.2 Concrete Weblet.
Figure 2 depicts a concrete representation of an abstract weblet. As shown
in the figure, HTML files are represented as chunks, and chained together in
a chunk list which is doubly linked. All chunks refer back to the Root chunk.
Hyperlinks are chained off the chunks. Each hyperlink is doubly linked, and
all hyperlinks refer back to the chunk that contains them.
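The linkage described above can be sketched as follows. The field and function names are assumptions for illustration, not the tool's actual declarations; Python dataclasses stand in for the tool's records.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Hyperlink:
    target: str
    chunk: Optional["Chunk"] = None      # back-pointer to the containing chunk
    prev: Optional["Hyperlink"] = None   # doubly linked hyperlink chain
    next: Optional["Hyperlink"] = None

@dataclass
class Chunk:
    path: str
    root: Optional["Chunk"] = None       # every chunk refers back to the Root chunk
    prev: Optional["Chunk"] = None       # doubly linked chunk list
    next: Optional["Chunk"] = None
    links: Optional[Hyperlink] = None    # head of this chunk's hyperlink chain

def add_link(chunk: Chunk, target: str) -> Hyperlink:
    """Push a hyperlink onto the chunk's chain, setting its back-pointer."""
    link = Hyperlink(target=target, chunk=chunk, next=chunk.links)
    if chunk.links is not None:
        chunk.links.prev = link
    chunk.links = link
    return link
```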
3. The Weblet Tool Design.
3.1 Basic Design.
The design of the Weblet tool is straightforward. All locally known servers
are encoded in a global table, which is searched at the start of the program.
Starting from any accessible HTML document in the Web, a chunk is
created. When the initial chunk is created, we read the file into a buffer,
then scan it for hyperlinks, creating and linking the hyperlink data
structures to the chunk. After creating the hyperlink list hanging off the
chunk, we rescan the list, normalizing relative names, testing and
marking each hyperlink by type, and ignoring any duplicates.
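The relative-name normalization step can be sketched as below (the function name is illustrative; the tool's actual resolution rules may differ in detail):

```python
import posixpath

def normalize(referrer, href):
    """Resolve a relative hyperlink name against the directory of the
    referring document, collapsing '.' and '..' path components."""
    if href.startswith("/"):
        return posixpath.normpath(href)
    return posixpath.normpath(posixpath.join(posixpath.dirname(referrer), href))
```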
During the hyperlink scan of the chunk file, after resolving an appropriate
path name for a hyperlinked file, we test for the file's existence with a
stat call. Any file which is non-existent or has been renamed or moved
to another directory will fall out of this test as a broken hyperlink.
Remote hyperlinks are ignored in the initial version of the code.
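The existence test amounts to the following (a minimal sketch of the stat-based check; the function name is illustrative):

```python
import os

def is_broken(path):
    """A local hyperlink is broken when the resolved file cannot be
    stat'ed -- i.e. the file is missing, renamed, or moved elsewhere."""
    try:
        os.stat(path)
    except OSError:
        return True
    return False
```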
When the hyperlink rescan of the first chunk is complete, and we have any
hyperlinks, a main loop is entered. If a hyperlink has not been marked to be
ignored (as is the case for, e.g., .gif, .ps, and other file types), if the
file is in the file system, and if its filename has the suffix ".htm" or
".html", we create another chunk, link it to the previous chunk, and repeat
all of the above processing.
When we have discovered the last item in our Web, two outputs are emitted:
- a list of broken hyperlinks, and
- a list of connected files in the Web.
The list of connected files may be overridden to supply other information
via command line options, not described here.
3.2 Weblet Tool Components and Data Structures.
The Weblet tool is designed with abstract data types around the following
data structures:
- A domain_port for server associated data,
- A hyperlink data structure, and
- A chunk data structure for HTML documents.
The operational components consist of several routines: a high-performance
hyperlink scanner; a hash table routine, applied separately to hyperlinks
and to HTML chunks; and a set of routines built around the three main data
structures.
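The duplicate filtering that the hash tables provide can be sketched as follows (a Python set stands in for the tool's hash table; the function name is illustrative):

```python
def unique_references(hrefs):
    """Keep the first occurrence of each hyperlink name, in order,
    using a hash table lookup as the duplicate filter."""
    seen = set()
    out = []
    for h in hrefs:
        if h not in seen:
            seen.add(h)
            out.append(h)
    return out
```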
4. Current Status of the Weblet Tool.
The Weblet tool has been under part-time development since mid-September 1994.
It is an experimental research prototype. We have implemented several
enhancements suggested for the Weblet Tool in a previous draft of this paper.
They were:
- Capture bracketed hyperlink
- Command line options
- Capture every unique hyperlink reference chain
- Remote hyperlink liveness detection and validation
- Flatten the DG structure and provide memory-to/from-disk read/write routines
The first three are complete; we are currently working on the fourth, with
some routines and algorithms ready for the fifth.
There is a high-level WWW FORMS interface for users in our group.
5. Weblet Futures.
5.1. A Web Re-Structuring Tool.
Once the foundational changes above are in place, we intend to develop a
Web re-structuring tool.
The basic idea is to provide a graphical user interface (GUI) for authors
which not only depicts a graphical structure map of an existing Web, but is
able to provide browsing and editing operations such as cut-and-paste,
drag-and-drop. If a move-button widget selects a hyperlinked object from
somewhere in the Web, and its movement to somewhere else is graphically
previewed, the hyperlink reference chain information collected for that
hyperlinked object should be able to denote the location of every broken
hyperlink that the move would produce.
We believe the tool should allow multiple concurrent authoring sessions on
the same local Web, with appropriate locking functionality.
5.2. Advanced Functionality.
A structure map of a Web can serve a multitude of purposes:
- It determines the reachability of Web hyperlink references and precisely
delineates Web components. This information is useful for web file system
maintenance.
- It can support the constant authoring process by providing a basis for a
Web re-structuring tool.
- It can act as a gathering point for information in support of
agent-oriented servers.
- It can provide the synergy of an extensible framework to support
groupware collaboration.
- It can support the evolution of WWW Tool capabilities overall by
providing an information abundant foundation for a repository with
rapid query response.
A server oriented version of the tool should inter-cooperate with agents and
other servers to provide some of these capabilities.
References.
- [Ber73]
- C. Berge. Graphs and Hypergraphs. Published 1973, North-Holland
Publishing Co, Inc. Amsterdam.
- [HTM94]
- Tim Berners-Lee, CERN and Daniel Connolly, HAL. Hypertext Markup
Language Specification - 2.0. Internet draft, October 13, 1994.
Published at Second International WWW Conference 1994.
- [HTT94]
- T. Berners-Lee, R.T. Fielding, H. Frystyk Nielsen. Hypertext
Transfer Protocol - HTTP/1.0. Internet draft, Expires May 28, 1995.
- [Til94]
- James "Eric" Tilton. What is an Infostructure? Published on the
WWW at http://www.willamett.edu/~jtilton/info-p.html, dated October 23, 1994.
Figures.