From a Web to a linear HTML document

Abstract

The lingua franca of the World Wide Web is the HyperText Markup Language (HTML) which presents information as a Web of small documents linked to each other. While this Web navigation is convenient when searching for a specific information, it may be cumbersome when all the information must be read anyway (like traditional linear texts). The proposed approach adds linearization hints to a Web of HTML documents and provides tools to automatically generate a linear version of a Web. Furthermore, a table of contents, an index and a list of references are automatically produced for the linear version.

Introduction

In the past few years, the number of Internet users and the amount of information available have increased significantly. The World Wide Web is probably the medium which has enjoyed the most important growth. It can be defined as a set of protocols (Hypertext Transfer Protocol) [1], formats (Hypertext Markup Language, HTML) [2], software tools and conventions.

Hypertext documents are nodes of information linked together. This way, each separate topic resides in a separate file and may be used in different contexts. For instance, an HTML document on copying garbage collection may be pointed to from documents on algorithms, object oriented programming, or on the Modula-3 language.

Other document processing formats such as LaTeX [19] may be divided into small sections which are merged into a linear document with "input" commands. However, these systems have little support for reusing a file in several documents. Indeed, the pathnames must be relative to the document root, instead of to the current file in HTML, and header level must be adequate for the context of the including document. Other popular document formats such as GNU info [] and Linux-doc sgml present similar problems. Moreover, with the importance of the World Wide Web as the most up to date source of information, and the wide availability of tools such as Amaya [] for HTML authoring, it is preferable to use HTML as the source format than to convert from another format (e.g. LaTeX) to HTML.

The flexibility allowed by HTML to split documents into small reusable nodes is also a source of problems. Indeed, in many cases the information should be available both as an online Web and as a linear printable document. Examples include the Modula-3 language definition and libraries, the Java language definition, libraries and tutorial, and several documents from the World Wide Web consortium (W3C) and the Linux documentation project.

Linearization Rules and Issues

An HTML document usually consists of:

<HTML>
<HEAD>
<TITLE> This is the title </TITLE>
</HEAD>
<BODY>
detailed text here
</BODY>
</HTML>

Where only the TITLE is mandatory. The BODY implicitly starts after the TITLE, if no HEAD or BODY section is tagged. The linearization of a Web starts at a node specified as root. From there, hypertext links are followed in a depth first search, with the following rules.

Each node reached is assigned a unique identifier (derived from its URL). A table of nodes visited (storing the node name (URL) and identifier) is kept. A list of currently visited nodes, along the path between the root and the currently visited node, is also kept.
When a node already on the currently visited nodes list is reached, a cycle exists which must be broken. The link is not followed. When an already visited node is reached, it may or not be followed, but at least a warning message should be issued.
In HTML, sections (chapter, section, subsections...) are delineated with <DIV> and </DIV>. DIV may be used for all sorts of hierarchical containers and must be characterized by their CLASS attribute. The header level tags (<H1>) thus identify the section title, while the depth could be inferred from the nesting level of <DIV> elements. In practice, however, few documents specify <DIV> tags and most rely solely on the header level. The convention proposed here is that the effective depth of a header is the nesting depth of <DIV CLASS=SECTION> elements plus the header level.
The DIV elements may also be used to identify special portions of a document (<DIV CLASS=ABSTRACT>...</DIV> <DIV CLASS=APPENDIX>...</DIV>). If a DIV element is both a hierarchical level within the sections and identifies a special portion, the CLASS attribute may accommodate a list of values (<DIV CLASS=SECTION,ABSTRACT>).
Links with the attribute REL=INCLUDES are followed. The link is replaced by a DIV with the node identifier as ID and the node body as content. For example: <A REL=INCLUDES href="subdir/toto.html"> The essence of foo </A>

is thus replaced by

<DIV ID="...subdir/foo.html">
body of foo.html
</DIV>
Links with the attribute REL=SUBDOCUMENT are followed. The only difference with the previous case is that the DIV has the CLASS=SECTION attribute.
For both REL=SUBDOCUMENT and REL=INCLUDE, sometimes the included node contains a title which is not appropriate in the context where it is included. When the link is of CLASS=SKIPTITLE, the first H1 element within the included node is removed (or replaced with an empty <DIV></DIV>). When the link is of CLASS=REPLACETITLE, the text within the link (<A ...> text within the link </A>) is used to replace the first H1 element within the included node.
When a node is included, all its IDs are replaced by the concatenation of the node identifier and the old ID. This way, IDs should remain globally unique, provided that node identifiers have a special format which would not be used as ID within a node.
Other links are not followed. However, if they reference an ID internal to the node, the HREF is replaced by the concatenation of #, the node identifier and the old ID with the leading # removed.
Links to subdocuments are often placed in a list (<UL>). However, when replacing the links by the nodes, the list should be removed. Thus, a list which only contains links replaced by nodes should be removed. A list which contains some but not only such links may or not be removed but a warning should be issued. Similarly, other elements may not be wanted in the printable document. They may be identified by CLASS=SKIPLINEAR.
As a convenience, it is possible to specify to the linearization tool a set of mappings to affect the links. Each mapping contains an HREF and REL pattern (as in the regex library) and an HREF and REL replacement. This is mostly useful to prune the links to follow (replacing the REL=SUBDOCUMENT by REL="" for instance) or to redirect a reference to another document.
Once the linearization is performed, a second pass is required to adjust links which became internal. Every link to a node which appears in the visited nodes table should have its HREF changed to the concatenation of #, node identifier, and if applicable the ID within the node with the leading # removed. This is demonstrated in the next two figures.
Figure 2: Nonlinear structure

Figure 2 presents the structure of the Web site before the linearization and Figure 3 after the linearization. When performing the linearization on the graph, links which have the label REL=SUBDOCUMENT and REL=INCLUDES are followed. In the example, this is the case for links a,b,c and i. Links d,e,f,g,h,j,k and l don't have this label and are treated as references.
Two types of references exist: internal references and external references. Some references were already internal to a node. Other references were external to a node but point to another node included in the linear document and thus become internal. Finally, some external references remain external after the linearization. In the nonlinear structure, links f and g are initially external references. However, since they point to nodes 3 and 5 which have been included in the linear document, they become internal references and their HREF attribute is updated accordingly in the linear document. However, links k and h, which were also external references, point to nodes 6 and 7 which are not included in the linear document. Therefore, they remain unchanged as external references in the linear document.

Figure 3: Linear structure
Sometimes extra information must be stored in nodes. It may be indexing terms for an index, short descriptions to add to a table of contents, or even a bibliographic entry for the document. As of HTML 3.2, there is no container element which may store but not display contained elements (e.g. <DIV CLASS=NODISPLAY>). Therefore, such information may have to be "encoded" within attributes of other elements, such as the ID of a SPOT element:

<SPOT CLASS=INDEX ID="xxx_key_text">

where "xxx" is the character string to use in the index, "key" the sorting key, and "text" the text associated with this index entry. This information may then be used to create an index for a Web site, or for a linearized document.

Construction of Derived Nodes

Once the linearization is performed, it is useful to construct derived nodes containing information about the linear document, or about the web of visited nodes. In the first case, the address within the linear document is used, while in the second case the original address within the web is used.

A table of contents lists all the headers. Nested lists are used to represent the sections hierarchy. Each list entry is a link containing the section title, and pointing to the relevant position in the linear document or the web of visited nodes. Thus, each header should have an ID attribute. Similarly, a list of figures or tables may be constructed; the FIGURE element was recently dropped from the newer HTML 3.2 proposal however.

A reference list is another important but slightly more complex derived node. It should contain a list of all references external to the linear document. However, these references are in URL form while most printed documents prefer traditional bibliographic entries. In some cases the URL may point to a bibliographic entry:

<DIV ID="Here" CLASS=BIBITEM>
Author,<I>book title</I>, O'Connell, 1995
</DIV>
In other cases, the URL may point to an ordinary document. However, a bibliographic entry may be deduced:
<TITLE> book title </TITLE>
<H1> book title </H1>
<P>
<DIV CLASS=AUTHOR> Author </DIV>
<P>
<DIV CLASS=BIBITEM.TAIL> O'Connell, 1995 </DIV>
... If there is no such information to be found, the entry would be simply the URL.

Conversion to ASCII

To produce an ASCII document, the table of contents may be prepended to the document, and the reference list and index appended. The references in the table of contents should be replaced by the page number of the corresponding section. The entries in the reference list should have a tag assigned (order of first occurrence, e.g. [1] [2], or a symbolic name used in anchors, e.g. [Connolly1996]).
Then, external references in the document must be replaced by the tag associated with the corresponding entry in the reference list.
The internal references to figures, sections and tables should be replaced by the figure/section/table number. There may be cases where one wants the page number instead. This may be achieved with the attribute CLASS=PAGEREF in <A HREF=...>.

Producing LaTeX documents

LaTeX can automatically generate the table of contents and the index, and takes care of \label and \ref, or \pageref, for sections, tables and figures. Thus, the conversion between HTML and LaTeX is more straightforward. The major item where special care is required is the reference list. It must be generated by the linearization tool. However, LaTeX is able to connect each \cite to the corresponding \bibitem in the reference list.

Results

The linearization tool implemented in htmlkit was tested on a Sun Sparc 5 workstation with 32 Mo of RAM. In each case, a web was linearized, an index, a table of contents and a list of references were built, and the output was converted to ASCII, LaTeX and Postscript.

Table 1 presents the elapsed time as a function of the number of nodes linearized. The processing time grows linearly with the number of nodes involved. In the current implementation, the modularity was favored over the performance. In any case, in typical applications, the network or disk access delay is likely to be the major limiting factor. Nevertheless, a large document may be linearized in approximately one minute.

**Table 1:** Linearization time versus number of nodes
Nbr. of nodes	CPU time (s)	User delay (s)	Size
Nbr. of nodes	CPU time (s)	User delay (s)	files (Ko)	memory (Ko)
34	28.31	31.84	34.687	2494
94	35.11	38.62	53.654	2536
125	40.81	45.84	90.132	2685
186	52.72	71.89	101.866	2699
217	56.75	76.33	107.630	2703
248	61.17	82.66	128.856	2709
306	71.03	110.84	166.397	2816

Figure 4: CPU time used versus nbr. of nodes in the network

Figure 5 presents the memory consumption as a function of the number of nodes linearized. The memory usage grows less than linearly with the number of nodes involved. The size of the VISITED NODES list and of the REFERENCES list grows linearly with the number of nodes; the entries in these lists are quite small though. The size of the memory devoted to storing complete nodes is proportional to the number of currently visited nodes, which is typically proportional to the logarithm of the number of nodes (it represents the depth of the tree formed by the links where the REL attribute is SUBDOCUMENT or INCLUDE). Again, the memory consumption for a large document is not a problem, at below 3MB.

Figure 5: Memory used versus nbr. of nodes in the network

Figure 6 presents the memory consumption, for a constant linear document size, as a function of the node size (which in this case is inversely proportional to the number of nodes). The effect of enlarging the granularity of nodes is to slightly reduce the tree depth, and thus the number of currently visited nodes, but mostly to enlarge the size of each of visited node. The size of the VISITED NODES and REFERENCES lists shrinks with the smaller number of nodes but their contribution to the total memory consumption is small. The net result is an almost linear increase of the memory consumption with the node size.

Figure 6: Memory used versus the size of bytes/nodes for a constant graph size (94Kb)

From a Web to a linear HTML document

Abstract

Introduction

Objectives

Linearization Rules and Issues

Construction of Derived Nodes

Conversion to ASCII

Producing LaTeX documents

Implementation

Results

Conclusion

References