HTML to the Max: A Manifesto for Adding SGML Intelligence to the World-Wide Web

C. M. Sperberg-McQueen

Robert F. Goldstein

Abstract

HTML demonstrates that SGML markup is useful for networked information. How can it be made even more useful? One way is to extend the tag set from HTML to HTML2, etc. We argue here for a more radical approach: full SGML awareness in WWW. We believe the difficulties are small, the cost affordable, and the advantages overwhelming.

SGML is a metalanguage for defining markup languages; HTML is just one instance of this infinite family. At present, documents in other SGML document types must be translated into HTML for display by a Mosaic client, and this translation sometimes imposes unacceptable information loss.

WWW browsers could handle other SGML document types without translation by launching a general-purpose SGML browser to view them, as they now launch graphics viewers; a better solution overall would be to build SGML display into the WWW browsers themselves. Either way, display of an SGML document would be controlled by a style sheet using a small number of display primitives ('bold', 'line break', etc.) to specify the rendition of each element type. For 'well-known' document type definitions (DTDs) like HTML, style sheets could be distributed with the browser, or built in. For other DTDs, the browser would fetch a style sheet from the server. Using style sheets, browser software can also make it easy to customize document display.

DTDs and style sheets can be designed to accommodate extensions, ensuring that authors can make small extensions to the tag set with no change whatsoever in the target browsers and virtually no performance penalty.

A Simple Proposal

Note: this is an opinionated paper, not so much because we think the issues are all black and white as because (a) the ideas we are pushing will be clearer to most readers if we exaggerate them slightly, (b) we don't have enough space to expound all the nuances, and (c) black and white contrasts are more fun to talk and hear about.

Let us start with a simple proposal. The current generation of Web client software, when it receives an HTML document, does something like this:

In the current generation of software, the information used in deciding how to process the tag is hard-coded into the Web client: at code-writing time, the programmer decides how to format each tag, based on the descriptions of typical renderings given in the HTML specification. (Of course this isn't quite true, since default fonts can often be changed at run time or even interactively. But the list of tags and their semantics is fixed by the definition of HTML. Also, decisions about line breaks, justification, etc., are not left to the user.)

Our simple proposal is that all of this continue much as it does now, but that the rules for processing each tag (and by implication, the complete list of tags that can be processed) be loaded dynamically at the time the document is fetched from the server. Web browsers should implement a table of processing options, which they can load from disk (or remote server) and use when processing an HTML (or HTML++++) document.
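As a minimal sketch of what such a table of processing options might look like, the following Python fragment loads the rules for each tag at run time instead of compiling them in. The JSON table format, the primitive names (bold, italic, break_before), and the function names are all our own assumptions for illustration, not part of any existing browser.

```python
import json

# Hypothetical style table, as it might arrive from disk or a remote server.
# The JSON format and the primitive names are assumptions for illustration.
STYLE_TABLE_TEXT = """
{
  "em":   {"bold": false, "italic": true,  "break_before": false},
  "h1":   {"bold": true,  "italic": false, "break_before": true},
  "note": {"bold": false, "italic": true,  "break_before": true}
}
"""

DEFAULT_STYLE = {"bold": False, "italic": False, "break_before": False}


def load_style_table(text):
    """Parse a style table; nothing about individual tags is hard-coded."""
    return json.loads(text)


def style_for(tag, table):
    """Look up the processing options for a tag, falling back to defaults."""
    return {**DEFAULT_STYLE, **table.get(tag.lower(), {})}


table = load_style_table(STYLE_TABLE_TEXT)
print(style_for("H1", table))
print(style_for("unknown", table))
```

Because the list of known tags lives in the table rather than in the code, supporting a new tag such as note means shipping a new table, not a new browser.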

The two key points are:

Perhaps the most important advantage of this approach is that each author can define his or her own tag set, with its own attached "meaning", without caring (much) about which browser will be used to view the document. Note that "meaning" can have two meanings -- (1) how the tagged object is displayed, and (2) what the relationship of the tagged object is to other tagged objects and to the human. We need not wait for an endless succession of HTML extensions to wend their way through a committee. In our view, the power of semantics belongs at the authors' fingertips, not the programmers'.

All sorts of information can be encoded in a rich tag set, but no one can master a superset-of-everything explicitly. In our proposal, a musician can annotate music, a mathematician can mark an equation, a programmer can comment on code, a statistician can select a column of data, and a database query can have as many different "submit" buttons as one wants.

Currently in HTML, however, there are at most two solutions for these problems: <pre> or <img src= >, both of which preserve a simple display but destroy the vital information necessary for a wide variety of sophisticated displays or other post-processing. If, however, a browser can preserve information from arbitrary tags, then all sorts of post-processing is possible, not excluding the mundane display-on-a-terminal.

The musician might view the document, and then import a few bars into a MIDI program. The mathematician might view an equation, then import it into Mathematica and solve it. The statistician might import a few rows from a table into SAS and compute a standard deviation. And so on. SGML doesn't make it happen, but it does make it possible.

The upside of our proposal is that authors will have complete control over their tag sets; the downside is that this control must be expressed in terms of a new style-sheet language. We believe the trade is a substantial net gain, because a sensible language that can define tags is much richer than any pre-defined tag set. But a widespread implementation must come about either by widespread agreement, or by one person doing a great job and giving away a gazillion copies of a new (or improved) browser.

How is This Related to SGML at All?

SGML is a metalanguage for defining markup languages; HTML is one example of an SGML-defined language. Our proposal above does not require browsers to parse and handle and validate arbitrary SGML in the usual way that real SGML editors do. Indeed, it does not even force a newly-defined tag set to be SGML compatible. But restricting Web documents to bona fide SGML documents is the only sensible way to go, for two reasons.

First, sufficiency. SGML is rich enough to provide good solutions to virtually all of the Web's markup requirements for many years to come. SGML provides a public, non-proprietary method for interchange of data of all kinds. It is particularly suited for capturing the structured nature of text, and it coexists well with graphics (and other special encoded files) in any format. Roughly speaking, SGML is naturally suited for defining a record type in an arbitrary object-oriented database. If sending tree-structured objects (such as HTML+++ documents) is useful to the Web, SGML is the tool of choice.

Second, necessity. Although browsers need not validate SGML documents, they (or various external viewers) may find it useful to do so. Error detection and recovery is just one use of validation. Without the assurance of a valid SGML document, it would be fairly difficult for a browser to post-process a document, for example to export a complicated mathematical equation to a clipboard for import into Mathematica. Although a style sheet suffices for many applications, it cannot replace a proper SGML document type definition for other uses. Allowing non-SGML documents opens a large can of worms for viewers to be built in the next few years, but doesn't seem to have any real advantages.

Using External SGML Browsers

There are two obvious approaches to providing better support for SGML on the Web. The first is to treat it like any specialized data format, and to launch specialized browsers to display data in that form. This approach is described in this section. The other approach, integrating SGML awareness beyond HTML awareness into Web browsers, is described in the next section.

Using existing software, it is easy to support SGML as a specialized data type. For example, we have implemented a demonstration of an SGML viewer spawned from Mosaic, using the commercial SGML editor Author/Editor, by SoftQuad, Inc. All we did was modify src.conf and .mailcap so both the server and client would recognize .tei as a file of type text/x-sgml-tei.
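The same two-step mapping (file extension to media type, media type to viewer) can be illustrated in a few lines of Python using the standard mimetypes module. This is an analogy only: it does not reproduce the syntax of the server or .mailcap configuration files, and the file name hamlet.tei and the command name sgml-viewer are hypothetical.

```python
import mimetypes

# Step 1 (server side, by analogy): map the .tei extension to a media type.
mimetypes.add_type("text/x-sgml-tei", ".tei")

# Step 2 (client side, by analogy): map the media type to an external viewer.
# "sgml-viewer" is a hypothetical command name, not a real program.
VIEWERS = {"text/x-sgml-tei": "sgml-viewer"}


def viewer_for(filename):
    """Return the viewer command for a file, or None if no viewer is known."""
    content_type, _encoding = mimetypes.guess_type(filename)
    return VIEWERS.get(content_type)


print(viewer_for("hamlet.tei"))
```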

Using this approach, we can exploit SGML for a number of uses to which HTML is not now suited:

Unfortunately, processing SGML with an external browser does have some limitations and drawbacks. Most important, we cannot, with current software and protocols, use an external SGML viewer to browse hyperlinked documents: or rather, we could, but the viewer has no way of notifying the original browser that the user has clicked on a link end, so there is no way to traverse the hyperlinks. Of course, this limitation applies with equal force to all data formats handled with external browsers. It might be removed by defining callback functions or some other method of communication between the WWW browser and the external viewer, as suggested recently by Bruce R. Schatz and Joseph B. Hardin.

Our demo with Author/Editor sidesteps an important issue. SGML files do not exist in a vacuum: they need Document Type Declarations and Style Sheets for proper validation and display. In our demo, and with HTML, this is done by pre-loading the browser with this information. But the whole point of arbitrary SGML support is the possibility of serving an SGML file whose DTD and/or style sheet(s) are not pre-loaded. These extra files must therefore be available somewhere convenient on the net.

We propose that any HTTP server which distributes an SGML document should be responsible for providing the DTD for that document, on demand. For performance reasons, however, a given browser should be free not to download the DTD, either because it is a well-known type (and therefore locally cached) or because it is not needed for display processing. A natural solution is to have the HTTP header of the document itself provide a universal resource identifier for the DTD, and other URLs for one or more suitable style sheets. The WWW-Link field could be used for this purpose:

WWW-Link:  href='ftp://ftp-tei.uic.edu/pub/tei/dtd/tei2.dtd';
           rel='DTD'
WWW-Link:  href='ftp://gluon.cc.uic.edu//pub/tei/styles/tei2.style';
           rel='style-sheet'; title='Basic TEI Style Sheet'
WWW-Link:  href='ftp://gluon.cc.uic.edu//pub/tei/styles/tei2beta.style';
           rel='style-sheet'; title='TEI Style Sheet, alternate form'
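A client might read header fields of this shape roughly as follows. This is a sketch under stated assumptions: it assumes the single-quoted, semicolon-separated parameter syntax shown in the example above and folded continuation lines, and the function names are our own; it is not a parser for any finalized header specification.

```python
import re

HEADER_TEXT = """\
WWW-Link:  href='ftp://ftp-tei.uic.edu/pub/tei/dtd/tei2.dtd';
           rel='DTD'
WWW-Link:  href='ftp://gluon.cc.uic.edu//pub/tei/styles/tei2.style';
           rel='style-sheet'; title='Basic TEI Style Sheet'
"""

# One parameter: name='value' (quoting style assumed from the example above).
PARAM = re.compile(r"([A-Za-z-]+)\s*=\s*'([^']*)'")


def unfold(text):
    """Join continuation lines (those starting with whitespace) to their field."""
    fields = []
    for line in text.splitlines():
        if line[:1] in (" ", "\t") and fields:
            fields[-1] += " " + line.strip()
        elif line.strip():
            fields.append(line.strip())
    return fields


def parse_links(text):
    """Return one dict of parameters per WWW-Link field."""
    links = []
    for header_field in unfold(text):
        name, _, value = header_field.partition(":")
        if name.strip().lower() == "www-link":
            links.append(dict(PARAM.findall(value)))
    return links


links = parse_links(HEADER_TEXT)
dtd_urls = [link["href"] for link in links if link.get("rel") == "DTD"]
style_urls = [link["href"] for link in links if link.get("rel") == "style-sheet"]
print(dtd_urls)
print(style_urls)
```

Keeping the title parameter alongside each style-sheet URL would let a browser offer the alternatives to the user by name.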

Style sheets are different from DTDs in that they are more necessary for proper display, but less tightly bound to the document definition. Different processing specifications, or style sheets, can therefore easily be introduced, and the same input document can be processed in multiple ways. Again, the browser might choose not to download a style sheet, either because it has been cached, or because the user prefers to use a locally modified style sheet to suit his display or preferences.

Most existing SGML editors and browsers do have style sheet mechanisms; unfortunately, they are currently product-specific, not standardized. It would be insane, however, to expect authors or publishers to formulate multiple style sheets for their documents, one in each proprietary style-sheet language. If the World Wide Web is to take serious advantage of SGML, a common style-sheet language, for browsing and forms at least, must be agreed on.

In summary, successful widespread use of external SGML browsers seems to require that a number of steps be taken.

Integrating SGML Support in the Web Browser without Losing Your Mind

One might prefer to integrate SGML support into the Web browser, rather than delegating it to an external viewer. This would give the user more function and performance through a single unified interface. But a Web browser is unlikely to provide SGML support on a par with a stand-alone SGML system, just as it is unlikely to provide graphics facilities equal to those of specialized graphics programs. We are not going to try to argue the case pro and con here; we think it does make sense to integrate SGML intelligence directly into Web clients, and we propose in this section to outline rather briefly what would be involved, and how to keep the task manageable. (N.B. some SGML technical terms are used without warning or definition in the discussion which follows, but we hope the gist is clear.)

After downloading an SGML document, a browser would obtain a style sheet and possibly DTD, either from local sources or elsewhere on the net. It would then parse the SGML and render the document on the screen. To make the parsing code simple, one can restrict acceptable SGML to easily parsable forms without losing any crucial function. So we propose the following Web-wide conventions:

If these conventions are followed, the client need not implement a full SGML parser, merely a non-validating minimal SGML parser, which is somewhat simpler. The implementation becomes simpler still, however, if we adopt a couple of strict application conventions. Even minimal parsers are required to know that empty elements have no end-tags; even non-validating parsers are required to treat newline sequences in different ways, depending on how various elements were declared in the DTD. We can eliminate these requirements by adopting these rules:

Together with the commitment to server-side validation, these two conventions allow the client's SGML parser to ignore the document type declaration (DTD) for a document almost entirely. The start and end of every non-empty element in the document is explicitly marked, and the empty elements are all identified in the style sheet. Because newlines are allowed to be significant to the application only in well defined restricted areas, the parser need not attempt to implement the newline rules of the SGML standard. The parser must scan the DTD only for SGML entity declarations, since it must be able to expand entity references in the document instance.

We could eliminate the need to scan the DTD even for entity references, if we adopt the rule that the server will expand all such references, except for those needed to provide access to Latin characters with diacritics, Greek characters, and the like, which are generally included in standard entity sets issued by ISO. It is probably better, however, to allow for at least the possibility of client-side expansion of entities. This will reduce bandwidth requirements in some cases (functioning like a client-side INCLUDE), but a more important reason is that entity declarations are crucial to SGML interfaces to data in other notations (such as graphics files). Where it is possible, of course, server-side expansion of entity references is desirable, since it completely eliminates the client's need to read the DTD. We propose, therefore, that the HTTP header specify (in ways to be determined, possibly by a content-encoding field) whether entity references will be pre-expanded by the server or not.
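To suggest how little of the DTD a client would need to read for client-side expansion, here is a rough Python sketch that scans DTD text for entity declarations and expands references in a document instance. It is deliberately simplified and the names are ours: it handles only internal general entities with double-quoted literal values, ignores parameter entities and external (SYSTEM or PUBLIC) entities, and does not expand references nested inside replacement text.

```python
import re

# Internal general entity declarations only: <!ENTITY name "replacement text">
ENTITY_DECL = re.compile(r'<!ENTITY\s+([A-Za-z][\w.-]*)\s+"([^"]*)"\s*>')
ENTITY_REF = re.compile(r"&([A-Za-z][\w.-]*);")


def read_entities(dtd_text):
    """Collect entity names and replacement texts; skip everything else."""
    return dict(ENTITY_DECL.findall(dtd_text))


def expand(text, entities):
    """Replace known references; leave unknown ones untouched."""
    return ENTITY_REF.sub(lambda m: entities.get(m.group(1), m.group(0)), text)


# A hypothetical DTD fragment for illustration.
DTD = '''
<!ELEMENT p - - (#PCDATA)>
<!ENTITY tei "Text Encoding Initiative">
<!ENTITY % phrase "em | q">
'''

entities = read_entities(DTD)
expanded = expand("<p>The &tei; uses SGML &amp; more.</p>", entities)
print(expanded)
```

References the client does not know (here &amp;) are passed through unchanged, which is where the standard ISO entity sets would be consulted.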

We are placing as much as possible of the responsibility for validation and processing on the server, rather than on the client. Validation could be done manually when the document is installed, or automatically when the server sends the document (if the server has enough horsepower). Of course, sending a non-valid SGML document can result in unpredictable behavior on the part of the browser.

Server-side validation does somewhat complicate the process of publishing documents on the Web. Publishers will need software to validate and normalize SGML documents. Fortunately, the public-domain parser SGMLS can readily be used for validation, and with a few tweaks it can also be used as the basis for an SGML normalizer. Of course, most people who will be interested in providing access to SGML documents over the Web already have SGML software, and validation and normalization of documents may already be part of their normal routine. (Server-side normalization is necessary only in order to make it easier to include SGML intelligence in the Web client itself; existing stand-alone SGML browsers are typically able to perform their own normalization.)

If the above conventions are adopted, a browser has only to recognize and process the following types of SGML markup:

There is no need for detailed discussion of these kinds of markup. Existing HTML software already handles start-tags, end-tags, entity references, and comments; the HTML specification documents but deprecates both processing instructions and marked sections. Conforming parsers must, however, properly recognize and process them. Fortunately, their syntax is relatively straightforward and adds very little to the complexity of the parser.
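A non-validating recognizer for the kinds of markup named in this paragraph can indeed be quite small. The Python sketch below is illustrative only and is not a conforming SGML parser: it assumes normalized input with no markup minimization, assumes attribute values never contain a '>' character, and treats a marked section as one opaque token.

```python
import re

TOKEN = re.compile(
    r"""
      (?P<comment><!--.*?-->)
    | (?P<marked><!\[.*?\]\]>)
    | (?P<pi><\?.*?>)
    | (?P<end></[A-Za-z][\w.-]*\s*>)
    | (?P<start><[A-Za-z][\w.-]*[^>]*>)
    | (?P<entity>&[A-Za-z][\w.-]*;)
    | (?P<text>[^<&]+)
    """,
    re.VERBOSE | re.DOTALL,
)


def tokenize(source):
    """Yield (kind, raw_text) pairs for each piece of markup or text."""
    pos = 0
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if match is None:
            # A stray '<' or '&': pass it through as one character of text.
            yield ("text", source[pos])
            pos += 1
        else:
            yield (match.lastgroup, match.group())
            pos = match.end()


doc = "<p>See <?page break><!-- note --> &tei;<![ CDATA [a<b]]></p>"
kinds = [kind for kind, _raw in tokenize(doc)]
print(kinds)
```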

If the principles of minimal SGML, server-side validation, and the application conventions regarding newlines and empty elements are accepted, then it will be relatively simple to integrate SGML conforming parsers into existing or new WWW client software. Like SGML support using external browsers, of course, this method also requires use of the HTTP header to indicate the location of the DTD and style sheet, and the adoption by the Web community of a common style-sheet language for use in WWW display of documents. We turn now to the description of that style-sheet language.

Style Sheet Languages

Style sheets, like DTDs, are auxiliary documents, used for processing SGML documents. A style sheet is a way of associating properties and actions with each element in an SGML tree of elements, very much like using Xresources to associate properties with a tree of widgets.

DTDs have a standard notation prescribed by an ISO standard, but style sheets currently have no standard notation. However, ISO is currently balloting an international standard called the Document Style Semantics and Specification Language (DSSSL), which will provide a standard syntax for style sheets and other specifications for SGML processing.

Semantics

Semantically, the style sheet language to be supported must be kept simple, at least at its bottom level. To be useful with existing software, the style sheet language has to allow the specification of fonts, typesizes, colors, and the like, but clients must not be required to support every high-end feature mentioned in the style-sheet language. There have to be specific and well understood methods of specifying the fallback processing to be performed if certain style primitives are not available.
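One way to make fallback processing specific and well understood is to give every primitive an explicit next choice. The Python sketch below illustrates the idea; the primitive names, the fallback chain, and the set of primitives supported by this imaginary client are all assumptions of ours.

```python
# Hypothetical capabilities of one client, and hypothetical primitive names.
SUPPORTED = {"bold", "underline", "plain"}

# Each primitive lists what to try next if the client cannot render it.
FALLBACKS = {
    "small-caps": "bold",
    "italic": "underline",
    "bold": "plain",
    "underline": "plain",
}


def resolve(primitive, supported=SUPPORTED, fallbacks=FALLBACKS):
    """Follow the fallback chain until a supported primitive is found."""
    seen = set()
    while primitive not in supported:
        if primitive in seen or primitive not in fallbacks:
            return "plain"  # last resort: render unstyled
        seen.add(primitive)
        primitive = fallbacks[primitive]
    return primitive


print(resolve("small-caps"))
print(resolve("italic"))
print(resolve("blink"))
```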

Ideally, the style sheet language should be declarative, not procedural, and should allow style sheets to exploit the structure of SGML documents to the fullest. Styles must be able to vary with the structural location of the element: paragraphs within notes may be formatted differently from paragraphs in the main text. Styles must be able to vary with the attribute values of the element in question: a quotation of type "display" may need to be formatted differently from a quotation of type "inline". They may even need to vary with the attribute values of other elements: items in numbered lists will look different from items in bulleted lists.
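The three kinds of variation just described can be sketched as declarative rules matched against an element, its attributes, and its parent. The Python fragment below is one possible reading, not a proposed language: the Rule and Element classes, the property names, and the "most specific rule wins" ordering are our own assumptions, and for brevity only the immediate parent is consulted rather than every ancestor.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    name: str
    attrs: dict = field(default_factory=dict)
    parent: "Element | None" = None


@dataclass
class Rule:
    name: str                                   # element type the rule applies to
    style: dict                                 # properties the rule sets
    attrs: dict = field(default_factory=dict)   # required attribute values
    parent: str = None                          # required parent element type
    parent_attrs: dict = field(default_factory=dict)

    def matches(self, el):
        if el.name != self.name:
            return False
        if any(el.attrs.get(k) != v for k, v in self.attrs.items()):
            return False
        if self.parent is None:
            return True
        p = el.parent
        return (p is not None and p.name == self.parent and
                all(p.attrs.get(k) == v for k, v in self.parent_attrs.items()))

    def specificity(self):
        return len(self.attrs) + (self.parent is not None) + len(self.parent_attrs)


RULES = [
    Rule("p", {"size": "normal"}),
    Rule("p", {"size": "small"}, parent="note"),
    Rule("q", {"layout": "block"}, attrs={"type": "display"}),
    Rule("q", {"layout": "inline"}, attrs={"type": "inline"}),
    Rule("item", {"marker": "number"}, parent="list",
         parent_attrs={"type": "numbered"}),
    Rule("item", {"marker": "bullet"}, parent="list",
         parent_attrs={"type": "bulleted"}),
]


def style_of(el, rules=RULES):
    """Merge all matching rules, least specific first, so specific ones win."""
    style = {}
    for rule in sorted((r for r in rules if r.matches(el)), key=Rule.specificity):
        style.update(rule.style)
    return style


note = Element("note")
numbered = Element("list", {"type": "numbered"})
print(style_of(Element("p", parent=note)))
print(style_of(Element("item", parent=numbered)))
```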

At the same time, the language has to be reasonably easy to interpret in a procedural way: implementing the style sheet language should not become the major challenge in implementing a Web client.

The semantics should be additive: It should be possible for users to create new style sheets by adding new specifications to some existing (possibly standard) style sheet. This should not require copying the entire base style sheet; instead, the user should be able to store locally just the user's own changes to the standard style sheet, and they should be added in at browse time. This is particularly important to support local modifications of standard DTDs.
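Additive semantics of this kind amount to a per-property merge of a small local delta over a shared base. A minimal Python sketch follows; the dictionary representation of a style sheet, the element names, and the property names are assumptions made for the example.

```python
def merge_styles(base, delta):
    """Return a new style sheet: base rules overridden per property by delta."""
    merged = {tag: dict(props) for tag, props in base.items()}
    for tag, props in delta.items():
        merged.setdefault(tag, {}).update(props)
    return merged


# A hypothetical standard style sheet, cached or fetched from a server.
BASE = {"h1": {"bold": True, "size": 18}, "p": {"size": 12}}

# The user's local changes only: a larger paragraph size, and a rule for an
# element ("stage") added by a local modification of the DTD.
USER_DELTA = {"p": {"size": 14}, "stage": {"italic": True}}

effective = merge_styles(BASE, USER_DELTA)
print(effective)
```

The base style sheet is never copied to disk or modified; only USER_DELTA needs to be stored locally and applied at browse time.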

The style-sheet language semantics must be full enough to allow the description of current HTML browsers, restricted enough to make implementation feasible, and constructed with a view to later extension and expansion. As a first step toward understanding what would be required, we have extracted a set of semantic primitives from the descriptions of 'typical rendering' for each element in the specification of HTML. We discovered that a set of fourteen or so primitives suffice to express all the processing described there; these primitives include indications of whether the element is laid out inline or processed as a block; whether internal newlines are respected; whether vertical space is generated before or after it, the margins within which the contents are to be set, etc. We won't go into further detail, since the point of this paper is to talk about incorporating SGML intelligence into the Web, not to propose a specific style-sheet language for adoption by the Web community. A quick description of the style primitives, with examples of style-sheet specifications for some HTML elements, has been posted on the UIC Web server at http://www.uic.edu/~cmsmcq/style-primitives.html

Syntax

Syntactically, the style sheet language must be very simple, preferably trivial to parse. One obvious possibility: formulate the style sheet language as an SGML DTD, so that each style sheet will be an SGML document. Since the browser already knows how to parse SGML, no extra effort will be needed.

Another approach would involve adopting the syntax of some existing language with a very simple syntax: TCL and Lisp come to mind, but Scheme is probably the most plausible candidate in this line of thought, since DSSSL incorporates Scheme for some purposes (e.g. for variables and to express conditionality).

We recommend strongly that a subset of DSSSL be used to formulate style sheets for use on the World Wide Web; with the completion of the standards work on DSSSL, there is no reason for any community to invent their own style-sheet language from scratch. The full DSSSL standard may well be too demanding to implement in its entirety, but even if that proves true, it provides only an argument for defining a subset of DSSSL that must be supported, not an argument for rolling our own. Unlike home-brew specifications, a subset of a standard comes with an automatically predefined growth path. We expect to work on the formulation of a usable, implementable subset of DSSSL for use in WWW style sheets, and invite all interested parties to join in the effort.

How Do We Get There from Here?

We envisage the Web moving towards full SGML support in stages, each stage providing a bit of added function. The first stage will see three very loosely coupled developments:

The second stage will naturally grow upon the first:

The third phase will be the explosive one.

Ultimately, we will emerge into a Web with a small number of standard DTDs, each with many individual variations. Awareness of SGML, at the very least use of dynamic style sheets, will be commonly built into the basic browser, although commercial spawnable viewers will provide specialized function. The style sheets for each major DTD will be available on many different servers, and will probably be cached on most clients; only the individual deltas will be transmitted with each document. And the fact that more information is preserved in each document download will mean increased use of specialized secondary "interactive viewers" like SAS, Mathematica or Maple or gnuplot, and so forth.

References

  1. ACH/ACL/ALLC (Association for Computers and the Humanities, Association for Computational Linguistics, and Association for Literary and Linguistic Computing). Guidelines for Electronic Text Encoding and Interchange, ed. C. M. Sperberg-McQueen and Lou Burnard. Chicago, Oxford: Text Encoding Initiative, 1994.
  2. Berners-Lee, Tim, and Daniel Connolly. Hypertext Markup Language: A Representation of Textual Information and Metainformation for Retrieval and Interchange. (Draft, expired 14 January 1994.)
  3. ISO (International Organization for Standardization). ISO 8879-1986 (E). Information processing --- Text and Office Systems --- Standard Generalized Markup Language (SGML). First edition --- 1986-10-15. [Geneva]: ISO, 1986.
  4. Schatz, Bruce R., and Joseph B. Hardin. NCSA Mosaic and the World Wide Web: Global Hypermedia Protocols for the Internet. Science 265 (12 August 1994): 895-901.

Biographies

C. M. Sperberg-McQueen studied Germanic languages and literatures at Stanford University and the universities of Bonn, Berlin (Free University), and Goettingen. Since 1987 he has been a research programmer at the academic computer center of the University of Illinois at Chicago, where he currently works in the network services group. He is a member of the Association for Computers and the Humanities, the Association for Literary and Linguistic Computing, and the Association for Computational Linguistics. Since 1988 he has been editor in chief of the ACH/ACL/ALLC Text Encoding Initiative.

Robert F. Goldstein has been a research programmer in the academic computer center at the University of Illinois at Chicago since 1987, and currently heads the network services group. He studied physics and biophysics at UC Berkeley and Stanford, and is an adjunct professor in the chemistry department at UIC.

Contact: Robert Goldstein bobg@uic.edu
Michael Sperberg-McQueen cmsmcq@uic.edu