The origin of (document) species

Rohit Khare (a) and Adam Rifkin (b)

(a) University of California at Irvine,
Department of Computer Science, Irvine, CA 92697-3425, U.S.A.

(b) California Institute of Technology,
Computer Science Department, Pasadena, CA 91125, U.S.A.

adam@cs.caltech.edu

Abstract
The World Wide Web's extraordinary reach is based in part on its open assimilation of document formats. Although Web transfer protocols and addressing can accommodate any kind of resource, the unique application context of a truly global hypermedia system favours the adoption of certain Web-adapted formats. In this paper we consider the evolutionary record that has led to the ascent of the eXtensible Markup Language (XML). We present a taxonomy of document species in the Web according to their syntax, style, structure, and semantics. We observe the preferential adoption of SGML, CSS, HTML, and XML, respectively, which leverage a parsimonious evolutionary strategy favouring declarative encodings over Turing-complete languages; separable styles over inline formatting; declarative markup over presentational markup; and well-defined semantics over operational behavior. The paper concludes with an evolutionary walkthrough of citation formats. Ultimately, combined with the self-referential power of the Web to document itself, we believe XML can catalyze a critical shift of the Web from a global information space into a universal knowledge network.

Keywords
Markup languages; Metadata systems; Information retrieval and modeling; HTML; XML

1. The origin of document species

The World Wide Web is defined as the "universe of network-accessible information" [3]. Such audacity may be a hallmark of hypermedia systems development, but the Web has delivered on this promise in spectacular fashion, for two reasons. The first reason is the openness and content-neutrality of the HyperText Transfer Protocol [9], which can adapt to exchange any document format, and of Uniform Resource Locators [2], which can represent links to any document format from within many document formats.

If content-neutrality were the whole story, that hypothesis would predict a proliferation of divergent document species in the Web's docuverse, fragmenting the market among many competing word-processing formats, spreadsheet formats, image formats, and so on.

Instead, we observe the second reason for the Web's success: "natural selection" appears to have favoured a few formats that have been explicitly adapted to the Web. Among the profuse variation of document syntax, style, structure, and semantics, we observe the preferential adoption of SGML, CSS, HTML, and XML. Each leverages a parsimonious evolutionary strategy favouring declarative encodings over Turing-complete languages; separable styles over inline formatting; declarative markup over presentational markup; and well-defined semantics over operational behavior.

Whereas the first explanation implies a passive Web that accommodates all document formats equally, the second argues that the medium itself favours evolution from information capture towards knowledge representation. The key is that the Web can be leveraged reflexively to capture a document's structure and semantics — that any community can define its own ontology, or adopt, extend, and combine others. In this context, we argue that the emergence of XML-based formats does not merely represent a slew of new competitors, but an ecosystem of interdependent document species.

1.1. Variation: business cards

In the spirit of Darwin's own investigation, let us "acknowledge plainly our ignorance of the cause of each particular variation" [7] and study some of the alternative designs for representing a chunk of knowledge on the Web. Specifically, consider business cards — metadata about people.

In the abstract, business cards are consistent enough: names, titles, addresses, phone and fax numbers, corporate insignia, and so on. In reality, there are innumerable variations of physical form, language, and function. This applies to personal identification on the Web as well.

A card can be represented with natural language in a text file, in any layout. Email .signatures are slightly more regularized, and some programs even attempt to extract personal data from them automatically. Other text formats, such as vCard [21], are explicitly designed for such extraction. Bitmaps or drawings can represent a card visually, at the cost of the machine-readability of the data. An interactive applet could even represent it as an executable animation.
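
For instance, a vCard entry encodes the same personal data as line-oriented text properties. The following is a minimal sketch with invented values, not an excerpt from the specification:

   BEGIN:VCARD
   VERSION:2.1
   N:Public;Jane
   FN:Jane Q. Public
   ORG:Example Corporation
   TITLE:Director of Research
   TEL;WORK;VOICE:+1-555-555-1234
   EMAIL;INTERNET:jane@example.com
   END:VCARD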

Web authors have several more prosaic choices as well. HTML formatting can capture its appearance as well as the textual data. Structured HTML, such as the <ADDRESS> tag, can indicate the role of the data and embed it in other documents.

Even the HTML 4.0 tagset does not include structural markup specific enough to represent the logical components of a business card [19]. HTML's centralized evolution caters to the lowest common denominator of document markup. XML, by contrast, is designed for decentralized development of extensible tagsets [4]. Using XML, an author can create a business card Document Type Definition to define specific tags such as <TITLE> and <EMPLOYER> [13].
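
A minimal sketch of such a DTD follows; apart from <TITLE> and <EMPLOYER>, the element names are hypothetical, chosen only to illustrate the approach:

   <!ELEMENT CARD     (NAME, TITLE, EMPLOYER, PHONE*, EMAIL*)>
   <!ELEMENT NAME     (#PCDATA)>
   <!ELEMENT TITLE    (#PCDATA)>
   <!ELEMENT EMPLOYER (#PCDATA)>
   <!ELEMENT PHONE    (#PCDATA)>
   <!ELEMENT EMAIL    (#PCDATA)>

Any document valid against this definition is guaranteed to name its holder, title, and employer, in that order.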

1.2. Outline of this paper

There is as much variability in the electronic representation of cards as in the cards themselves. In this paper, we explore some of the aspects of variation in syntax, style, structure, and semantics, and discuss which alternatives seem better adapted to the Web. Each of these issues can be identified in our evolutionary walkthrough of citation formats. Finally, we conclude with a taxonomy of several popular document formats and some thoughts about other evolutionary strategies behind the Web's success.

2. Evolving syntax: from Turing-complete to declarative formats

Consider several alternative encodings to represent a public key/name binding: X.509 certificates in binary Abstract Syntax Notation (ASN.1) [11] format; a PGP key information block using readable text and hexadecimal digits; or even an executable program that generates the key on demand. The choice depends on several tradeoffs between the cost of reading, writing, editing, and maintaining each syntax. We characterize document formats based on their choice of binary or text encoding; declarative or Turing-complete grammar; and mission specificity or generality.

                Declarative                 Turing-complete
                Text        Binary          Text            Binary

   Specific     MIF         Dump            JavaScript      Intel x86
   General      SGML        ASN.1           UNIX Scripts    COFF

Fig. 1. Examples of document syntax according to encoding, grammar, and mission. MIF stands for FrameMaker Interchange Format; COFF is the Common Object File Format.

The first tradeoff is along the spectrum from binary machine language to textual natural language. Initially, machine-specific coding seems less expensive because it directly mirrors data structures in memory. However, that mapping can be too brittle to use across multiple platforms (because of, for example, differences in endianness) and multiple software versions. Binary coding may also take greater space and time to pack and unpack (a criticism leveled against ASN.1). Text formats, especially those based on S-expressions or other context-free grammars, can be nearly as efficient. In return, even partially human-readable forms are easier to edit, repair, and extend.
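
To make the contrast concrete, the same record might travel as an opaque memory dump or as a self-delimiting S-expression; the field names below are invented for illustration:

   (card
     (name  "Jane Q. Public")
     (phone "+1-555-555-1234"))

The textual form survives byte-order and version changes, and can be inspected or repaired in any editor.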

The second tradeoff is along the spectrum from declarative formats to Turing-complete programs. It can be more compact to send a program that generates the data than to send the data itself: it is certainly easier to send a program that calculates pi than to transmit a billion digits of it. PostScript and TeX are powerful examples of programming languages for drawing and typesetting documents. On the other hand, it can be formally impossible to manipulate or convert such documents: extracting the third word from an arbitrary PostScript program is equivalent to solving the Halting Problem [6]. Declarative formats, such as those based on context-free grammars, are formally tractable, allowing reliable document interchange and maintenance.

Finally, there is a tradeoff between using mission-specific or generic formats. To the degree that information reuse is a critical concern, there is further value in leveraging a family of related grammars. SGML, the Standard Generalized Markup Language [10] for defining and using portable document formats, was designed to be formal enough to allow proofs of document validity, structured enough to handle complex documents, and extensible enough to support the management of large information repositories. The power of SGML is reflected in its flexibility in managing documents of all types, from manuals and press releases to legal contracts and project specifications; and in its reusability for batch processors to produce books, reports, and electronic editions from the same source file(s).

There is a fundamental tension between the performance, cost, and usability of machine-readable and human-readable encoding strategies. The SGML approach strikes a reasonable balance between the two. Human readability implies robustness, and machine-readability implies validity; both qualities add value to information and ease the evolution of documents over time.

3. Evolving style: from formatting to style sheets

As long as there have been documents, there have been authors and designers agonizing over each stroke of the pen, each piece of type, each picture placement. The more we can capture the abstract style of a layout and reuse it, the more value both the design and the documents themselves have. The evolutionary history of Web document formats favours externalized formatting over embedded directives precisely so information can be represented independently of style, and vice versa [17].

There have been many approaches to inline formatting-oriented representations, from troff commands to Rich Text Format (RTF) directive stacks to HTML's font tags. Almost inevitably, these are complemented with reusable formatting shortcuts: macro packages, rulers, and browser appearance parameters, respectively. Building on this experience, the Web's premier document formats, HTML and XML, allow formatting to be externalized with style sheets. Cascading Style Sheets (CSS) go further by allowing the composition of separate styles, encouraging separation of concerns such as font and colour properties, character sets, graphic flows, and layout [15]. Furthermore, formatting control is shared between the author and the reader, who can interpose his or her own chain of style sheets.
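
To illustrate, compare a fragment that hard-codes its appearance with structural markup styled by a single external rule; the class name is hypothetical:

   <!-- inline formatting, repeated at every occurrence -->
   <FONT COLOR="red"><B>Call for Papers</B></FONT>

   <!-- structural markup; one external CSS rule covers every instance -->
   <EM CLASS="announcement">Call for Papers</EM>

with the accompanying style sheet rule:

   EM.announcement { color: red; font-weight: bold }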

Styles are not limited to visual treatments — they can control rendering for displays, paper, audio, Braille, and many other media. This is especially valuable for adapting content to accommodate physical disabilities, dyslexia, and illiteracy — as well as such situational constraints as talking on the phone or working in a noisy environment [22]. Audio streams can be transcribed to text over the Web for deaf users, and audio browsers such as pwWebSpeak can read pages aloud, styled according to Aural CSS, for users unable to read them [18,16].

Though Web technology supports both inline formatting and style sheets, managing large hypertext information systems almost requires externalized formatting if they are to remain navigable and usable.

4. Evolving structure: from presentational to declarative

The anatomy of a newspaper article includes a headline, byline, body, and footer. Various competing document formats attempt to capture these structures in their representations. Some describe chunks of the document in presentational terms: bold, italic, indented, and so on. At the other end of the spectrum, some use declarative terms: title, address, keyboard-input, and so on. Many other formats can be found somewhere in between on this scale, such as the HTML tag <EM> compared to <I> or <ADDRESS>.

Another kind of declarative structure for SGML (and XML) applications is the Document Type Definition (DTD), which can require valid documents to include certain elements in a specified order ("every occurrence of <STATE> must be prefixed by the <CITY> tag and followed by <ZIP>").
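
In DTD notation, such an ordering constraint is stated directly in an element's content model. A one-line sketch of the example above (the enclosing <ADDRESS> element is hypothetical):

   <!ELEMENT ADDRESS (CITY, STATE, ZIP)>

A validating parser will reject any <ADDRESS> whose children are missing or out of order.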

Choosing along this axis entails tradeoffs between accuracy and comprehensibility: weak presentational semantics are more universally understood than narrowly useful declarative types. <CITE> became part of the original HTML repertoire, but <ABSTRACT> did not, so the only alternative for WWW7 Conference authors is the presentational <I> tag. Declarative markup that clearly indicates the role of various document parts finds its value in later reuse. For example, indexing engines could weight terms from an abstract more heavily, or automatically extract the reporters from a set of newspaper clippings using an information-capture tool such as webMethods' Web Interface Definition Language (WIDL) [1].

Choosing to describe document structure by its function rather than its form calls for extensible tag sets. Centralized evolution of HTML precludes adding an exhaustive list of all possible document idioms. A new tag potentially has ambiguous grammar (is it an empty element, or does it pair up with an end-tag?), ambiguous semantics (no metadata about the ontology it is based on), and ambiguous presentation (especially without style sheet hooks). SGML definitively addresses these issues for new DTDs, but the engineering costs are compounded because the SGML specification does not follow accepted computer-science conventions for the description of languages [12], a lesson relearned during the protracted effort to standardize HTML 2.0 as an SGML application [6].
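
XML dissolves the grammatical ambiguity, at least, with explicit syntax for both cases; the tag name here is invented:

   <KEYWORDS/>                       <!-- unambiguously an empty element -->
   <KEYWORDS>XML, SGML</KEYWORDS>    <!-- unambiguously a container -->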

Communities of interest need to publish their own definitions easily, a process facilitated by using XML. These new definitions can even reach past specifying roles to include interpretations and behaviors; that is, they support new semantics.

5. Evolving semantics: from operational to well-defined

The ultimate test of a document format is how well (or poorly) it supports the uses of its contents. Documents exist as artifacts of larger processes like purchasing, reporting, or software development, and these uses bind semantic meaning to parts of a document.

Format support for semantics falls along a spectrum of disclosure: from undocumented, through operational (behavior hard-coded into the processes that manipulate the contents), to well-defined (openly available and documented definitions of the contents).

Consider a "to do" list in a text file; it is used as a natural language tickler by an end-user. An HTML version with deadlines for each entry might be parsed by a script that reads the contents and alerts the user. Yet a third variant, in XML, might declare a ToDo DTD that explicitly defines a <DEADLINE> container for ISO quoted dates. Only the latter format can claim to have well-defined semantics, bound to the document itself rather than through any one application. It can be embedded within other XML documents and exchanged with other communities while retaining its unambiguous definition.

XML supports casual ontologies cost-effectively through dynamically composable DTDs and validatable instances — addressing the twin failings of SGML: static composition and DTD-dependent validation. Furthermore, XML DTDs are named by URLs, thereby decentralizing, and thus accelerating, the cycle of publishing and adopting new document formats.
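
A document can point at its published definition with a system identifier, for example (the URL is hypothetical):

   <!DOCTYPE BIB SYSTEM "http://library.org/bib.dtd">

Publishing a new document format thus becomes no harder than publishing any other Web resource.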

Ontologies ultimately embody the survival principle that self-representative or self-describing systems reduce the cost of entry. By developing a Web address for some fragment of knowledge containing programs and/or information, and then sharing that address with others, authors let the democratic process play out: anyone in the Web community can endorse the ideas by linking to their handle for future use. Nobody copies the document as is, or copies it with minor changes, because to do so would be too expensive; the cheapest way to propagate ideas is by address. This feeds the cycle of natural selection in knowledge representation: usage determines community, which in turn refines the common ontology.

6. Tracing the evolution of citation formats

Suppose while reading this paper you encounter the citation:

"XML, Java, and the Future of the Web", Jon Bosak, World Wide Web Journal, 2(4):219-228, (1997).

Using your intuition, you could ascertain this reference's meaning, whereas a digital parser might be unable to do so. Worse, because of different publishing conventions, another publisher, say the Association for Computing Machinery, might format the reference differently:

J. Bosak, World Wide Web Journal, "XML, Java, and the Future of the Web", 1997, Vol. 2, No. 4, pp. 219-228.

Even minor differences in punctuation and notation can disrupt a computer trying to parse that reference; as a result, automating the conversions between different formats representing the same kind of knowledge is challenging. Brittle syntax, carrying only formatting-level and operational semantics, may be meaningful enough for human readers, but it provides little information for machine readability.

By reformatting the citation using presentational HTML markup, the citation becomes more accessible, even if the actual rules ("the second italic phrase is the publication") are invisible:


   <I>XML, Java, and the Future of the Web</I>,
   <I>World Wide Web Journal</I>,
   <TT>Jon Bosak</TT>,
   <B>2(4):219-228</B>,
   <I>1997</I>.

Although HTML lets an author provide some structuring, it requires authors and readers to agree on a convention for the meanings of attributes and values and how they are marked up. This (ambiguous) presentational structuring allows authors and readers to highlight what is important, but it really just represents an evolutionary waystation towards more meaningful structural markup.

By reformatting the citation using structural HTML markup, the actual interpretation rules become somewhat more distinct ("anything in a CITE is a citation title"):


   <CITE>XML, Java, and the Future of the Web</CITE>
   <H2>World Wide Web Journal</H2>
   <H3>Jon Bosak</H3>
   <UL>
   <LI> 2(4):219-228
   <LI> 1997
   </UL>

The use of structure in citations — the headers and list elements, as well as the more specific use of <CITE> — can enhance the formatting as well.

By reformatting the citation using a customized XML citation DTD, the actual interpretation rules become very precise:


<BIB>
   <TITLE>XML, Java, and the Future of the Web</TITLE>
   <JOURNAL>World Wide Web Journal</JOURNAL>
   <AUTHOR>
     <FIRSTNAME>Jon</FIRSTNAME>
     <LASTNAME>Bosak</LASTNAME>
   </AUTHOR>
   <VOLUME>2</VOLUME>
   <NUMBER>4</NUMBER>
   <YEAR>1997</YEAR>
   <PAGES>219-228</PAGES>
</BIB>

While there is an XML DTD for citations behind this example, it is still well-formed even without it. Most readers have enough context to understand its semantics, while machines can still manipulate it reasonably well ("list all the AUTHORs").
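
The DTD behind this example might resemble the following sketch, reconstructed from the instance above rather than taken from any published definition:

   <!ELEMENT BIB       (TITLE, JOURNAL, AUTHOR+, VOLUME, NUMBER, YEAR, PAGES)>
   <!ELEMENT AUTHOR    (FIRSTNAME, LASTNAME)>
   <!ELEMENT TITLE     (#PCDATA)>
   <!ELEMENT JOURNAL   (#PCDATA)>
   <!ELEMENT FIRSTNAME (#PCDATA)>
   <!ELEMENT LASTNAME  (#PCDATA)>
   <!ELEMENT VOLUME    (#PCDATA)>
   <!ELEMENT NUMBER    (#PCDATA)>
   <!ELEMENT YEAR      (#PCDATA)>
   <!ELEMENT PAGES     (#PCDATA)>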

This kind of semantic markup allows the information model to be more descriptive, so a machine can capture things a community takes for granted. Syntactic problems such as character encodings and punctuation are defined using structured annotations; document manipulations such as restructuring and filtering can be automated; and each component of a document can be precisely identified.

By reformatting the citation using a combination of public XML citation-related DTDs, the actual interpretation rules remain precise, as well as being accessible on demand from the Web itself:


<?namespace href="http://library.org/bibliography-info" as="BIB"?>
<?namespace href="http://www.w3.org/schemas/rdf-schema" as="RDF"?>
<?namespace href="http://oclc.org/DublinCore/RDFschema" as="DC"?>
<RDF:serialization>
 <RDF:assertions href="http://assertions.org/bib-doc">
   <BIB:TITLE href="http://sunsite.unc.edu/pub/sun-info/standards/xml/why/xmlapps.htm">
     XML, Java, and the Future of the Web
   </BIB:TITLE>
   <BIB:JOURNAL href="http://www.w3j.com/">
     World Wide Web Journal
   </BIB:JOURNAL>
   <DC:Creator>Jon Bosak</DC:Creator>
   <BIB:VOLUME>2</BIB:VOLUME>
   <BIB:NUMBER>4</BIB:NUMBER>
   <BIB:YEAR>1997</BIB:YEAR>
   <BIB:PAGES>219-228</BIB:PAGES>
 </RDF:assertions>
</RDF:serialization>

These tags are themselves defined elsewhere on the Web; this reflexivity is what enables knowledge representation at Web scale.

Efforts such as the Platform for Internet Content Selection (PICS) [20] and the Resource Description Framework (RDF) [14] provide mechanisms for transferring machine-readable metadata describing resources among communities. PICS attaches labels to Web resources, using a URL to identify the rating service and rating scheme.

RDF combines the PICS extensions with the metadata model in Netscape's Meta Content Framework (MCF), yielding both a metadata representation model and an XML-based syntax for metadata capture and transfer. An RDF schema, named using a URL, gives a human- and machine-readable set of assertions of attribute-value pairs. Applying technologies such as PICS and RDF to community ontologies helps determine commonly understood meanings for those tags within any given community that uses them.

7. Evolution, not revolution

The World Wide Web Consortium, the driving force behind XML, sees its mission as leading the evolution of the Web. In the competitive market of Internet technologies, it is instructive to consider how the Web trounced competing species of protocols. Though it shared several adaptations common to Internet protocols, such as "free software spreads faster", "ASCII systems spread faster than binary ones", and "bad protocols imitate; great protocols steal", it leveraged one unique strategy: "self-description". The Web can be built upon itself. Universal Resource Identifiers, machine-readable data formats, and machine-readable specifications can be knit together into an extensible system that assimilates any competitor. In essence, the emergence of XML on the spectrum of Web data formats caps the struggle to realize the Web creators' original vision.

In fact, the Web appropriated the philosophy of content-neutrality from MIME types: it learned how to adapt to any document type, new or established, equally well. On the other hand, some types were more equal than others: the Web prefers HTML over PostScript, Microsoft Word, and many others. This preference indicates a general trend over the seven years of Web history from stylistic formatting to structural markup to semantic markup. Each step in the ascent of XML adds momentum to Web applications:

Document format    Syntax                 Style                   Structure                Semantics
ASN.1              Binary                 -                       Type-Length-Value        Per-application
Text               ASCII, Unicode...      -                       Lines                    Natural language
troff              Readable text          Inline directives       Sections, pages          Typesetting
TeX                Readable program       LaTeX                   Sections, pages          Typesetting
PostScript         Programming language   -                       Pages                    Drawing
Rich Text Format   Opaque text            Extensible directives   Characters, paragraphs   -
HTML formatting    Readable text          Nested directives       Presentational           -
HTML structure     Readable text          CSS                     Declarative              Fixed (e.g., <ADDRESS>)
XML                Readable text          CSS, XSL                Declarative              Extensible
PICS               S-expressions          -                       Ratings                  Metadata schema
RDF                XML text               -                       Declarative              Metadata schema

Fig. 2. Comparison of several species in the evolution of Web document formats.

This evolution toward declarative formats exists not only in the realm of documents, but in the programming community as well. The same forces appear to be at work in the successive development of more declarative, less operational programming languages: from machine code to assembly to C to Modula-3 to Java.

The Web itself is becoming a kind of cyborg intelligence: human and machine, harnessed together to generate and manipulate information. If automatability is to be a human right, then machine assistance must eliminate the drudge work involved in exchanging and manipulating knowledge, as MIT Laboratory for Computer Science Director Michael Dertouzos has argued [8]. The shift from structural HTML markup to semantic XML markup is a critical phase in the struggle to transform the Web from a global information space into a universal knowledge network.


Acknowledgements

We thank the reviewers, Dan Connolly, Tim Berners-Lee, and Doug Lea for their comments and recommendations.

Mr. Khare's work was sponsored by the Defense Advanced Research Projects Agency and Air Force Research Laboratory, Air Force Materiel Command, USAF, under agreement number F30602-97-2-0021. He would also like to thank MCI Internet Architecture for its support in this research.

Mr. Rifkin's work was supported under the Caltech Infospheres Project, sponsored by the CISE directorate of the National Science Foundation under Problem Solving Environments grant CCR-9527130 and by the NSF Center for Research on Parallel Computation under Cooperative Agreement Number CCR-9120008.


References

  1. C. Allen, WIDL: Automating the Web with XML, World Wide Web Journal, 2(4): 229–248, Autumn 1997, available at http://www.webmethods.com/technology/widl.html
  2. T. Berners-Lee, L. Masinter, and M. McCahill, Uniform Resource Locators (URL), RFC 1738, December 1994, available at http://www.w3.org/Addressing/rfc1738.txt
  3. T. Berners-Lee, WWW: past, present, and future, IEEE Computer, 29(10), October 1996.
  4. T. Bray, J. Paoli, and C.M. Sperberg-McQueen, Extensible Markup Language (XML): Part I, Syntax, World Wide Web Consortium Working Draft (Work in Progress), August 1997, available at http://www.w3.org/TR/WD-xml-lang.html
  5. T. Bray and S. DeRose, Extensible Markup Language (XML): Part II, Linking, World Wide Web Consortium Working Draft (Work in Progress), July 1997, available at http://www.w3.org/TR/WD-xml-link.html
  6. D. Connolly, Toward a formalism for communication on the Web, 1994, available at http://www.w3.org/MarkUp/html-spec/html-essay.html
  7. C. Darwin, Origin of Species by Means of Natural Selection, or the Preservation of Favored Races in the Struggle for Life, 6th edition, 1872, available at http://149.152.105.38/Honors/EText/Darwin/DarwinOriginContents.html
  8. M. Dertouzos, What Will Be, HarperEdge, 1997.
  9. R. Fielding, J. Gettys, J. Mogul, H. Frystyk, and T. Berners-Lee, Hypertext Transfer Protocol, HTTP/1.1, RFC 2068, January 1997, available at http://www.w3.org/Protocols/rfc2068/rfc2068
  10. ISO 8879:1986, Standard Generalized Markup Language (SGML), 1986.
  11. ISO 8825:1987, Information processing systems – Open systems interconnection – Specification of basic encoding rules for Abstract Syntax Notation One (ASN.1), 1987.
  12. M.J. Kaelbling, On improving SGML, Electronic Publishing: Origination, Dissemination and Design (EPODD), 3(2): 93–98, May 1990.
  13. R. Khare and A. Rifkin, XML: A Door to automated Web applications, IEEE Internet Computing, 1(4): 78–87, July/August 1997, available at http://www.cs.caltech.edu/~adam/papers/xml/x-marks-the-spot.html
  14. O. Lassila and R. Swick, Resource Description Framework (RDF) model and syntax, World Wide Web Consortium Working Draft (Work in Progress), October 1997, available at http://www.w3.org/Metadata/RDF
  15. H.W. Lie and B. Bos, Cascading Style Sheets: Designing for the Web. Addison-Wesley, Reading, MA, April 1997, 256 pp.
  16. C. Lilley and T.V. Raman, Aural Cascading Style Sheets (ACSS), World Wide Web Consortium Working Draft (Work in Progress), June 1997, available at http://www.w3.org/TR/WD-acss
  17. T.H. Nelson, Embedded markup considered harmful, World Wide Web Journal, 2(4): 129–134, Autumn 1997.
  18. M. Paciello, People with disabilities can't access the Web!, World Wide Web Journal, 2(1): 173–182, Winter 1997, available at http://www.w3j.com/5/s3.paciello.html
  19. D. Raggett, A. Le Hors, and I. Jacobs, HTML 4.0 Specification, World Wide Web Consortium Working Draft (Work in Progress), September 1997, available at http://www.w3.org/TR/WD-html40/
  20. P. Resnick and J. Miller, PICS: Internet access controls without censorship, Communications of the ACM, 39: 87–93, 1996, available at http://www.w3.org/pub/WWW/PICS/iacwcv2.htm
  21. versit Consortium, vCard specification Version 2.1, January 1997, available at http://www.imc.org/pdi/vcardwhite.html
  22. Web Accessibility Initiative (WAI), World Wide Web Consortium, 1997, available at http://www.w3.org/WAI/Activity.html

Vitae

Rohit Khare joined the Ph.D. program in computer science at the University of California, Irvine in Fall 1997, after serving as a member of the MCI Internet Architecture staff. He was previously on the technical staff of the World Wide Web Consortium at MIT, where he focused on security and electronic commerce issues. He has been involved in the development of cryptographic software tools and Web-related standards development. Rohit received a B.S. in Engineering and Applied Science and in Economics from California Institute of Technology in 1995.
 
Adam Rifkin received his B.S. and M.S. in Computer Science from the College of William and Mary. He is presently pursuing a Ph.D. in computer science at the California Institute of Technology, where he works with the Caltech Infospheres Project on the composition of active distributed objects. He has done Internet consulting and performed research with several organizations, including Canon, Hewlett-Packard, Griffiss Air Force Base, and the NASA-Langley Research Center.