Preprocessing instructions: Embedding external notations in HTML

Philip Thrift
Member, Technical Staff
Texas Instruments
Dallas TX USA

Abstract

In authoring HTML, one is frequently challenged to include notations external to HTML. Notations like LaTeX or pic may be useful to embed in an HTML document. Typically this process involves either:

manually creating separate non-HTML documents containing the external notation and translating these to images to embed in the HTML source, or
using a conversion facility like latex2html to translate an entire non-HTML document to HTML

Neither approach is particularly practical for an author who works primarily in HTML and needs to escape to an external notation only occasionally. In this paper a technique for directly embedding external notations in HTML, called preprocessing instructions (PPIs), is presented.

Advantages gained by including PPIs are:

all of the data content is available for indexing
changes can easily be incorporated in the source document
notations more suitable for particular expressions can be easily utilized

Finally, SGML issues for PPIs will be discussed.

Introduction

Preprocessing instructions are proposed as a means of embedding non-HTML data and associated translation programs (called filters) in HTML documents. They are useful for encoding data objects not currently expressible in HTML (e.g. math, tables, graphics, etc.) within the HTML document itself, facilitating authoring. In the implementation presented in this document they are syntactically analogous to server-side includes.

Server-side includes (SSIs) provide a way to embed deliver-time information in HTML documents. NCSA's httpd supports them for HTML documents that follow certain conventions of location and file type. In a typical setup, HTML files with embedded SSIs have a .shtml file extension, to distinguish them from "pure" HTML files with a .html file extension. According to the documentation:

All directives to the server are formatted as SGML comments within the document. This is in case the document should ever find itself in the client's hands unparsed. Each directive has the following format:

<!--#command tag1="value1" tag2="value2" -->


SSIs provide a way of including deliver-time (or run-time) information within the delivered HTML document, such as the date or text sections that are periodically updated. The approach is perhaps controversial with regard to server load and security, and certainly with regard to SGML correctness. These issues will not be addressed here.

While SSIs are a run-time processing feature, preprocessing instructions (PPIs) are proposed here as a compile-time processing feature. Syntactically, in this experimental implementation, they are similar to SSIs.

Preprocessing instructions

PPIs are used to embed data in HTML documents that is then passed to designated filters that translate the data, the result being incorporated in a new HTML document. HTML documents with PPIs will have the file extension .phtml (pre-HTML) and will be called PHTML files. cphtml is a PHTML compiler with the following usage:

cphtml [-s] [doc.phtml]+

where one or more PHTML files appear on the command line, with the result being for each file either a corresponding doc.html file, or a doc.shtml file, if the -s flag is present.
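
The naming rule can be illustrated with a short Python sketch. It is not taken from the compiler source; the function name output_path and its ssi parameter are hypothetical stand-ins for the behaviour of the -s flag described above.

```python
import os


def output_path(phtml_path, ssi=False):
    """Map doc.phtml to doc.html, or to doc.shtml when -s is given."""
    # Strip the .phtml extension to obtain the file name root.
    root, _ext = os.path.splitext(phtml_path)
    return root + (".shtml" if ssi else ".html")
```

Under these assumptions, output_path("doc.phtml") gives the name of the plain HTML result and output_path("doc.phtml", ssi=True) the name of a result that still contains SSIs for the server to process.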

The format for embedding PPIs within PHTML files is:

<!--%filter cdata-->

where filter is the program that will process the character data in cdata.

Note: filter cannot contain a space character, and cdata cannot contain the close comment delimiter (--). There must be at least one space character between filter and cdata.

When cphtml is executed, it finds each PPI within the doc.phtml file. The designated filter program is executed with two additional environment variables:

$PPIDOC=doc
$PPINUM=ppi#

where doc is the file name root (doc in the usage description above) and ppi# is the count of the PPI in the file (starting at 1). The cdata is passed to the program on standard input (EOF is reached at the close comment delimiter --, which cannot appear in cdata). Whatever filter writes on standard output is inserted in the output HTML file.
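
The processing steps just described can be sketched in Python as follows. This is only an illustration under stated assumptions, not the compiler's actual source: the names PPI_PATTERN and compile_phtml are hypothetical, and the comment syntax <!--%filter cdata--> is inferred from the examples in this paper.

```python
import os
import re
import subprocess

# Assumed PPI syntax: <!--%filter cdata-->, with whitespace between the
# filter name and cdata. Since cdata may not contain "--", the first
# "-->" after the opening delimiter closes the PPI.
PPI_PATTERN = re.compile(r"<!--%(\S+)\s+(.*?)-->", re.DOTALL)


def compile_phtml(text, doc):
    """Return text with every PPI replaced by its filter's standard output."""
    count = 0

    def run_filter(match):
        nonlocal count
        count += 1  # PPINUM counts the PPIs in the file, starting at 1
        env = dict(os.environ, PPIDOC=doc, PPINUM=str(count))
        result = subprocess.run(
            [match.group(1)],      # the filter program, run without arguments
            input=match.group(2),  # cdata arrives on the filter's standard input
            env=env,
            capture_output=True,
            text=True,
            check=True,            # stop if a filter exits with an error
        )
        return result.stdout

    return PPI_PATTERN.sub(run_filter, text)
```

For a file doc.phtml, the sketch would be called as compile_phtml(source_text, "doc"), and each filter would see PPIDOC=doc together with its own PPINUM. Filters are looked up on the PATH, so a filter such as TEX must be installed there.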

Embedding external notations

A typical use of PPIs is to be able to embed external notations in HTML documents, and to designate a filter to convert the external notation into HTML. For example:

where TEX produces a transparent gif image named doc-ppi#.gif (doc and ppi# being the values of the environment variables passed to TEX) from the given LaTeX data and returns, for example,

<IMG src="doc-ppi#.gif" ALT="doc-ppi#">

This would appear in a viewer as:

The TEX filter can be used to produce inlined transparent gifs for tables and mathematical entities. Here is a shell script for the TEX filter:

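      #!/bin/sh
      # TEX filter: wrap the LaTeX read from standard input in a minimal
      # document, render it to a transparent gif, and print an IMG tag.

      DOC=$PPIDOC
      NUM=$PPINUM

      # Document preamble (quoted delimiter: no shell expansion inside).
      cat > "$DOC-$NUM.tex" << 'HEAD'
      \documentstyle[12pt]{article}
      \unitlength 0.2in
      \thispagestyle{empty}
      \begin{document}
      \noindent
      HEAD
      # The cdata of the PPI arrives on standard input.
      cat >> "$DOC-$NUM.tex"
      cat >> "$DOC-$NUM.tex" << 'FOOT'
      \end{document}
      FOOT

      # latex -> dvi -> PostScript -> gif -> transparent gif
      latex "$DOC-$NUM.tex" > .latex-errors
      dvips "$DOC-$NUM.dvi" > "$DOC-$NUM.ps"
      pstogif "$DOC-$NUM.ps" "$DOC-$NUM.GIF" > .pstogif-errors
      giftrans -t 1 -b 0 "$DOC-$NUM.GIF" > "$DOC-$NUM.gif"

      # Remove intermediate files; only the final gif is kept.
      rm "$DOC-$NUM.tex" "$DOC-$NUM.dvi" "$DOC-$NUM.ps" \
         "$DOC-$NUM.aux" "$DOC-$NUM.log" "$DOC-$NUM.GIF"

      # printf is used because echo -n is not portable across /bin/sh versions.
      printf '<IMG SRC="%s.gif" ALT="%s">' "$DOC-$NUM" "$DOC-$NUM"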

Here is an example from a PHTML file where some embedded math appears:


       <P>Euler's equation looks like:
       <P ALIGN=center><A NAME=euler>
         <!--%TEX
            \Large{\[ e^{i\pi} + 1 = 0  \]}
         --></A>
       <P>where <!--%TEX \(e\)--> is the natural 
       logarithm base 
       and <!--%TEX \(i = \sqrt{-1}\)-->. In larger type, this
       looks like <!--%TEX \large{\(i = \sqrt{-1}\)}-->.

This is then translated by cphtml into:

Euler's equation looks like:

\[ e^{i\pi} + 1 = 0 \]

where \(e\) is the natural logarithm base and \(i = \sqrt{-1}\). In larger type, this looks like \(i = \sqrt{-1}\).

An example of embedding a table in HTML (whenever HTML 2.0 is released, tables should be supported natively):


      <P ALIGN=center><A NAME=table1>
      <!--%TEX
          \begin{tabular}{|l|l|r|} \hline\hline
              {\em type} &
              \multicolumn{2}{c|}{\em style} \\ \hline\hline
                smart    & red  & short \\
                rather silly & blue & tall \\ \hline\hline
          \end{tabular}
      --></A>

which becomes a LaTeX-rendered table:

Here is an example of a picture environment:

   <!--%TEX
     % unitlength default is 0.2in
     \begin{picture}(5,5)(0,0)
       \put(2,2){\circle{4}}
       \put(2,2){\vector(1,1){1}}
     \end{picture}
   -->

which becomes:

PPIs can also be used for picture description languages such as pic. A PIC script (similar to TEX) can turn


           <!--%PIC
           circle rad .25
           spline right 1 then down .5 left 1 then right 1
           circle same
           -->

into an inline image.

Other uses

PPIs have uses beyond embedding filters for external notations. Like SSIs at run-time, they can execute commands at compile-time to embed data in documents. For example:

     I work at <!--%/bin/sh echo -n $ORGANIZATION-->.

is translated into

     I work at Texas Instruments.

They can also be used to embed expressions from other SGML DTDs (along with an associated filter program to produce HTML) and to implement macros. For example,

    <!--%FWBK A.html B.html-->

could produce


  <A href="A.html"><IMG src="/www2/fwd.gif"></A>
  <A href="B.html"><IMG src="/www2/back.gif"></A>

which can be used at the bottom of pages for forward and backward links.
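
A filter behind the FWBK macro might look like the following Python sketch. It is a hypothetical illustration, not one of the filters distributed with this paper: the function name fwbk is an assumption, and the image paths are simply those of the example output above. A real filter would read its cdata from standard input; the last line calls the function directly so that the sketch runs on its own.

```python
def fwbk(cdata):
    """Build forward and backward links from cdata such as 'A.html B.html'."""
    # The first name is the forward target, the second the backward target.
    forward, back = cdata.split()
    return (
        f'<A href="{forward}"><IMG src="/www2/fwd.gif"></A>\n'
        f'<A href="{back}"><IMG src="/www2/back.gif"></A>\n'
    )


# Demonstration with the cdata of the example above.
print(fwbk("A.html B.html"), end="")
```

Installed as a filter, the script would end with sys.stdout.write(fwbk(sys.stdin.read())) in place of the demonstration line, since cphtml delivers the cdata on standard input.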

SGML issues

Embedding PPIs in SGML comment declarations is not the best way to implement this feature, especially as HTML authoring moves to a more SGML-compliant environment. The output of an SGML parser will typically lose all the information in comments, so the data content of PPIs will not be present. For the current prototype, following the SSI example seemed the best course. A future PPI specification should use SGML techniques such as marked sections or NOTATION. A PHTML document type could be defined as an HTML document type with an additional tag PP:

<PP FILTER=filter>cdata</PP>
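
To make the proposed element form concrete, the following Python sketch extracts filter and cdata pairs from PP elements. It rests on several assumptions: the names PP_PATTERN and find_pp are hypothetical, a regular expression stands in for a real SGML parser, and the FILTER value is accepted with or without quotes, a detail the proposal above does not settle.

```python
import re

# Matches <PP FILTER=filter>cdata</PP>; the attribute value may be quoted.
# A regular expression is only a stand-in for a real SGML parser here.
PP_PATTERN = re.compile(
    r'<PP\s+FILTER="?([^\s">]+)"?\s*>(.*?)</PP>',
    re.DOTALL | re.IGNORECASE,
)


def find_pp(text):
    """Return a list of (filter, cdata) pairs for each PP element in text."""
    return PP_PATTERN.findall(text)
```

Because the data now sits in element content rather than in a comment, it would survive SGML parsing and remain available to tools such as indexers.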

Code source

The following can be viewed in the on-line version of this paper at <http://www.ncsa.uiuc.edu/SDG/IT94/IT94Info.html>.

The original PHTML document shows examples of embedded filters (use View Source).
phtml.c: the compiler written in C++ (use CC phtml.c -o phtml to compile)
TEX and PIC: example filters
pstoppm.ps, pstogif and giftrans.c: needed by the example filters (also needed are gs, latex, dvips, ppmtogif, and pnmcrop, available from various public domain sources).

Conclusion

In summary:

PPI filters are executed at compile-time, not run-time
External notations such as LaTeX and pic can easily be embedded in HTML files
Note: in contrast to latex2html, PPIs allow for "native" HTML development, allowing an escape to external notations only when necessary. cphtml makes it easy for HTML authors to incorporate complex mathematical expressions, a forte of LaTeX.
When future versions of HTML can express objects like tables, mathematics, and graphics, PHTML files can be easily converted into new HTML files by modifying the appropriate filters.
PHTML files provide additional information (in external notations) for indexing.

Author

Dr. Philip Thrift received the Ph.D. in Applied Mathematics in 1979 from Brown University and joined Texas Instruments in 1982. In various laboratories at TI, he has done research in image understanding, object-oriented and logic programming, machine learning, database mining and information systems. He has several publications in these areas and holds two patents. He is currently working on networked information system applications.

Contact: thrift@csc.ti.com