Preprocessing instructions: Embedding external notations in HTML

Philip Thrift
Member, Technical Staff
Texas Instruments
Dallas TX USA

Abstract

In authoring HTML, one is frequently challenged to include notations external to HTML. Notations like LaTeX or pic may be useful to embed in an HTML document. Typically this process involves either:

Neither approach is particularly practical for a person who authors primarily in HTML and needs to escape to an external notation only occasionally. This paper presents a technique, called preprocessing instructions (PPIs), for embedding external notations directly in HTML.

Advantages gained by including PPIs are:

Finally, SGML issues for PPIs will be discussed.

Introduction

Preprocessing instructions are proposed as a means of embedding non-HTML data and associated translation programs (called filters) in HTML documents. They are useful for encoding data objects not currently expressible in HTML (e.g. math, tables, graphics) within the HTML document itself, which simplifies authoring. In the implementation presented in this document, they are syntactically analogous to server-side includes.

Server-side includes (SSIs) provide a way to embed deliver-time information in HTML documents. They are supported by NCSA's httpd, provided certain conventions are followed for the location and file type of HTML documents containing SSIs. In a typical setup, HTML files with embedded SSIs have a .shtml file extension, to distinguish them from `pure' HTML files with a .html file extension. According to the documentation:

All directives to the server are formatted as SGML comments within the document. This is in case the document should ever find itself in the client's hands unparsed. Each directive has the following format:

<!--#command tag1="value1" tag2="value2"-->

SSIs provide a way of including deliver-time (or run-time) information within the delivered HTML document, such as the date or text sections that are periodically updated. There is perhaps some controversy about this approach with regard to server load, security, and certainly SGML correctness. These issues will not be addressed here.

While SSIs are a run-time processing feature, preprocessing instructions (PPIs) are proposed here as a compile-time processing feature. Syntactically, in this experimental implementation, they are similar to SSIs.

Preprocessing instructions

PPIs are used to embed data in HTML documents that is then passed to designated filters that translate the data, the result being incorporated in a new HTML document. HTML documents with PPIs will have the file extension .phtml (pre-HTML) and will be called PHTML files. cphtml is a PHTML compiler with the following usage:

cphtml [-s] [doc.phtml]+
where one or more PHTML files appear on the command line. For each file, the result is a corresponding doc.html file, or a doc.shtml file if the -s flag is present.
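
The naming rule can be sketched in a few lines of shell. This is only an illustration of the mapping described above, not code taken from cphtml; the function name output_name and its argument order are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not taken from cphtml): print the output file
# name for one PHTML input. $1 is "-s" or empty; $2 is the input file.
output_name() {
    root=${2%.phtml}            # strip the .phtml extension to get the root
    if [ "$1" = "-s" ]; then
        printf '%s.shtml\n' "$root"
    else
        printf '%s.html\n' "$root"
    fi
}

output_name ""   doc.phtml      # prints doc.html
output_name "-s" doc.phtml      # prints doc.shtml
```

The root computed here (doc) is the same file name root that is later handed to filters.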

The format for embedding PPIs within PHTML files is:

<!--%filter cdata-->

where filter is the program that will process the character data in cdata.

Note: filter cannot contain a space character, and cdata cannot contain the close comment delimiter (--). There must be at least one space character between filter and cdata.
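
The two constraints in this note can be checked mechanically. The following is a minimal sketch under the assumption that the filter name and the character data have already been separated; the function name ppi_wellformed is hypothetical and is not part of cphtml. Rejecting an empty filter name is an added assumption, since the note only rules out spaces.

```shell
#!/bin/sh
# Hypothetical helper (not part of cphtml): succeed only if a PPI's
# filter name ($1) and character data ($2) satisfy the note's constraints.
ppi_wellformed() {
    filter=$1
    cdata=$2
    case $filter in
        ''|*' '*) return 1 ;;   # assumed: non-empty; stated: no space character
    esac
    case $cdata in
        *--*) return 1 ;;       # cdata must not contain the close delimiter --
    esac
    return 0
}

# Example use:
if ppi_wellformed TEX '\(E = mc^2\)'; then
    echo "well-formed"
fi
```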

When cphtml is executed, it finds each PPI within the doc.phtml file. The designated filter program is executed with two additional environment variables:

$PPIDOC=doc
$PPINUM=ppi#

where doc is the file name root (doc in the usage description above) and ppi# is the count of the PPI in the file (starting at 1). The cdata is passed to the program on standard input (EOF is reached at the close comment delimiter --, which cannot appear in cdata). Whatever filter writes on standard output is inserted in the output HTML file.
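
To make this calling convention concrete, here is a minimal sketch of a toy filter and of the way a compiler such as cphtml might invoke it for a single PPI. It shows only the interface described above (environment variables in, character data on standard input, replacement text on standard output) and is not cphtml's actual source. The names upper_filter and run_ppi and the sample values are hypothetical, and the filter is written as a shell function only to keep the sketch self-contained; a real filter would be a separate program.

```shell
#!/bin/sh
# Hypothetical toy filter: upper-cases its input and wraps it in an
# anchor named after the document root and the PPI count.
upper_filter() {
    text=$(tr '[:lower:]' '[:upper:]')      # cdata arrives on standard input
    printf '<A NAME="%s-%s">%s</A>' "$PPIDOC" "$PPINUM" "$text"
}

# Sketch of one invocation: $1 = file name root, $2 = PPI count, $3 = cdata.
run_ppi() {
    printf '%s' "$3" | (
        PPIDOC=$1
        PPINUM=$2
        export PPIDOC PPINUM                # visible to the filter's environment
        upper_filter
    )
}

# Roughly what would happen for <!--%upper_filter hello world--> as the
# first PPI in doc.phtml:
run_ppi doc 1 'hello world'
echo
```

Whatever run_ppi prints is the text that would replace the PPI in the output file.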

Embedding external notations

A typical use of PPIs is to be able to embed external notations in HTML documents, and to designate a filter to convert the external notation into HTML. For example:

<!--%TEX \Large{\(E = mc^2\)}-->
where TEX produces a transparent gif image named doc-ppi#.gif (doc and ppi# being the values of the environment variables passed to TEX) from the given LaTeX data and returns, for example,

<IMG src="doc-ppi#.gif" ALT="doc-ppi#">

This would appear in a viewer as:
PPI-3

The TEX filter can be used to produce inlined transparent gifs for tables and mathematical entities. Here is a shell script for the TEX filter:

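      #!/bin/sh
      # TEX filter: reads LaTeX from standard input, renders it as a
      # transparent gif, and writes an IMG element on standard output.

      DOC=$PPIDOC
      NUM=$PPINUM

      # Wrap the PPI data in a minimal LaTeX document.
      cat > "$DOC-$NUM.tex" << HEAD
      \documentstyle[12pt]{article}
      \unitlength 0.2in
      \thispagestyle{empty}
      \begin{document}
      \noindent
      HEAD
      cat >> "$DOC-$NUM.tex"
      cat >> "$DOC-$NUM.tex" << FOOT
      \end{document}
      FOOT

      # Convert: LaTeX -> dvi -> PostScript -> gif -> transparent gif.
      latex "$DOC-$NUM.tex" > .latex-errors
      dvips "$DOC-$NUM.dvi" > "$DOC-$NUM.ps"
      pstogif "$DOC-$NUM.ps" "$DOC-$NUM.GIF" > .pstogif-errors
      giftrans -t 1 -b 0 "$DOC-$NUM.GIF" > "$DOC-$NUM.gif"

      # Remove intermediate files, then emit the HTML that replaces the PPI.
      rm "$DOC-$NUM.tex" "$DOC-$NUM.dvi" "$DOC-$NUM.ps" \
         "$DOC-$NUM.aux" "$DOC-$NUM.log" "$DOC-$NUM.GIF"
      printf '<IMG SRC="%s" ALT="%s">' "$DOC-$NUM.gif" "$DOC-$NUM"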

Here is an example from a PHTML file where some embedded math appears:


       <P>Euler's equation looks like:
       <P ALIGN=center><A NAME=euler>
         <!--%TEX
            \Large{\[ e^{i\pi} + 1 = 0  \]}
         --></A>
       <P>where <!--%TEX \(e\)--> is the natural 
       logarithm base 
       and <!--%TEX \(i = \sqrt{-1}\)-->. In larger type, this
       looks like <!--%TEX \large{\(i = \sqrt{-1}\)}-->.


This is then translated by cphtml into:

Euler's equation looks like:

PPI-4

where PPI-5 is the natural logarithm base and PPI-6. In larger type, this looks like PPI-7.

An example of embedding a table in HTML (whenever HTML 2.0 is released, tables should be supported natively):


      <P ALIGN=center><A NAME=table1>
      <!--%TEX
          \begin{tabular}{|l|l|r|} \hline\hline
              {\em type} &
              \multicolumn{2}{c|}{\em style} \\ \hline\hline
                smart    & red  & short \\
                rather silly & blue & tall \\ \hline\hline
          \end{tabular}
      --></A>


which becomes a LaTeX-rendered table:

PPI-9

Here is an example of a picture environment:
   <!--%TEX
     % unitlength default is 0.2in
     \begin{picture}(5,5)(0,0)
       \put(2,2){\circle{4}}
       \put(2,2){\vector(1,1){1}}
     \end{picture}
   --> 
which becomes:
PPI-10
PPIs can also be used for picture description languages such as pic. A PIC script (similar to TEX) can turn

           <!--%PIC
           circle rad .25
           spline right 1 then down .5 left 1 then right 1
           circle same
           -->

into
PPI-11.

Other uses

PPIs have uses beyond filtering external notations. As SSIs do at run time, they can execute commands whose output is embedded in the document. For example, cphtml translates
     I work at <!--%/bin/sh echo -n $ORGANIZATION-->.
into
     I work at Texas Instruments.
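
In this PPI the filter is /bin/sh itself, so the character data is simply a small script read from standard input. The sketch below reproduces that by hand. The value given to ORGANIZATION is a sample, the function name expand_org is hypothetical, and printf stands in for echo -n because not every sh implements the -n option.

```shell
#!/bin/sh
# Sample value for illustration; in practice ORGANIZATION would come
# from the author's environment when cphtml is run.
ORGANIZATION='Texas Instruments'
export ORGANIZATION

# Feed a one-line script to /bin/sh on standard input, as would happen
# to the cdata of a PPI whose filter is /bin/sh.
expand_org() {
    printf '%s' 'printf %s "$ORGANIZATION"' | /bin/sh
}

echo "I work at $(expand_org)."
```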
They can also be used to embed expressions from other SGML DTDs (along with an associated filter program to produce HTML) and to implement macros. For example,
    <!--%FWBK A.html B.html-->
could produce

  <A href="A.html"><IMG src="/www2/fwd.gif"></A>
  <A href="B.html"><IMG src="/www2/back.gif"></A>

which can be used at the bottom of pages for forward and backward links.
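
A FWBK filter of this kind could be as small as the following sketch. It is an assumption about how such a macro might be written, not the author's script: it is shown as a shell function named fwbk_filter (hypothetical) so that it can be tried directly, and it takes the two file names from standard input in the order forward, backward.

```shell
#!/bin/sh
# Hypothetical FWBK filter: reads two file names from standard input
# and writes forward and backward links on standard output.
fwbk_filter() {
    # read fails at EOF when the cdata has no trailing newline, but
    # still fills in the variables, so its status is ignored.
    read -r fwd back || :
    printf '<A href="%s"><IMG src="/www2/fwd.gif"></A>\n' "$fwd"
    printf '<A href="%s"><IMG src="/www2/back.gif"></A>\n' "$back"
}

# Example: the cdata of the FWBK instruction above.
printf 'A.html B.html' | fwbk_filter
```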

SGML issues

Embedding PPIs in SGML comment declarations is not the best way to implement this feature, especially as HTML authoring moves to a more SGML-compliant environment. The output of an SGML parser will typically lose all the information in comments, so the data content of PPIs will not be present. For the current prototype of PPIs, following the SSI example seemed the best way. A future PPI specification should use SGML techniques such as marked sections or NOTATION. A PHTML document type could be defined as an HTML document type with an additional tag PP:
<PP FILTER=filter>cdata</PP>

Code source

The following can be viewed in the on-line version of this paper at <http://www.ncsa.uiuc.edu/SDG/IT94/IT94Info.html>.

Conclusion

In summary:

Author

Dr. Philip Thrift received the Ph.D. in Applied Mathematics in 1979 from Brown University and joined Texas Instruments in 1982. In various laboratories at TI, he has done research in image understanding, object-oriented and logic programming, machine learning, database mining and information systems. He has several publications in these areas and holds two patents. He is currently working on networked information system applications.

Contact: thrift@csc.ti.com