XML transducers in Java

Didier Plaindoux
Fujitsu European Center For Information Technology Ltd.
8 rue Maryse Hilsz parc de la plaine
31500 Toulouse France
+33 (0)5 62 47 58 32
didier@fecit.fr

ABSTRACT

Java allows users to define complex data type, like modern languages do. When manipulating XML instances, two different ways can be used: the first one is event based and conforms to the SAX handlers, the second one is tree based and is DOM standard compliant.

In this poster we present an high-level integration of XML to Java, providing XML transducers based on tree view. This gives a homogeneous view of XML within Java code, and and lets users to manipulate such objects like any other ones, without constraint.

Keywords

Language extension, XML transducers and Embedded XML.

1. INTRODUCTION

XML[1] Java[6] tools are actually based on two standards. The first one is SAX[12], which is event based . The second one is tree based and called DOM[9]. Both propose only a low level management dedicated to XML instances.

At the opposite, AxE approach is to provide expressive features for XML manipulation. AxE is a simple macro language one could see as Java extension. This brings an expressive level dedicated for XML manipulation. XML manipulations means :

a decomposition mechanism and
a composition mechanism

A decomposition must introduce a simple mechanism for information extraction when the instance is complex. In order to provide such a mechanism we introduce a well known concepts in functional programming languages like Objective-Caml[2] : the pattern matching.

The composition is the capability of building XML terms easily. Languages PHP[11] and JSP[10] are essentially based on such concept providing embedded language in HTML. In both cases XML is the main language, and so the constructions are intrinsic. Our approach is slightly different as we consider Java is the reference language, enriched with XML. Nevertheless, as we will see the methodology is in fact the same but well formed XML term is implied by the syntax in AxE in opposition with PHP or JSP where this condition is not statically verified.

2. Decomposition mechanism

The pattern matching introduces a structural decomposition mechanism based on XML tree.

In our approach we define an XML term by :

a tag composed by a name, a set of attributes and a content,
the empty term
a text block called cdata and
a sequence of XML term.

In the following a tag, the empty term and a text block are called atomic terms in opposition with the sequence which is a set of terms. When performing the pattern matching, such structures must be identified in a natural way.

2.1 Matching atomic terms

The pattern matching introduces terms decomposition expressed by a set of rules like :

  match $term with
    <TagName att1=$val1 att2='V2'>$content</>   (1)
       { ... }
  | <$name att1=$val1 att2='V2'>$content</>     (2)
       { ... }
  | <_ att1=$val1 att2='V2'>$content</>         (3)
       { ... }
  | /* nothing */                               (4)
       { ... }
  | _ when { $term.isCdata() }                  (5)
       { ... }
  end

Rules (1) (2) and (3) introduces tag matching with an attribute att2 with a value V2 and maybe an attribute att1. In the rule (1) the tag name must be TagName, in rules (2) and (3) the tag name is ignored but unified with $name in (2). The rule (4) introduces the empty matching and finally the rule (5) introduces the cdata matching based on conditional matching.

In order to provide expressive, reusable and flexible pattern matching we introduce for tags matching :

a factorization and
a negation features

The factorization consists of grouping a potential set of tags in only one rule. With this approach the user can group in a natural way a set of tags giving them a same interpretation.

  match $term with
    <(Section|Subsection|Subsubsection) as $tagname> $content </> 
        { ... }
  end;

In this example the rule matches a Section or a Subsection or a Subsubsection. But in order to retrieve the dynamic tag name an aliasing is introduced through the variable $tagname

The negation introduces the opposite point of view. In fact, while specifications are not required for any XML instance it's quite difficult to write a consistent transducer.

  match $term with
    <~(Section|Subsection|Subsubsection) as $tagname> $content </> 
        { ... }
  end;

In this example the rule matches all tags else Section, Subsection and Subsubsection. Here $tagname denotes the dynamic name of the matched tag.

2.2 Matching sequence

The main difference with conventional tools concerns the sequence itself which is explicit. This let the user managing this construction.

  match $term with
    $head $tail
        { ... }
  end;

The decomposition of such structure is done by a left-to-right manner. This pattern also matches atomic structures unifying $tail with the empty term. The main problem with this approach concerns the introduction of infinite loop when a pattern matching is partial producing a stack overflow exception during its execution.

For sequence matching regularity is not introduced - in opposition with XDuce[7] - allowing a simple vision of decomposition. Thus the following code

  match $term with
    $pre <table>$table</> $post
        { ... }
  end;

only matches sequence if the $pre is unified with an atomic term. Then the term

  <t1/> <t2/> <table> ... </table> ...

is not matched by such pattern.

3. Composition mechanism

In the previous section we show features for structural decompositions. But an expressive language dedicated to XML must also provide a intuitive syntax construction. This is done in AxE by the embedded terms. The syntax is the same as patterns without factorization and negation indeed implying the well formed XML term property.

  public XML makeHtmlHeader(String title,XML $content) 
  {
     XML $title = XML.cdata(title);
     return <? <Html>
                <Head>
                 <Title>$title</>
                </>
                <Body bgcolor='white'>$content</>
               </> ?>;
  }

This function build an XHTML document with a given content and a title.

4. Reentrant capability

The AxE library is equipped by a standard way of reading XML file based on a SAX handler. Such handler builds the corresponding XML tree. Then user can easily read and transform such files.

But with this approach the user can easily build through transducers intermediate XML trees and then can split translation processes. For example in XSLT[3] there is a first level dedicated to XML translation and a second level dedicated to the presentation.

5. A simple example

The following example is a naive Word Count used for this document.

  import axe.xml.*;
  import java.util.*;
  public class xmlwc {
    private static int wc(XML $term) 
        throws MatchingFailure
    {
        match $term with
          /* EMPTY */
            { 
                return 0;
            }
        | <_>$content</>
            {
                return wc($content);
            }
        | _ when { $term.isCdata() }
            {
                StringTokenizer token = 
                     new StringTokenizer($term.to_string());
                return token.countTokens();
            }
        | $head $tail
            {
                return wc($head)+wc($tail);
            }
        end;
    }
    public static void main(String[] argv) 
    {
        try {
            XML doc = new XMLHandler().buildFrom(argv[0]);
            System.out.print("word count ="+wc(doc));
        } catch (Exception exn) {
            exn.printStackTrace();
        }
    }
  }

6. Conclusion and Perspectives

Such approach gives an homogeneous way to directly manipulate XML inside programming languages like Java. The matching introduces the capability of information extraction based on pattern while embedded XML introduces proper construction.

A prototype is actually developed on top of Java. XML instances are seen as standard Java objects having their own mechanism for composition and decomposition. The validation of such technology is ongoing, largely based on the development of libraries and tools like :

a bootstrap for AxE (abstract syntax is in an XML format),
a web engine based on reflexivity and XML documents repository,
an XML editor based on Swing library,
an XML checker based on XML Schema[4] and
an XSLT checker and compiler.

The main problem concerns the verification of a such transducers. In fact, there is no mechanism for specification and verification. This is still a wide area of investigation as such verifications are not trivial at all, mainly because the regularity of XML types[5][8].

8. REFERENCES

Tim Bray, Jean Paoli and C.M. Sperberg-McQueen. Extensible Markup Language (XML) 1.0. W3C Recommendation. http:/www.w3c.org/TR/1998/REC-xml-19980210
Emmanuel Chailloux, Pascal Manoury and Bruno Pagano. Objective Caml. O'Reilly & Associates, 2000.
James Clark. XSL Transformations (XSLT) Version 1.0 W3C Recommendation 16 November 1999. http://www.w3.org/TR/1999/REC-xslt-19991116
David C. Fallside. XML Schema Part 0: Primer. W3C Recommendation. http://www.w3.org/TR/xmlschema-0
Peter Fankhauser, Mary Fernandez, Ashok Malhotra, Michael Rys, Jerome Simeon, Philip Wadler. XQuery 1.0 Formal Semantics. W3C Working Draft 07 June 2001. http://www.w3.org/TR/query-semantics
David Flanagan. Java in a nutshell. O'Reilly & Associates, 1996
Haruo Hasoya and Benjamin C. Pierce. Tree Automata and Pattern Matching. In The 25th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2000.
Haruo Hasoya, Jerome Vouillon and Benjamin C. Pierce. Regular expression types for XML. In International Conference on Functional Programming, 2000.
Arnaud Le Hors, Philippe Le Hégare, Gavin Nicol, Lauren Wood, Mike Champion and Steve Byrne. Document object model (dom) level 3 core specification. W3C Working Draft http://www.w3.org/TR/2002/WD-DOM-Level-3-Core-20020114/
JavaServer Pages(TM) Technology http://java.sun.com/products/jsp
PHP: Hypertext Preprocessor. http://www.php.net
SAX : The simple API for XML. http://www.saxproject.org/