Java allows users to define complex data type, like modern languages do. When manipulating XML instances, two different ways can be used: the first one is event based and conforms to the SAX handlers, the second one is tree based and is DOM standard compliant.
In this poster we present an high-level integration of XML to Java, providing XML transducers based on tree view. This gives a homogeneous view of XML within Java code, and and lets users to manipulate such objects like any other ones, without constraint.
Language extension, XML transducers and Embedded XML.
XML[1] Java[6] tools are actually based on two standards. The first one is SAX[12], which is event based . The second one is tree based and called DOM[9]. Both propose only a low level management dedicated to XML instances.
At the opposite, AxE approach is to provide expressive features for XML manipulation. AxE is a simple macro language one could see as Java extension. This brings an expressive level dedicated for XML manipulation. XML manipulations means :
A decomposition must introduce a simple mechanism for information extraction when the instance is complex. In order to provide such a mechanism we introduce a well known concepts in functional programming languages like Objective-Caml[2] : the pattern matching.
The composition is the capability of building XML terms easily. Languages PHP[11] and JSP[10] are essentially based on such concept providing embedded language in HTML. In both cases XML is the main language, and so the constructions are intrinsic. Our approach is slightly different as we consider Java is the reference language, enriched with XML. Nevertheless, as we will see the methodology is in fact the same but well formed XML term is implied by the syntax in AxE in opposition with PHP or JSP where this condition is not statically verified.
The pattern matching introduces a structural decomposition mechanism based on XML tree.
In our approach we define an XML term by :
In the following a tag, the empty term and a text block are called atomic terms in opposition with the sequence which is a set of terms. When performing the pattern matching, such structures must be identified in a natural way.
The pattern matching introduces terms decomposition expressed by a set of rules like :
match $term with <TagName att1=$val1 att2='V2'>$content</> (1) { ... } | <$name att1=$val1 att2='V2'>$content</> (2) { ... } | <_ att1=$val1 att2='V2'>$content</> (3) { ... } | /* nothing */ (4) { ... } | _ when { $term.isCdata() } (5) { ... } end
Rules (1) (2) and (3) introduces tag matching with an attribute att2 with a value V2 and maybe an attribute att1. In the rule (1) the tag name must be TagName, in rules (2) and (3) the tag name is ignored but unified with $name in (2). The rule (4) introduces the empty matching and finally the rule (5) introduces the cdata matching based on conditional matching.
In order to provide expressive, reusable and flexible pattern matching we introduce for tags matching :
The factorization consists of grouping a potential set of tags in only one rule. With this approach the user can group in a natural way a set of tags giving them a same interpretation.
match $term with <(Section|Subsection|Subsubsection) as $tagname> $content </> { ... } end;
In this example the rule matches a Section or a Subsection or a Subsubsection. But in order to retrieve the dynamic tag name an aliasing is introduced through the variable $tagname
The negation introduces the opposite point of view. In fact, while specifications are not required for any XML instance it's quite difficult to write a consistent transducer.
match $term with <~(Section|Subsection|Subsubsection) as $tagname> $content </> { ... } end;
In this example the rule matches all tags else Section, Subsection and Subsubsection. Here $tagname denotes the dynamic name of the matched tag.
The main difference with conventional tools concerns the sequence itself which is explicit. This let the user managing this construction.
match $term with $head $tail { ... } end;
The decomposition of such structure is done by a left-to-right manner. This pattern also matches atomic structures unifying $tail with the empty term. The main problem with this approach concerns the introduction of infinite loop when a pattern matching is partial producing a stack overflow exception during its execution.
For sequence matching regularity is not introduced - in opposition with XDuce[7] - allowing a simple vision of decomposition. Thus the following code
match $term with $pre <table>$table</> $post { ... } end;only matches sequence if the $pre is unified with an atomic term. Then the term
<t1/> <t2/> <table> ... </table> ...is not matched by such pattern.
In the previous section we show features for structural decompositions. But an expressive language dedicated to XML must also provide a intuitive syntax construction. This is done in AxE by the embedded terms. The syntax is the same as patterns without factorization and negation indeed implying the well formed XML term property.
public XML makeHtmlHeader(String title,XML $content) { XML $title = XML.cdata(title); return <? <Html> <Head> <Title>$title</> </> <Body bgcolor='white'>$content</> </> ?>; }This function build an XHTML document with a given content and a title.
The AxE library is equipped by a standard way of reading XML file based on a SAX handler. Such handler builds the corresponding XML tree. Then user can easily read and transform such files.
But with this approach the user can easily build through transducers intermediate XML trees and then can split translation processes. For example in XSLT[3] there is a first level dedicated to XML translation and a second level dedicated to the presentation.
The following example is a naive Word Count used for this document.
import axe.xml.*; import java.util.*; public class xmlwc { private static int wc(XML $term) throws MatchingFailure { match $term with /* EMPTY */ { return 0; } | <_>$content</> { return wc($content); } | _ when { $term.isCdata() } { StringTokenizer token = new StringTokenizer($term.to_string()); return token.countTokens(); } | $head $tail { return wc($head)+wc($tail); } end; } public static void main(String[] argv) { try { XML doc = new XMLHandler().buildFrom(argv[0]); System.out.print("word count ="+wc(doc)); } catch (Exception exn) { exn.printStackTrace(); } } }
Such approach gives an homogeneous way to directly manipulate XML inside programming languages like Java. The matching introduces the capability of information extraction based on pattern while embedded XML introduces proper construction.
A prototype is actually developed on top of Java. XML instances are seen as standard Java objects having their own mechanism for composition and decomposition. The validation of such technology is ongoing, largely based on the development of libraries and tools like :
The main problem concerns the verification of a such transducers. In fact, there is no mechanism for specification and verification. This is still a wide area of investigation as such verifications are not trivial at all, mainly because the regularity of XML types[5][8].