In this context, we demonstrate that XML path summaries are
useful tools for access path selection, and establish efficient
algorithms for building and exploiting them. This leads to very
efficient processing when used in conjunction with a
path-partitioned store, in particular much better than if tag
partitioning is used. Furthermore, we devise an efficient method
for complex tree reconstruction, with much lower memory needs
than existing alternatives. Our algorithms are implemented in the
XSum Java library [6].
Path summaries The path summary of an XML document
is a tree,
having exactly one node for every path in the document
(see Figure 1).
Moreover, for any summary nodes
,
such that
is a child
of
, we record on the edge
-
whether every node on path
has exactly one child on path
(edge
-
labeled 1), or at least one
child on path
(edge
-
labeled +).
We have established experimentally that path summaries are
generally very small (3 to 6 orders of magnitude smaller than the
documents), however, exploiting them may still be challenging,
unless carefully designed algorithms are used [2]. A path summary is built in
time, using
memory [3]. Our implementation gathers
1 and + labels during summary construction, in
time and
memory.
![]() |
Path-partitioned storage model Structural identifiers are assigned to each element in an XML document. A direct comparison of two structural identifiers suffices to decide whether the corresponding elements are structurally related (one is a parent or ancestor of the other) or not. A very popular such scheme consists of assigning (pre,post,depth) numbers to every node [1]. The pre number corresponds to the positional number of the element's begin tag, and the post number corresponds to the number of its end tag in the document. For example, Figure 1(a) depicts (pre,post) IDs next to the elements. The depth number reflects node depth in the document tree (omitted in Figure 1 to avoid clutter).
Based on structural IDs, our first structure contains a compact representation of the XML tree structure. We partition the identifiers according to the data path of the elements. For each path, we create an ID path sequence, which is the sequence of IDs in document order. Figure 1(c) depicts a few ID path sequences resulting from some paths of the sample document in Figure 1(a). Our second structure stores the contents of XML elements, and values of the attributes. We pair such values to an ID of their closest enclosing element identifier. Figure 1(d) shows some such (ID, value) pair sequences for our sample document.
From an XQuery query, we extract a query pattern, as shown in
Figure 2. We distinguish parent-child
edges (single lines) from ancestor-descendent ones (double
lines). Dashed edges represent optional relationships: the
children (resp. descendents) at the lower end of the edge are not
required for an element to match the upper end of the edge. Edges
crossed by a ``[`` connect parent nodes with children that must
be found in the data, but are not returned by the query,
corresponding to navigation steps in path predicates, and in
``where'' XQuery clauses. We call such nodes
existential. Boxed nodes are those which must actually
be returned by the query. In Figure 2(b),
some auxiliary variables $1,
$2 and $3 are introduced for the expressions in the
return clause, and expressions enclosed in existential brackets
.
![]() |
For pattern node, we compute a minimal set of relevant
paths. A path is relevant for node
iff: (
) the
last tag in
agrees with the tag of
(which may also be *); (
)
satisfies the structural conditions
imposed by the
's ancestors, and (
)
has descendents paths in the path
summary, matching all non-optional descendents of the node.
Relevant path sets are organized in a tree, mirroring the
relationships between their corresponding nodes.
The paths relevant to nodes of the pattern in Figure 2(b) appear in Figure 2(c).
The path 14 for the variable $d has
no impact on the query result, because: ()
$d is not required to compute the
query result; (
) it follows from the path
summary that every element on path 12 (relevant for
$i) has exactly one child on path
14 (relevant to $d). Thus, query
evaluation does not need to find bindings for $d, but only $i and
$2, and combine them. We say 14 is
useless.
A path summary may guarantee that every XML element on path 12 has at least one descendent on path 22, if all paths between 12 and 22 are annotated 1 or +. In this case, 22 is a trivial path for the existential node $3. If the annotations between 12 and 20 are also 1 or +, path 20 is also trivial. The execution engine does not need to check, on the stored data, which elements on path 12 have descendents on paths 20 and 22: we know they all do. Thus, paths 20 and 22 are discarded from the set of $3.
XSum [6] provides an
efficient algorithm [2]
computing minimal path sets in time, using
memory.
Access path selection Given an XML fragmentation model
and a path summary, the following generic XML access path
selection strategy applies: () Compute
relevant paths for query pattern nodes. (
)
Compute associated paths for data in every storage structure
(table, view, index etc.) (
) Choose, for
every query pattern node, a storage structure whose associated
paths form a (tight) superset of the node's relevant paths. In
the case of a path-partitioned store, the query plan resulting
from this access path selection method is exemplified in Figure
3 for the query in Figure 2. IDs(
) designates an access to
the sequence of structural IDs on path
,
while IDAndVal(
) accesses the (ID, value) pairs
where IDs identify elements on path
, and
values are text children of such elements. The left semi-join (
) and the left
outer-joins (
) are
structural, i.e. they combine inputs based on
parent-child or ancestor-descendent relationships between the IDs
they contain. This QEP takes good advantage of relevant paths to
access only a very small subset of the data present in an XMark
document.
Element (re)construction The biggest performance issues using a path-partitioned store concern the task of reconstructing complex XML subtrees, since the data has been aggressively partitioned.
A first approach is to adapt the SortedOuterUnion [4] method for exporting
relational data in XML, to a path-partitioned setting with
structural IDs; this is illustrated in Figure 3. In the worst case, the complexity of this
method is I/O complexity, where
is the blocking factor, and its time
complexity is
.
Based on a path summary and a path-partitioned store, we
devised a more efficient approach, using the physical operator
Reconstruct. This operator reads in parallel the ordered
sequences of structural IDs and (ID, value) pairs from all the
paths to recombine, and produces textual output in which XML
markup (tags) and values are concatenated in the right order
[2] (see Figure 2). Interestingly, Reconstruct does not build
intermediary results, thus it has a smaller memory
footprint, namely , where
is the number of paths from which data is combined. For
large documents,
, thus the
Reconstruct is particularly competitive.
Our experimental validation, as well as a full comparison with
related works on XML query processing, can be found in [2].
[1] S. Al-Khalifa, H.V. Jagadish, J.M. Patel, Y. Wu, N. Koudas, and D. Srivastava. Structural joins: A primitive for efficient XML query pattern matching, ICDE, 2002.
[2] A. Arion, A. Bonifati, I. Manolescu, and A. Pugliese. Path summaries and path partitioning in modern XML databases(full version). www-rocq.inria.fr/~ manolesc, 2006.
[3] R. Goldman and J. Widom. Dataguides: Enabling query formulation and optimization in semistructured databases. VLDB, Athens, Greece, 1997.
[4] J. Shanmugasundaram, J. Kiernan, E. Shekita, C. Fan, and J. Funderburk. Querying XML views of relational data. VLDB, 2001.
[5] The XMark benchmark. www.xml-benchmark.org, 2002.
[6] The XSum library. www-rocq.inria.fr/gemo/XSum, 2005.
This document was generated using the LaTeX2HTML translator Version 2002-2-1 (1.70)
Copyright 1993, 1994, 1995, 1996, Nikos Drakos,
Computer Based Learning Unit, University of Leeds.
Copyright 1997, 1998, 1999, Ross Moore, Mathematics
Department, Macquarie University, Sydney.