|
|
|
|
|
|
With XML poised to become the de facto language for representing internet data, accessing and querying this data has become increasingly important. To query XML documents efficiently, we developed the QuiXote query processing system. The first step in designing a query processing system is to define an appropriate data model. QuiXote uses the QNX data model that captures the entire DTD semantics, as well as the structure, content and link information present in an XML document. For querying, QuiXote uses the QNX query language, which is a DTD-based query language for XML.
Various data models have been proposed for XML. The semistructured data model as used in Lore [MAG97] and XML-QL [DFF98] is used for data with no schema. Therefore it is less useful when the XML document has a schema (DTD) associated with it. This model also cannot capture the order among child nodes. Using the relational data model for XML has been studied in [STH99, DFS99]. In [STH99], the DTD is used to obtain the relational schema. But the relational schema can capture only a subset of the DTD semantics. Capturing only a portion of the schema semantics will not result in incorrect query processing. But if updates are to be supported, the data model should capture the entire DTD semantics.
Considering the unsuitability of the existing data models, we define the QNX data model for XML repositories. The QNX data model views an XML repository as a set of <schema, setOfDocuments> pairs. In other words, we partition the set of documents in the repository based on their schema. The schema is viewed in this data model as a graph rooted at the DOCTYPE element. The XML document itself is viewed as a graph rooted at the document root element. This captures the intra-document linking in the document. QNX uses "edges" from a node in one graph to a node in another to capture the inter-document linking.
QuiXote uses the QNX data model for viewing a repository, where each document is viewed as a graph. But this does not require that the documents be stored as graphs. QuiXote stores the XML documents as a tree. Storing an XML document as a tree has many advantages in addition to answering queries on parent-child relationships efficiently - (a) Trees are much easier to handle than graphs. For example, the complexity of computing structural summaries (dataguides) as mentioned in Lore [MAG97] can be exponential for graphs, whereas for trees, such summaries can be computed in linear time. (b) Most storage and compression schemes for XML, such as Millau [GNS99] and PDOM [PDOM] store XML documents as trees. Therefore by storing the XML document as trees, QuiXote maintains the flexibility of using any tree storage and load model. (Presently it uses Millau storage and load model).
But storing XML documents as trees is not very efficient for answering queries on links. This is because queries on intra-document and inter-document links are supported using joins. QuiXote relies on link indexing to improve the efficiency of such queries.
The QNX query language is based on the QNX data model. It supports queries on structure, content as well as link information present in XML documents. The query languages for XML either use path expressions (XQL [RLS98] and Lore [MAG97]) or tree structures (XML-QL [DFF98]). A tree can always be decomposed into a set of paths. Therefore a query language that uses tree structures has the same expressive power as one that supports multiple path expressions. An important factor to consider in designing a query language is its ease of use. Queries that do not specify conditions on siblings typically tend to be shorter if path expressions are used. To specify conditions on siblings, tree structures are easier. Based on our previous experience in designing a pattern match and replace language for XML (PatML [SLR99]) and in using QuiXote to perform mining on XML repositories, we decided that QNX shall use tree-structures.
The QNX data model provides the capability to query on the structure, content and link information present in an XML document. Storing the document as trees allows us to perform most document processsing in linear time. It also improves the efficiency of processing queries on parent-child relationships. But storing the document as trees decreases the efficiency of answering queries on link relationships. We would like to investigate various index techniques for increasing the efficiency of these queries.