Querying Linked XML Document Networks in the Web

Wolfgang May

Institut für Informatik, Universität Freiburg, Germany

may@informatik.uni-freiburg.de

Abstract
The W3C XML Linking Language (XLink) provides a powerful means for interlinking XML documents all over the world. From the database (and in general, querying) point of view, elements with linking semantics as specified by XLink can be seen as embedded views in an XML instance. Compared with classical databases, i.e., SQL and relational data, the situation of having links inside the data is new, raising new aspects for query languages: for using such documents, strategies how to handle links are required. There is not yet an official proposal on the interaction of interlinked XML documents using the XLink language and navigation/querying. We investigate a model where interlinked documents are regarded as virtual XML subtrees, i.e., XML views. Several strategies are presented how to handle such subtrees, concerning the timepoint when the link is evaluated, and the evaluation and caching strategies. The evaluation strategies are influenced by capabilities of the linked XML servers. So far, the approach is independent from the actual querying language. The approach is under implementation as an extension of LoPiX [LoPiX], a web-aware system for XML data manipulation.

1. Introduction

XML has been designed and accepted as the framework for semi-structured data where it plays the same role as the relational model for classical databases. As a data format tailored to the use in the Web, XML data is not required to be self-contained on an individual server, but may include links to XML data on other servers. The XML linking semantics is specified in the XML Linking Language (XLink) [XLink] by providing special tags in the xlink: namespace that tell an application that an element is equipped with link semantics. The remote servers may be autonomous, i.e., they are not aware of being referenced by other documents. In general, links specified by using XLink express whatever relationships between resources. We see our work from the database point of view, where information is distributed over several XML instances: we restrict the considerations to the querying aspect, i.e., to links that specify an XML tree that is intended to be used at the position of the link element. Although, many of the considerations carry over to the document area, and to resources described by XML in general. In our context, a link element in an XML document is primarily an embedded view definition, given as a query against another document where suitable information can be found. Thus, this view has to be evaluated when the link element is actually traversed in a query. Working with such documents containing links to remote sources raises several questions:

the size of the complete linked documents may be tremendous (since they may again contain links to further documents), even if perhaps only a small excerpt is needed for an application.
if a link element is found, is it preferable to ``evaluate'' it first, ``materializing'' the answer instead of retaining the link, or is it preferable to regard the link as a virtual tree fragment that is queried on-demand? In both cases, the practicability heavily depends on the services provided by the remote data source. In the virtual case, additional strategies may be used based on metadata information of the remote source to check if the source is relevant for answering the actual query
it is desirable that links are transparent, i.e., queries behave as if the target of the link is declaratively accessible via the link element.

In the presence of internet-wide distributed documents provided by servers that implement different access policies and server-side/client-side querying mechanisms, efficient evaluation of queries requires a well-defined and well-specified cooperation between the involved components. The paper is organized as follows: Since we do not assume that everybody is familiar with the XLink specification, we first review the semantics of links as given in [XLink], also introducing a running example. Section 3 analyzes the correspondences between links and the view concept and investigates navigation along links from the point of view of the data model. The distribution aspect of evaluating queries on the local server or on remote servers is introduced in Section 4, complemented by considerations on prefetching and caching of data in Section 5. We systematically identify characteristic parameters for specifying the handling of links and relate them to the service functionality provided by remote sources in Section 6. An experimental environment is described in Section 7. Section 8 concludes the paper.

2. XML Notions

2.1 Querying XML

Example 1 Consider the following excerpt of the Mondial database [Mondial] for illustrations.

 <!ELEMENT mondial (country+, organization+, ...)>
 <!ELEMENT country (name, city+, ...)>
     <!ATTLIST country car_code ID #REQUIRED
                          capital IDREFS #REQUIRED>
 <!ELEMENT name (#PCDATA)>
 <!ELEMENT city EMPTY>
     <!ATTLIST city id ID #REQUIRED 
                       name CDATA #REQUIRED>
 <!ELEMENT organization (name, members*)>
     <!ATTLIST organization abbrev ID #REQUIRED 
                               seat IDREF #IMPLIED>
 <!ELEMENT members EMPTY> 
     <!ATTLIST members type CDATA #REQUIRED 
                          country IDREFS #REQUIRED> 
 <country car_code=``B'' capital=``cty-B-Brussels''>
     <name> Belgium</name>
     <city id=``cty-B-Brussels'' name=``Brussels''/>
 </country>
 <organization abbrev=``EU'' seat=``cty-B-Brussels''>
     <name> European Union</name> 
     <members type=``member''  country=``A B D E F ...''/>
 </organization>

XPath [XPath] is the common language for addressing node sets in XML documents; we assume that the reader is familiar with XPath. It provides the base for most XML querying languages that extend it with their special constructs (e.g., functional style commands and variables in XSLT [XSLT], and SQL/OQL style (e.g., joins) in Quilt/XQuery [DFF+99, XQuery]). The core XML/XPath concept already provides unidirectional intra-document references by ID/IDREF/IDREFS attributes.

Example 2 The following query is used throughout the paper for illustrating dereferencing: ``Select all names of cities that are seats of an organization and the capital of one of its members.''

XPath uses the id(...) function for dereferencing:

  id(//organization[id(./@seat) =
                    id(id(./members/@country)/@capital)]/@seat)/@name

yields the names of the above-mentioned cities.

The XQuery proposal [XQuery] (which is based on Quilt [CRF00]) is the W3C recommendation for querying XML documents. Its querying fragment embeds XPath syntax into higher-level constructs (similar to SQL/OQL). Although, already the dereferencing functionality of XPath is augmented in Quilt/XQuery with a ``=>'' operator for explicit dereferencing. An additional name test (added by XQuery wrt. Quilt) specifies the expected name of the target element.

Example 3 In XQuery, the above expression is written as

  //organization[@seat=>city = members/@country=>country/@capital=>city]
                /@seat=>city/@name .

Example 4 The complete query given above can be stated in XQuery as follows (returning the organisation's abbreviation and the city's name). We assume that the city's name is unique to have a simpler criterion later when using distributed sources:

   FOR $org IN //organization, 
       $abbrev IN $org/@abbrev, 
       $seatname IN $org/@seat/@name,
       $capname IN $org/members/@country/@capital/@name 
   WHERE $capname = $seatname 
   RETURN <result org = {$abbrev} city = {$seatname}/>

Here the FOR clause defines variables that are bound to the elements of the result set of XPath expressions, yielding a list of variable bindings. The WHERE clause -- similar to SQL/OQL -- states additional conditions which tuples of variable bindings are used for generating a result element for each variable binding in the RETURN clause.

2.2 XPointer and XLink

XPointer

XPointer [XPtr] is a specialized extension of XPath for selecting parts of XML documents -- which are not necessarily sets of nodes. The XPointer concept combines the URL document addressing mechanism with an extension of the XPath mechanism for addressing fragments of the document. XPointer ``hyperlink'' addresses are of the form url#ext-xpath-expr. Thus, everybody can define links from his XML documents to any other XML source. For this work, we restrict ourselves to standard XPath expressions as pointers, i.e., our XPointers are of the form url#xpath-expr. E.g., the following XPointer addresses the country element that has a car_code attribute with value ``B'' in the document with the url www.ourserver.de/Mondial/mondial.xml:

  www.ourserver.de/Mondial/mondial.xml
      #descendant::country[@car_code=``B'']

XML Linking Language (XLink)

XPointers are used in the XML Linking Language (XLink) [XLink] for expressing links between XML documents. Arbitrary elements can be declared to have link semantics by equipping them with an xlink:type attribute and suitable additional attributes and subelements from the xlink: namespace. These specify the behavior of a link, i.e., its activating event and the triggered action. Whereas the HTML behavior is simply ``browse this document'', the XLink language allows for more sophisticated activities (e.g., not to delay the behavior until the link is activated by a user, but evaluate it immediately when the document is parsed). When links are used for querying XML document networks, this functionality has to be extended (as we will do in the subsequent sections). First, we give an overview of the notions defined by the XLink language:

The xlink:type attribute selects between four basic types of linking semantics:

simple: roughly, the semantics known from < A href=``...''>,
extended: the same, allowing for multiple targets,
locator: elements with an xlink:type=``locator'' attribute are used for specifying targets that are then used by extended link elements,
arc: elements with an xlink:type=``arc'' attribute are used for defining bidirectional links between a set of locator elements.

In most cases, the information that some element type is equipped with link semantics is given in the DTD. Here, attributes from the xlink: namespace are declared and some of them are assigned with a fixed value:

   <!ELEMENT linkelement (contentsmodel)>
     <!-- Dependent on the xlink:type, some conditions on
          contentsmodel apply. -->
   <!ATT LIST linkelement
     xlink:type  (simple|extended|locator|arc)  #FIXED ``...''
     xlink:href  CDATA #REQUIRED 
     xlink:actuate  (auto|user)  #IMPLIED 
     xlink:show  (new|parsed|replace)  #IMPLIED 
     xlink:role  CDATA  #IMPLIED 
     xlink:title  CDATA #IMPLIED> 
   where linkelement is any element type of the application to be 
      augmented with link semantics

The xlink: attributes have the following semantics:

xlink:href: selects a target of the individual instance (it ``is'' the link).
xlink:actuate: defines the event on which the link is activated (when browsing).
xlink:show: specifies what happens if the link is activated (when browsing).
xlink:role and xlink:title: application specific.

In the following, links of the types provided by XLink are introduced for a ``distributed'' version of Mondial where all countries, all cities of a country, all organizations, and all memberships are stored in separate files:

mondial-countries.xml (all countries)
mondial-cities-car-code.xml (the cities for each country)
mondial-organizations.xml (all organizations)
mondial-memberships.xml (relates countries and organizations)

Simple Links

A simple link is similar to the HTML < A href=``...''> construct.

Example 5 The @seat attribute of organizations is replaced by a seat subelement which is a simple link:

  <!ELEMENT organization (... seat ...)>
  <!ELEMENT seat EMPTY>
  <!ATTLIST seat xlink:type (simple|extended|locator|arc) #FIXED ``simple''
                    xlink:href CDATA #REQUIRED >

An excerpt of the XML document mondial-organizations.xml looks as follows:

  < organization id=``org-eu''> 
  < seat href=``file:cities-B.xml#//city[@name='Brussels']''/>
     :
  < /organization>

Inline extended Links

Extended links do not have an href attribute, but instead they contain one or more locator elements, each of which is a reference of its own.

Example 6 In the distributed Mondial model, every country element has a simple link element capital, and an extended, multi-target link element cities that contains locators to all its cities:

  <!ELEMENT country (... cities, capital ... )> 
  <!-- (capital is defined similar to organization/seat) --> 
  <!ELEMENT cities (city*)> 
  <!ATTLIST cities xlink:type (simple|extended|locator|arc) #FIXED ``extended'' >

  <!ELEMENT city (#PCDATA)> 
  <!ATTLIST city xlink:type (simple|extended|locator|arc) #FIXED ``locator''
                 xlink:href CDATA #REQUIRED >

An excerpt of the XML document mondial-countries.xml may look as follows:

  <country car_code=``B''> 
    <capital href=``file:cities-B.xml#//city[@name='Brussels']''/>
    <cities>
      <city href=``file:cities-B.xml#//city[@name='Brussels']''> 
          Brussels</city>
      <city href=``file:cities-B.xml#//city[@name='Antwerp']''>
          Antwerp</city>
        :
    </cities> 
  </country>

Out-of-Line-Links

Out-of-line-links allow to create references not only inside documents, but also to create XML instances that consist only of links between other documents. This can be seen as a ``personalized overlay'' on arbitrary existing (autonomous) sources. Here, the link is not a directed link to a node, but a relationship (or a set of relationships) between a set of nodes. Out-of-line-links consist of locator subelements that specify potential endpoints of links, and arc elements that specify the relationships between these endpoints:

  <!ELEMENT linkelement (...|locatorelement|arcelement)*>
  <!ATTLIST linkelement as above >

  <!ELEMENT locatorelement as above >
  <!ATTLIST locatorelement as above, including type, href, id >
  <!ELEMENT arcelement (contentsmodel)> 
  <!ATTLIST arcelement xlink:type (simple|extended|locator|arc) #FIXED ``arc''
                       xlink:from IDREF #REQUIRED 
                       xlink:to   IDREF #REQUIRED >

Example 7 For the distributed Mondial database, the memberships of countries in organizations are stored in mondial-memberships.xml:

  <!ELEMENT memberships (country*,organization*,membership*)> 
  <!ATTLIST memberships xlink:type (simple|extended|locator|arc)  #FIXED ``extended'' >

  <!ELEMENT country EMPTY> 
  <!ATTLIST country 
               xlink:type (simple|extended|locator|arc) #FIXED ``locator''
               xlink:href CDATA #REQUIRED 
               id ID #REQUIRED >

  <!ELEMENT organization EMPTY>
  <!ATTLIST organization 
               xlink:type (simple|extended|locator|arc) #FIXED ``locator''
               xlink:href CDATA #REQUIRED 
               id ID #REQUIRED >

  <!ELEMENT membership EMPTY> 
  <!ATTLIST membership 
               xlink:type (simple|extended|locator|arc) #FIXED ``arc''
               xlink:from IDREF #REQUIRED
               xlink:to   IDREF #REQUIRED 
               membership_type CDATA #REQUIRED >

The file mondial-memberships.xml consists now of only one memberships element:

  <memberships> 
    <country id=``B''
       xlink:href=``.../countries.xml#//country[@car_code='B']''/>
    <country id=``D''
       xlink:href=``.../countries.xml#//country[@car_code='D']''/> 
    <organization id=``org-EU''
       xlink:href=``.../organizations.xml#//organization[@abbrev='EU']''/>
    <organization id=``org-UN''
       xlink:href=``.../organizations.xml#//organization[@abbrev='UN']''/>
    <membership 
      xlink:from=``B'' xlink:to=``org-EU'' membership_type=``member''/>
    <membership 
      xlink:from=``B'' xlink:to=``org-UN'' membership_type=``member''/>
  </memberships>

3. Data Models and Queries for Linked Documents

XPointer and XLink specify how to express inter-document links in XML. There is not yet an official proposal

how to add link semantics to the actual data model, e.g., the DOM or the XML Query Data Model [XMLQDM], and
how to handle links in queries and applications (which in part depends on the data model, but orthogonally, evaluation strategies have to be defined).

In this section, we describe two alternative extensions of the data model with link elements. Evaluation issues, e.g., when and where links are evaluated, are then investigated in Sections 4-6. Each link can be seen as a view definition -- possibly recursively containing further links; in this case, the view may be even infinite (due to cyclic references). Whereas in SQL, a view or a database link appears as a table or a database schema that easily fits with the language syntax and semantics, links as embedded views need one of the following semantic extensions:

extending the abstract data model with a linking construct. In this case, an additional syntactic construct in form of an an explicit navigation operator is required.
the abstract data model makes the links transparent, i.e., the embedded views implicitly become subtrees that can be queried by standard XPath expressions.

3.1 An Explicit Navigation Operator

The query in the Example 2 uses the XQuery [XQuery] syntax with the dereferencing operator for IDREF(S), ``=>nodetest''. Similarly, we define the dereferencing operator ``~>'' for following XPointers in xlink:href attributes for explicit inter-document navigation.

Example 8 Consider again the query given above: The entry point for the query is the document mondial-memberships.xml, iterating over all membership elements (e.g., (``B'',``EU'')),

following the link to the corresponding organization in mondial-organizations.xml, searching for the <organization id=``org-EU''> element, which contains a seat subelement (which is a simple link). The target of the link is a city element in mondial-cities-B.xml whose name attribute has to be retrieved.
following the link to the corresponding country in mondial-countries.xml, searching for the <country car_code=``B''> element, which contains another (simple) xlink subelement capital. The target of the link is the city element
```
    <city name=``Brussels''/>
  
```
in mondial-cities-B.xml; again, the name attribute has to be retrieved and compared with the result from the first search.

For illustrating in which document the current navigation step is applied, the namespaces ms: (in mondial-memberships.xml), org: (in mondial-organizations.xml), ctry: (in mondial-countries.xml), and cty: (in mondial-city-XXX.xml) are used:

 FOR $ms IN document(``mondial-memberships.xml'')/ms:memberships/ms:membership,
     $org IN $ms/@xlink:to=> ms:organization/@xlink:href~>org:organization,
     $abbrev IN $org/@org:abbrev,
     $seatname IN $org/org:seat/@xlink:href~>cty:city/@cty:name, 
     $capname IN $ms/@xlink:from=>ms:country/@xlink:href~>ctry:country/
                        ctry:capital/@xlink:href~>cty:city/@cty:name,
 WHERE $capname = $seatname 
 RETURN < result org = {$abbrev} city = {$seatname}/>

3.2 Transparent Links

In the above example, an explicit operator for dereferencing links has been introduced. Another possibility is to regard link elements to be transparent, i.e., the (abstract) data model should silently replace link elements of the types xlink:simple xlink:locator, and xlink:arc by the result sets of their XPointers (for xlink:type=arc elements, the from and to attributes are replaced by reference attributes whose name is the xlink:role type of the corresponding xlink:locator nodes (if no role is given, the name of the locator node is taken instead)). Note that these partially ``virtual'' elements must accumulate the attributes and contents of the original link element (in case it has non-xlink contents) and of the linked element.

Example 9 With transparent links, the virtually linked XML trees look as follows:
In mondial-memberships.xml, the <country xlink:type=``locator'' ...> elements are ``instantiated'' with the document(``mondial-countries.xml'')/country elements resulting from the xlink:href attribute; analogously for the <organization xlink:type=``locator'' ...> elements. Similarly, In the <membership xlink:type=``arc''> elements, the xlink:from attribute is replaced by a country reference attribute (taking the name of the element referenced by the xlink:from attribute); analogously for the xlink:to attribute:

  <memberships>
   <!-- originally: country xlink:type=``locator''
                            xlink:href=``.../countries.xml'')#//country[@car_code='B']'' --> 
   <country id=``B''> 
     <!-- with attributes and contents of 
          document(``.../countries.xml'')//country[@car_code='B'] --> 
   </country> 
   <!-- originally: organization xlink:type=``locator'' 
                                 xlink:href=``.../organizations.xml#//organization[@abbrev='EU']'' --> 
   <organization id=``org-EU''> 
     <!-- with attributes and contents of 
          document(``.../organizations.xml'')//organization[@abbrev='EU'] --> 
   </organization> 
   <membership
      <!-- originally: xlink:from=``B'' xlink:to=``org-EU'' 
              together with types/element names of the referenced locator elements --> 
      country=``B'' organization=``EU'' membership_type=``member''/>
     :
  </memberships>

Similarly, in mondial-organizations.xml, the <seat xlink:type=``simple''> subelements are resolved by the corresponding document(``mondial-city-XXX.xml'')/city elements:

  <organization abbrev=``EU''> 
    <seat> 
    <!-- href=``file:cities-B.xml#//city[@name='Brussels']''-->
    <!-- attributes and contents of  
         document(``.../mondial-cities-B.xml'')//city[@name='Brussels'] -->
    </seat>
     :
  </organization>

Analogously, the capital subelements in mondial-countries.xml are (virtually) resolved.

Then, the query in the virtual linked trees reads in XQuery syntax as

 FOR $ms IN document(``mondial-memberships.xml'')/ms:memberships/ms:membership,
     $org IN $ms/@ms:organization,                         <!-- dereferencing the arc element -->
     $abbrev IN $org/@org:abbrev, 
     $seatname IN $org/org:seat/@cty:name,                 <!-- dereferencing the simple link -->
     $capname IN $ms/@ms:country/ctry:capital/@cty:name    <!-- dereferencing the simple link -->
 WHERE $capname = $seatname 
 RETURN <result org = {$abbrev} city = {$seatname}/>

Another question is, whether virtual subelements via links should be considered as descendants. For transparency reasons, it is sometimes desirable to regard linked XML instances as one. Then, queries are robust against some restructuring of information servers. On the other hand, the set of descendants may become arbitrary large. Additionally, navigation using // has to resolve (transitively) all links. Thus, it is desirable that the behavior of a link in this aspect can be defined (cf. Section 6.4).

4. View Evaluation and XML Information Server Cooperation

For evaluating XPath expressions that navigate through link elements, there are several possibilities, concerning the issues, when and where the views defined by links are evaluated. Thus, when links are involved in queries, several queries/views have to be evaluated, and possibly combined:

the link url#xpath-expr_xl defines views on different levels:
- wrt. the source document, the whole link is a view (in the sense of a virtual tree extension),
- and wrt. the referenced url url, the XPointer portion xpath-expr_xl defines a view.
in general, a query does not need the whole result of the link, but in course contains a query against it. Thus, the query which is stated against the source document can be decomposed as xpath-expr_q1/xpath-expr_q2 where the result set of xpath-expr_q1 contains the link node and xpath-expr_q2 is a ``subquery'' which is evaluated against the view which is defined by the link.

Thus, for the actual data exchange, several possibilities of materializing one of the views emerge:

transfer the whole contents of url and evaluate xpath-expr_xl/xpath-expr_q2.
submit the query xpath-expr_xl to the server at url who materializes and transfers the result. Then, evaluate xpath-expr_q2 against it.
submit the query xpath-expr_xl/xpath-expr_q2 to the server at url (see Section 4.2). In this case, the views defined by the link remain completely virtual also during evaluation.

When the target of the link resides on a remote server, the actually applicable strategies depend on the provided services (i.e., whether the whole XML document is accessible, and/or what queries can be answered), cf. Section 6.2. We first sketch two potential evaluation strategies, and then identify and investigate systematically the parameters for specifying the behavior of link evaluation.

4.1 Local Evaluation

Current XML querying tools are restricted to evaluate XPath expressions wrt. the DOM-resident XML tree (which may contain several documents). When a query is parsed, the documents occurring in expressions of the form document(url) are added to the DOM to be evaluated. When handling links, the XPointers giving the urls and XPath expressions to be evaluated are not contained in the query, but in the data, thus they are only known at runtime. The XQuery pattern below describes how an actual query xpath-expr_q1/xpath-expr_q2 where the result set of xpath-expr_q1 consists of link elements can be evaluated in Dynamic SQL or JDBC style:

    <!-- evaluate ``FOR X IN xpath-expr_q1/xpath-expr_q2''  -->
    FOR  $xlink IN xpath-expr_q1 
    LET  $url = string-before(``#'',$xpointer), 
         $path = string-after(``#'',$xpointer), 
         $view = document($url)/$path    <!-- XPath expression over variables --> 
    FOR  $x IN $view/xpath-expr_q2
         :

Such queries are not supported by the current querying languages. Using XQueryX, the above behavior can be implemented in XSLT [XSLT].

4.2 Remote Evaluation: Keeping the Views Virtual.

When evaluating xpath-expr_q1/xpath-expr_q2 as above, for a link

  <foo xlink:type=``simple'' href=``url#xpath-expr_xl'' .../>

belonging to the result set of xpath-expr_q1, neither the document at url nor the result set of url#xpath-expr_xl are actually needed, but only used for evaluating xpath-expr_q2} over the result set. Thus, instead the rewritten query

  xpath-expr_xl/xpath-expr_q2}

can evaluated against url to answer the original query. In general, the result will be much smaller than when evaluating xpath-expr_xl -- and the computation is done at the server which is potentially optimized wrt. its database. Example 10 In our example, the source mondial-memberships.xml contains locator elements like

    <country id=``B'' xlink:href=``.../countries.xml#//country[@car_code='B']'' />

that are used when answering the query

  (*)    $capname IN~~$ms/@xlink:from=>ms:country/@xlink:href~>ctry:country/
                                       ctry:capital/@xlink:href~>cty:city/@cty:name

for the above XPointer, the whole ctry:country element with car_code=``B'' is returned, and then queried by

    ctry:capital/@xlink:href~>cty:city/@cty:name~.

Instead, countries.xml can be queried for

    //ctry:country[@ctry:car_code='B']/ctry:capital/@xlink:href~>cty:city/@cty:name

which will in turn query cities-B.xml with

    //cty:city[@cty:name=``Brussels'']/@cty:name

which just returns ``Brussels'' which is also the answer to (*)

XML Information Server Cooperation.

As already stated, the applicability of both strategies depends on the services provided by the remote data source, cf. Section 6.2. As a straightforward strategy, the client will first try to send the rewritten query xpath-expr_xl/xpath-expr_q2 via HTTP to the server at url where it will be answered if the server is an XML database system. If the query is not answered, the client will try to access the XML document, parse it into its own database, and evaluate the query by itself. More involved strategies that are tailored to the needs of a given application are described in the following. Next, we investigate the ``when'' aspect, i.e., if the result sets of link elements should be prefetched when parsing an XML instance, or if links are evaluated on demand.

5. Materializing and Prefetching Results

Similar to relational views, strategies for saving computational effort when answering queries apply to the views induced by links. Note that here, not only computational expenses (either at the local or the remote server) may be saved, but also internet communication can be reduced. Since the required links are already known when parsing an XML instance, already at this stage, the whole virtual tree can be materialized locally: when the view is ``defined'' by parsing an XML document, its extension can be computed (either at the local or the remote server) and stored. Then, queries can be evaluated based on the precomputed view. Note that even the whole remote XML instance can be stored locally, which may reduce the internet communication in case that there are several links that use the same url. In both cases, no internet access is required at query answering time. On the other hand, the actual evaluation of queries wrt. the materialized link views must then be done locally wrt. the cached data. Another possibility is to cache the answer when a query against the remote source via a link is evaluated for the first time. Again, either the view defined by the link (i.e., the result of xpath-expr_xl) can be materialized as a subtree (and used later also for other queries), or the answer to a given query (i.e., xpath-expr_xl/xpath-expr_q2}) is stored, keeping the link subtrees virtual. Similar to relational databases, if an answer or view is cached, the consequence is that subsequent answers use the stored state instead of the current state of the remote source. On the other hand, this guarantees that all queries are evaluated wrt. the same state of the remote source.

Summarizing, when querying links, several parameters should be specifiable:

when is the link evaluated (at parse time or at query answering time),
where does the computation take place (locally or at the remote server), and
which (intermediate) results should be stored and reused.

Furthermore, there are interferences between the above options. In the subsequent section, we propose an extension to the xlink: namespace that allows for specifying the behavior of links wrt. querying. For these extensions, we use the dbxlink namespace (denoting the database aspect of XLink).

6. Specification Parameters for Evaluating Links

XLink [XLink] defines several attributes for link elements that specify the behavior of the link element, i.e., when it should become ``activated'' and what happens then. In the current version, this behavior is tailored to the use of links when browsing, it does not cover the requirements of querying XML instances. As defined above, there are several aspects for specifying the behavior of links wrt. querying:

Activating Event: The activating event of links is considered in XLink with the dbxlink:actuate attribute: the value auto states that the link is activated when it is parsed, whereas user states that it is activated by the user (HTML: clicking). In our context, auto means that the XPointer is evaluated when the node containing it is parsed whereas user denotes that it is evaluated when it is used by a query. We add an dbxlink:actuate=``noaction'' alternative that does nothing: the unresolved link remains -- as a kind of view definition -- in the answer tree, but without the dbxlink:actuate=``noaction'' attribute. The intention here is that the application/user who stated the query should decide (and possibly pay) by himself when to finally resolve the link. Especially, information servers will use this alternative (thus, returning unresolved links to be processed by the client).
Evaluation: As described above, for the evaluation -- both the evaluation of the link when parsing, or the evaluation of the link together with the remaining query on demand -- there are several possibilities:
- dbxlink:eval=``local'' means that the whole target document is requested and xpath-expr_xl (when parsing) or xpath-expr_xl/xpath-expr_q2 (when answering a query) is evaluated locally.
- dbxlink:eval=``distributed'' states that xpath-expr_xl is submitted to the remote server, and the result (which is the view represented by the link) is transferred. When a query is evaluated, the remaining xpath-expr_q2 is evaluated locally.
- Finally, dbxlink:eval=``remote'' means that xpath-expr_xl (when parsing) or xpath-expr_xl/xpath-expr_q2 (when evaluating a query) is submitted to the remote server for evaluation, and only the final result is transferred.
- dbxlink:eval is NMTOKENS, thus, combinations like dbxlink:eval=``remote local'' mean that first, it is tried to make the remote server answer the complete query. If this fails, the document is requested and the evaluation is done locally.
Answer Caching: When computing the result or accessing the target server (for the complete document), the obtained data may be cached:
- dbxlink:cache=``complete'' parses the whole target document (useful in case that there are many links to the document, such that it makes sense to regard it as a part of the database) and stores it (centrally) in the local XML database. When later a query traverses the link, it has only to evaluate xpath-expr_xl on the already parsed document.
- dbxlink:cache=``transparent'' replaces the link by the result set of document(url)/xpath-expr_xl (i.e., materializing the answer as in Example 9).
- dbxlink:cache=``answer'' stores the answer to a query document(url)/xpath-expr_xl/xpath_q2, against a remote source and keeps it somewhere in the database (useful if the same link or query occurs several times).
- default is dbxlink:cache=``none''.
- dbxlink:cache is NMTOKENS, thus, combinations like dbxlink:cache=``complete answer'' mean that the whole document is cached (in case there are many links to it) and also the answers to individual queries are cached.
Note that the caching specifications strongly depend on the evaluation specifications: only such information can be cached that is actually communicated to the server; on the other hand, caching at rough granularity makes only sense if the information is actually used by subsequent query evaluations:
- dbxlink:cache=``complete'' makes only sense for dbxlink:eval=``local'',
- dbxlink:cache=``transparent'' (materializing the link view) makes sense for dbxlink:eval=``local'' and dbxlink:eval=``distributed''.

6.1 Runtime Behavior

The actual evaluation of links happens either when parsing a document into the local XML database, or when answering queries.

Parsing

When parsing a document the links with dbxlink:actuate=``auto'' are evaluated and cached: If dbxlink:cache=``complete'', the whole remote document is requested and parsed; if dbxlink:cache=``transparent'', the view defined by the individual link is materialized as a subtree. In the latter case, the value of dbxlink:eval specifies where the actual computation of document(url)/xpath-expr_xl is carried out. Note that links specified as dbxlink:cache=``answer'' cannot be precomputed at parse time, since there is not yet any query. In the abstract data model, the resulting subtrees are replaced for the link elements, but the actual implementation may store them separately.

Evaluating Queries

When the database is actually used for evaluating queries, the answers that have been cached either from parsing the document, or from previous queries are reused. Specifying dbxlink:cache=``none'' disables caching for a query, e.g., when the remote data sources may change and the current state is required. In other cases, either (i) the link is evaluated against the cached document (for dbxlink:cache=``complete''), or (ii) the materialized subtree that replaced the link is used (for dbxlink:cache=``transparent''), or (iii) the cached answer from a previous evaluation of the same query is used. In all cases, the cached data may also be stored from another query that uses the same remote data source, link, or the same query against the link. If there is not yet a precomputed answer (including the case dbxlink:cache=``none''), the evaluation strategy given by dbxlink:eval specifies how the answer is computed, and according to dbxlink:cache, the intermediate or final results are cached.

Possible Combinations.

The table given in below shows which actions are necessary when a link is traversed the first time, and all subsequent times for all possible combinations of specification parameters.

The following strategies are the cornerstones of the spectrum:

The strategy (dbxlink:actuate=``auto'' dbxlink:cache=``transparent answer'') guarantees the best performance for the query answering tasks: all link views are precomputed as transparent subtrees after parsing the document, and later, also all answers are added to the cache.
The strategies (dbxlink:actuate=``user'' dbxlink:cache=``none'' dbxlink:eval=``remote'') and (dbxlink:actuate=``user'' dbxlink:cache=``answer'' dbxlink:eval=``remote'') are the ``leanest'' ones: only the necessary computations are done (by the remote server). The first variant does not cache anything (minimizing space, and always using the current state of the remote data source), the second one caches the answers (minimizing communication).
On the other hand, specifying (dbxlink:actuate=``auto'' dbxlink:cache=``complete transparent'' dbxlink:eval=``local'') makes the system as independent as possible from the remote source, which is especially useful if transient sources are used, or if the internet connection is not persistent.

	dbxlink:actuate=``auto''
`dbxlink:eval` x `dbxlink:cache`	lc	lt	la	ln	dt	da	dn	ra	rn
use cached document	33	--	--	--	--	--	--	--	--
access remote document	--	--	9-	99	--	--	--	--	--
add document to cache	--	--	--	--	--	--	--	--	--
use transparent subtree	--	11	--	--	11	--	--	--	--
compute link view remote	--	--	--	--	--	1-	11	1-	11
transfer link view	--	--	--	--	--	4-	44	--	--
compute link view locally	33	--	3-	33	--	--	--	--	--
replace link by subtree	--	--	--	--	--	--	--	--	--
use cached answer	--	--	-1	--	--	-1	--	-1	--
compute answer remote	--	--	--	--	--	--	--	1-	11
transfer answer	--	--	--	--	--	--	--	1-	11
compute answer locally	22	22	2-	22	22	2-	22	--	--
add answer to cache	--	--	1-	--	--	1-	--	1-	--

	dbxlink:actuate=``user''
`dbxlink:eval` x `dbxlink:cache`	lc	lt	la	ln	dt	da	dn	ra	rn
use cached document	-3	--	--	--	--	--	--	--	--
access remote document	9-	9-	9-	99	--	--	--	--	--
add document to cache	3-	--	--	--	--	--	--	--	--
use transparent subtree	--	-1	--	--	-1	--	--	--	--
compute link view remote	--	--	--	--	1-	1-	11	1-	11
transfer link view	--	--	--	--	4-	4-	44	--	--
compute link view locally	33	3-	3-	33	--	--	--	--	--
replace link by subtree	--	1-	--	--	1-	--	--	--	--
use cached answer	--	--	-1	--	--	-1	--	-1	--
compute answer remote	--	--	--	--	--	--	--	1-	11
transfer answer	--	--	--	--	--	--	--	1-	11
compute answer locally	22	22	2-	22	22	2-	22	--	--
add answer to cache	--	--	1-	--	--	1-	--	1-	--

1st entry: first time		- : no action required
2nd entry: all subsequent times		n: action required (estimated cost: n)

Note that multigranularity caching (e.g., dbxlink:cache=``lcta'') adds additional variants.

Also, previous operations may have cached useful information (e.g., pages)

Figure 2: Required Actions when Traversing a Link

Using dbxlink:cache=``complete transparent'' provides a double caching which is useful when the target document is referenced by different XPointers, together covering large parts of the target document, and each of the XPointers occurs in several links. Note that the actual runtime behavior can also be influenced by different specifications for different links in a document, e.g., when another link already accessed the complete document or another link has already precomputed the answer.

6.2 Constraints

In practice, the functionality which is provided by the remote servers constrains the possible alternatives:

``Lazy'' Servers

If only a plain XML document (e.g., as ASCII) is provided, the client that initiated the query has to access the document and evaluate the XPointer expression by itself (dbxlink:eval=``local''). Then, the client has to decide if he accepts the link strategies defined in the document.

``Hiding'' Servers

If the referenced document is located on a server that does not publish the whole document, but only answers queries, dbxlink:eval=``local'' (and also dbxlink:cache=``complete'') is not applicable. Note that even servers that answer only a restricted set of XPath queries are possible. In this case, it may be necessary to rewrite queries according to the provided views.

Costs

Additionally, for the user of information services, the costs associated with an information source are relevant:

dbxlink:eval=``local'' (especially, without dbxlink:cache=``complete'') requires transferring the whole XML document may be expensive when the costs depend on the size of data,
Any strategy without dbxlink:cache=``complete'' may be expensive when the costs depend on the number of queries, especially, if no caching at all is used.
dbxlink:eval=``remote'' is expensive wrt. the remote computation time, especially, if the answers are not cached.
Obviously, dbxlink:cache=``none'' is potentially expensive in all cases: with dbxlink:eval=``local'', the transferred volume is excessive (always, the whole XML documents are transferred), and with dbxlink:eval=``remote'', the remote computation time can be considerable.

So far, the strategies have been investigated from the point of view of the ``client''.

6.3 Information Providers

For an information provider, dbxlink:actuate=``noaction'' assures that he will not have any effort (neither computational resources, nor costs) for resolving the links in the documents he provides. Obviously, then, the provider is not able to answer XPath queries, not only because of his decision not to have any expenses wrt. computational resources, but also since the links in the documents are opaque for him. The provider must deny any access with dbxlink:eval=``remote'' and dbxlink:eval=``distributed'' (hopefully, the client specified a fallback to dbxlink:eval=``local'' for links pointing to this server). Thus, clients may only access the whole XML document and to provide an overriding specification for dbxlink:actuate=``noaction'' how to handle links (pointing to third party documents) in this document.

6.4 Data Model

Regarding the views defined by links as subtrees that replace the original link element, the data model, i.e., the definition of the tree, must be ``continued'' with these subtrees. Here, an important decision is whether the elements of the subtree are full descendants wrt. the // operator:

Descendants strategy: XLink does not address this issue. We propose an additional attribute dbxlink:descendants that can be used to specify if the link should be transparent wrt. the descendant axis:
- the link itself may specify if it wants to be transparent wrt. the descendants relation (``yes''/``no''),
- the application may specify which elements in virtual subtrees resulting from transparent links are regarded as descendants: all, none, sameserver (those resulting from links where the target url is on the same server as the source url), or predefined (use the strategy given by the link).

When a link is specified with dbxlink:descendants=``yes'', and dbxlink:descendants=``all'' is specified by the application, the link must be considered whenever it is involved in a descendant navigation step. Thus, it is often preferable to materialize it, or to use additional metadata information to check if it may yield relevant answers (see below).

Referential Integrity.

Another problem -- which is in some sense concerned with the data model -- is the maintenance of referential integrity in an interlinked network of sources: links are expected to point to one or more elements that have a certain connection to the parents of the link elements (e.g., the seats of organizations). Here, the xpath-expr_xl query part of the link has to use values for identifying these elements. In the relational model, this is done using keys, leading to the well-known referential integrity maintenance issues. Since XML Schema also provides a notion of keys and key references, these are natural candidates for use in link references. Thus, it is likely that similar problems will arise.

6.5 Optimizations and Perspectives

Resource Descriptions

If the provider of a referenced XML document also provides a resource description, it is often even possible to decide a priori whether the actual query may be successful. Especially, when links are regarded as transparent wrt. the descendant axis, for expressions of the form ...//nodetest, the simple information whether the linked document can contain elements that satisfy the nodetest can save the expense of querying the linked data source.

Example 11 The XPath expression document(``mondial-countries'')//inflation (which selects all inflation elements by following the links to all documents described above) does not need to follow the links to the city documents since these do not contain any inflation element.

With a more detailed metadata description, e.g., in RDF (Resource Description Format), the evaluation of many queries can be optimized.

XML-Aware Web Caches

In the context of frequently used remote information sources, the current ``local'' caching may create redundant cached copies of views. An obvious optimiziation is to use caches at the remote servers that store the results of frequent queries. Another possibility is to use XML-aware caches technology in the Web which allows to store (url,query,answer) tuples that occur frequently. In contrast to the above alternative, this can -- at least in a restricted way -- be implemented by the user, e.g., in a proxy server for his company. This approach is e.g. investigated in [LM01].

7. LoPiX

Currently, the LoPiX system [LoPiX, May01a] is used for investigating the handling of links in an XPath-based environment. LoPiX is an environment for using XPathLog (an extension of XPath, see [May01b] as a language for XML data integration. LoPiX (cf. Figure 3) consists of a central XPathLog evaluation engine, a storage module that implements the internal data model, an Internet access module, and an optional XML/XPath interface for electronic data interchange with external XML-based systems. For the present paper, the data-driven Web access functionality and its transparent resolving of links is relevant.

Figure 3: LoPiX Architecture

The internal database can be extended by new XML documents depending on information and links (i.e., pure urls or links) found in already known documents: Whenever for some url url, the built-in active method url.parse@(xml) is called, the document located at url is automatically loaded as an XML tree. All link elements with dbxlink:actuate=``auto'' are preprocessed according to the values of dbxlink:cache and dbxlink:eval. When a navigation step traverses a link element such that dbxlink:actuate=``user'', the link is evaluated according to the values of dbxlink:cache and dbxlink:eval. Thus, when querying linked XML sources, the actual query syntax is independent from the values of dbxlink:actuate, dbxlink:cache, and dbxlink:eval since these specifications are completely handled by the evaluation component.

8. Conclusion

Related Work.

In contrast to, e.g., the relational model or previous semistructured data models, XML and XLink provide a construct for seamless integration of view definitions into the data via links. This semantics raises new aspects for querying languages. Nevertheless, the most prominent ones, i.e., XML-QL [DFF+99], and Quilt/XQuery [CRF00,XQuery], do not support links as defined by XLink. XSLT [XSLT] allows for handling links by special patterns (to be specified manually).

Conclusion.

Querying distributed XML documents on the Web by declarative languages requires a more detailed specification of xlink-related attributes. Above, we described what attributes concerning evaluation time and place and result caching could be added favorably to the XLink language. We investigated possible semantics and strategies for handling them in a declarative query language. As a result, we propose a ``transparent'' semantics for XLink elements, similar to views. Especially when information sources are used that can answer XPath queries on their contents, on-demand resolving of links can be used to minimize the answer and to exploit the server-side optimization in an optimal way. In addition to a proposal how to handle links in XML querying frameworks, we have identified the following areas for further investigations: (i) XML/XPath query rewriting, (ii) XML-aware Web caches, and (iii) intra-document and intra-source referential integrity maintenance.

References

[CRF00] D. Chamberlin, J. Robie, and D. Florescu. Quilt: An XML Query Language for Heterogeneous Data Sources. In WebDB 2000, pp. 53-62, 2000.

[DFF+99] A. Deutsch, M. Fernandez, D. Florescu, A. Levy, and D. Suciu. XML-QL: A Query Language for XML. In 8th. WWW Conference, 1999. World Wide Web Consortium Technical Report, NOTE-xml-ql-19980819, www.w3.org/TR/NOTE-xml-ql.

[LM01] P. J. Marrón and G. Lausen. On Processing XML in LDAP. In Intl. Conf. on Very Large Data Bases (VLDB), 2001.

[LoPiX] The LoPiX System. http://www.informatik.uni-freiburg.de/~may/lopix/.

[May01a] W. May. LoPiX: A System for XML Data Integration and Manipulation. In Intl. Conf. on Very Large Data Bases (VLDB), Demonstration Track, 2001.

[May01b] W. May. XPath-Logic and XPathLog: A Logic-Based Approach for Declarative XML Data Manipulation. Habilitation Thesis, Universität Freiburg, Institut für Informatik, 2001. Available from http://www.informatik.uni-freiburg.de/~may/lopix/.

[Mondial] The Mondial Database. http://www.informatik.uni-freiburg.de/~may/Mondial/.

[XLink] XML Linking Language (XLink) Version 1.0. W3C Recommendation. http://www.w3.org/TR/xlink, June 2001.

[XMLQDM] XQuery 1.0 and XPath 2.0 Data Model. W3C Working Draft (work in progress). http://www.w3.org/TR/query-datamodel, June 2001.

[XPath] XML Path Language (XPath) Version 1.0. W3C Recommendation. http://www.w3.org/TR/xpath. Version 1.0 (1999) and 2.0 (2001).

[XPtr] XML Pointer Language (XPointer) Version 1.0. W3C Candidate Recommendation. http://www.w3.org/TR/xptr, Sept. 2001.

[XQuery] XQuery: An XML Query Language (Version 1.0). W3C Working Draft (work in progress). http://www.w3.org/TR/xquery, June 2001.

[XSLT] XSL Transformations (XSLT) Version 1.0. W3C Recommendation. http://www.w3.org/TR/xslt, Nov. 1999.