Efficient Queries: XQuery

This document describes how to increase the performance of Tamino XQuery 4. The following topics are covered:


Using Indexes

XQueries perform much faster if indexes can be used. If an element or attribute is often referred to in a query, it is recommended to define a standard index upon this item in the schema. If this item is moreover used for text retrieval, as, for example, with the contains() function, a text index is even more appropriate.

An index - although defined in the schema - might not be used when executing the query. This happens in cases where the connection between the database items to be retrieved and the schema entry upon which the index is defined is not recognized. Reasons for this may be an open-content schema or a path expression in the query that contains element steps, thus passing element definitions for which an anyType content model is defined. So for using indexing from within queries, try to keep the areas of uncertain structure very limited, i.e. use close-content schemas and only allow anyType where absolutely necessary.

Constructors

Constructors are very helpful to generate XML structures within a query. But used in the wrong way, constructors can degrade the query performance considerably.

Avoid Unnecessary Constructors

First of all, you should avoid constructing XML structures that do not belong to the final query result. The following query gives an example for unnecessary XML construction:

for      $a in input()/bib/book
where    for $b in $a/title 
         where tf:containsText($b,"XQuery") 
         return <true/>
return $a

The constructor in the return clause of the nested sub-query creates a true element that just indicates the occurrence of a node that satisfies the given search predicates. It is far better to return the matching node:

for      $a in input()/bib/book
where    for $b in $a/title 
         where tf:containsText($b,"XQuery") 
         return $b
return $a

Avoid Sorting the Results of an XML Construction

According to the XQuery draft, the result of an XML construction loses all the type information of the parts it has been constructed from. For example, the following query creates a book list where each entry contains a book element with two attributes holding the title and the publishing year:

for    $a in input()/bib/book
return <book title="{$a/title}" year="{$a/@year}"/> 

To sort the result by the publishing year, a sort by clause can be appended to the query. Since the year attribute of the constructed year attribute has lost all of its type information, an xs:integer() casting function has to be applied to get the correct ordering:

for    $a in input()/bib/book
return <book title="{$a/title}" year="{$a/@year}"/>
sort by(xs:integer(./@year)) 

A more severe consequence of the construction is that the optimizer is not able to perform an index-based sorting. This makes the above query very slow if applied to a big document set. To get around this problem, the sort by clause should be moved to the for clause:

for $a in input()/bib/book 
    sort by(./@year)
return <book title="{$a/title}" year="{$a/@year}"/> 

Due to the order preserving property of an FLWOR expression in XQuery, the ordering created by the sort by clause is maintained. With the newly introduced order by clause, the query can also be stated like this:

for $a in input()/bib/book 
    sort by(./@year)
    order by $a/@year
return <book title="{$a/title}" year="{$a/@year}"/> 

Disjunctive Predicates

Queries with complex search predicates involving "and" and "or" operations are difficult to evaluate efficiently. Most problematic are "or" operations, which may cause index access operations returning big result sets. The number of "or" operations can be reduced if they are applied on comparison predicates that read the same elements or attributes. For example, assuming that we want to get all the entries in a bibliography referencing a book written by Heinrich or Thomas Mann. The query can be stated like this:

for    $b in input()/bib/book
where   $b/author[last = "Mann" and (first ="Heinrich" or first ="Thomas")]
return $b

The search predicate of the query can be simplified in the following way:

for    $b in input()/bib/book
where   $b/author[last = "Mann" and first =("Heinrich","Thomas")]
return $b

The query compares the content of the "first" element with the sequence ("Heinrich","Thomas"). According to the definition of the general comparison in XQuery, this means that the comparison is successful if the content of the "first" element is either equal to "Heinrich" or "Thomas". The big advantage of using this kind of predicate is that only index access is performed during evaluation.

Negated Predicates

Tamino 4.4 introduces the index-based evaluation of negated predicates in XQuery expressions. For negating expressions, XQuery provides the fn:not() function. For example, to find those books that do not have any title with the value "Data on the Web" the not() function can be used as shown in the following query:

for    $a in input()/bib/book
   where not($a/title = "Data on the Web")
return $a

The following more complex example query retrieves all books except those that were not published in 2000 and written by Dan Suciu:

for    $a in input()/bib/book
    where not($a/@year = 2000 and $a/author[first="Dan" and last="Suciu"])
return $a

Beside comparison predicates, the fn:not() function is also useful in combination with text-retrieval predicates. The following query that retrieves all books that do not have the word "Web" in their title illustrates this:

for    $a in input()/bib/book
    where not(tf:containsText($a/title, "Web"))
return $a

Also, the absence of elements or attributes can be checked via the not function:

for   $a in input()/bib/book
    where not($a/author)
return $a 

The query finds all books with an empty author list. In summary, index-based evaluation for negated predicates is supported for the following predicate types:

  • Comparison predicates on elements and attributes

  • Text-retrieval predicates

  • Existence checks on elements and attributes

Value Range Predicates

A value range query queries a value range by a conjunctive of two predicates affecting the same node. One predicate specifies a lower bound, and the other specifies an upper bound. For example, the following query finds all books published between 1996 and 2001.

for    $a in input()/bib/book
    let $y := $a/@year
    where $y >= 1996 and $y<= 2001
return $a

Queries with value range predicates can be optimized by accessing a standard index. Therefore, the value range predicates have to be identified and propagated to a standard index access. For identifying a given conjunction of two comparisons as value range predicate, it has to be verified that both comparisons are acting on the same node. Therefore, the following restrictions hold:

  • Node expression must be a variable, variables must be equal for both comparisons.

  • If one comparison is a general one, the node delivered by the variable must not be multiple.

As a consequence of the given restrictions, the value range predicate of the example query can be optimized, if the year element is not multiple concerning the book element.

Position Range Predicates

A position range predicate is a predicate qualifier that selects a range of items from a given input sequence. Therefore, it provides a lower and an upper bound for the position of the items that should be part of the result. The following query provides an example for a position range predicate:

(input()/bib/book) [position() >= 10 and position() < 20]

The query selects those "book" elements with a position greater than or equal to 10 and less than 20. Special range predicates result from implicit lower or upper bounds. The following query selects the first 10 books of the given doctype.

(input()/bib/book) [position() <= 10]

Here the lower bound is implicitly given. In the next query, the upper bound is implicitly given.

(input()/bib/book) [position() >= last() - 10]

The query selects the last ten books in the given doctype. The following list gives some examples of range predicates that can be optimized.

  • [position() < last() –10]
  • [position() > last() – 10]
  • [position () > 10 and position() < 21]
  •  [position () < 21 and position() > 10]

Position range predicates are optimized by reducing the number of documents that have to be read during query execution. The prerequisite is that the cardinality of the document set that is read by scanning an index or a doctype is not changed during post-processing. This means that the query must not contain path expressions that deliver more or less than a single item for each document. Another point is that all search predicates must be completely processed using an index access. Assuming that the following assumptions are true:

  • Every document of the "bib" doctype holds exactly one title element.

  • A document of the "bib" doctype may hold more than a single author element.

  • There is a standard index on the title element.

Examples of queries that can be optimized are given in the following list.

  • (input()/bib/book/title) [position() < last() –10]
  • (for $b in input()/bib/book where $b/title = "Data on the Web" return
           $b) [position() < last() –10]
  • (for $b in input()/bib/book where $b/author = "Dan Suciu" return $b)
           [position() < last() –10]

Examples for queries that cannot be optimized are:

  • (input()/bib/book/author) [position() < last() –10]
  • (let $b := input()/bib/book where $b/title = "Data on the Web" return
           $b) [position() < last() –10]

Join Ordering

Due to XQuery’s ordered data model, the order of "for" clauses in a FLWOR expression is relevant. This means Tamino does not change the order of "for" clauses in a FLWOR expression. Thus you should reorder the "for" clauses manually, if the result order does not matter. A rule of thumb is to start with "for" clauses that iterate expressions that produce small results and that can be evaluated via an index access. For example:

Instead of

for    $a in input()/customer, 
    $b in input()/vendor
    where   $b/@vno = $a/vno and $b/vname = "Schmidt"  
return
            <customers_of_vendor>
            { $b/vname  }
            { $b/@vno  }
            { $a/custname }
            </customers_of_vendor>

Use:

for    $b in input()/vendor
    $a in input()/customer, 
    where   $b/@vno = $a/vno and $b/vname = "Schmidt"  
return
            <customers_of_vendor>
            { $b/vname  }
            { $b/@vno  }
            { $a/custname }
            </customers_of_vendor>

This is because the first variable $a in the first example has to be assigned to all customer documents, whereas the variable $b in the second example has to be assigned only to the vendors with a specific "vname". This is much more efficient assuming there is a standard index on "vname".

Index-only Joining

Join queries are evaluated by the Tamino XQuery in an index-based nested-loop manner. This works well if the input cardinality is small. But if the cardinality grows, the performance decreases. In order to provide an efficient way to join inputs with big cardinality, Tamino features the index-only join processing. This approach tries to join index entries instead of joining XML documents.

The following query shows a typical use case of the index-only join processing:

for   $b in input()/vendor,
    $a in input()/customer
    where $b/@vno = $a/vno 
return
    <customers_of_vendor>
     { $b/vname  }
     { $b/@vno  }
     { $a/custname }
    </customers_of_vendor>

The query joins the two doctypes "vendor" and "customer". The documents of the "vendor" doctype are not filtered by any predicate that can be exploited for creating an index access predicate. Moreover, the join predicate is an equal comparison on elements and attributes that occurs exactly once per document. By just joining the index entries, the expensive reading of a big number of documents can be avoided. This is particular true if you do not retrieve all results, but the first n or last n results by applying a position range filter:

(
for    $b in input()/vendor,
       $a in input()/customer
       where $b/@vno = $a/vno 
     return
    <customers_of_vendor>
     { $b/vname  }
     { $b/@vno  }
     { $a/custname }
    </customers_of_vendor>
)
[position() > last() - 10]

The following list specifies the restrictions for join operations that can be processed index-only:

  • Join is applied on expressions that retrieve a single node per document.

  • Join predicates have to be specified in the where clause.

  • Join predicates must be equal comparisons.

If a query does not satisfy the restrictions, join processing falls back to index-based nested-loop approach. Albeit the index-only join processing provides a tremendous performance improvement for a lot of queries, it is not always better than doing index-based nested-loop join. For a certain query it cannot be decided which approach is the better one by the Tamino XQuery processor. Therefore, the user can specify whether or not index-only processing is the appropriate join method. The user-switch to activate the index-only join processing is provided by the "optimization" of the Tamino XQuery pragma via the "join" parameter. The parameter can be set to "index-only" or "default" and activates or de-activates index-only join processing. This means, to activate index-only joining for our example query, it has to be stated like this:

{?optimization join="index-only"?}
(
for  $b in input()/vendor,
     $a in input()/customer
     where $b/@vno = $a/vno 
return
    <customers_of_vendor>
     { $b/vname  }
     { $b/@vno  }
     { $a/custname }
    </customers_of_vendor>
)
[position() > last() - 10]

Index-based Processing of Aggregation Function

Tamino provides index-based processing of aggregate functions like min(), max(), count() and distinct-values().

Min and Max

The minimum and the maximum value of an element or attribute can be determined from a defined standard index. For example assuming the following query that retrieves the latest published books from a bibliography:

let $m := max(input()/bib/book/@year)
for $a in input()/bib/book
where $a/@year = $m
return $a

The latest publication year can be retrieved by accessing a standard index defined on the year attribute. The standard index on the year attribute is also used to retrieve all books that were published in the latest publication year.

Please note that the retrieval of the minimum and the maximum is not supported by a text index.

Count

In certain cases, a fn:count() expression can be optimized accessing an index or by just counting the documents in a doctype or collection. Assuming that in a bibliography each "bib" element contains exactly one "book" element, the following expression can be evaluated by just counting the number of documents in the "bib" doctype:

count(input()/bib/book) 

If there is a standard index on the "title" element, the following query can be evaluated by just counting the number of matching documents found by the search predicate:

count(for $b in input()/bib/book where $b/title = "Data on the Web" return $b)

Distinct-values

A call of the distinct-values() function can be optimized by accessing an existing standard index. For example, assuming that there is a standard index defined on the "title" element, the following query can be evaluated by reading the values from the index:

distinct-values(input()/bib/book/title)

The index-based optimization is not possible for elements and attributes of type xs:string that have a collation defined.

Function Inlining

In order to minimize the overhead of user-defined functions, inlining is needed. This means that a function call is replaced by the code of the called function. The advantages are:

  • Saving the overhead of function calls,

  • Enabling optimizations across function boundaries.

Function inlining is not possible for recursive function calls. But due to the fact that inlining blows up the code of the calling query or function, it also might be problematic for non recursive functions. Since it is hard to find an optimal inlining strategy, the user is able to control the inlining behavior via the "inline" parameter of the "optimization" XQuery pragma. The argument of the "inline" parameter specifies the inlining strategy. The following strategies are available:

  • None

  • Default

  • Full

The "default" strategy is in charge if no inlining is specified. The following query shows an example of how to apply "full" inlining to a query:

{?optimization inline="full"?}
import module namespace math = "http://example.org/math-functions";
math:power(2,2)

The None Inlining Strategy

If the value of the inlining parameter is "none", no inlining is performed at all.

The Default Inlining Strategy

If the value of the inlining parameter is "default", only those functions are inlined that have an "inline" hint. An "inline" hint is an XQuery processing instruction that is in front of a function declaration. The following module declaration provides an example of how to apply the "inline" hint.

module namespace math = "http://example.org/math-functions";
{?inline?}
declare function math:power($b as xs:integer, $e as xs:integer) as xs:integer
{
            if($e <= 0) then 1
            else math:power($b,$e - 1) * $b
}

If inlining is not possible, the "inline" hint is ignored.

The Full Inlining Strategy

When adhering to the full inlining strategy, all user-defined functions are inlined except for those which directly or indirectly reference themselves.