This document describes how to increase the performance of Tamino XQuery 4. The following topics are covered:
XQueries perform much faster if indexes can be used. If an element or
attribute is often referred to in a query, it is recommended to define a
standard index upon this item in the schema. If this item is moreover used for
text retrieval, as, for example, with the contains()
function, a
text index is even more appropriate.
An index - although defined in the schema - might not be used when
executing the query. This happens in cases where the connection between the
database items to be retrieved and the schema entry upon which the index is
defined is not recognized. Reasons for this may be an open-content schema or a
path expression in the query that contains element steps, thus passing element
definitions for which an anyType
content model is defined. So for
using indexing from within queries, try to keep the areas of uncertain
structure very limited, i.e. use close-content schemas and only allow
anyType
where absolutely necessary.
Constructors are very helpful to generate XML structures within a query. But used in the wrong way, constructors can degrade the query performance considerably.
First of all, you should avoid constructing XML structures that do not belong to the final query result. The following query gives an example for unnecessary XML construction:
for $a in input()/bib/book where for $b in $a/title where tf:containsText($b,"XQuery") return <true/> return $a
The constructor in the return clause of the nested sub-query creates a
true
element that just indicates the occurrence of a
node that satisfies the given search predicates. It is far better to return the
matching node:
for $a in input()/bib/book where for $b in $a/title where tf:containsText($b,"XQuery") return $b return $a
According to the XQuery draft, the result of an XML construction loses all the type information of the parts it has been constructed from. For example, the following query creates a book list where each entry contains a book element with two attributes holding the title and the publishing year:
for $a in input()/bib/book return <book title="{$a/title}" year="{$a/@year}"/>
To sort the result by the publishing year, a sort by clause can be
appended to the query. Since the year attribute of the constructed year
attribute has lost all of its type information, an xs:integer()
casting function has to be applied to get the correct ordering:
for $a in input()/bib/book return <book title="{$a/title}" year="{$a/@year}"/> sort by(xs:integer(./@year))
A more severe consequence of the construction is that the optimizer is
not able to perform an index-based sorting. This makes the above query very
slow if applied to a big document set. To get around this problem, the sort by
clause should be moved to the for
clause:
for $a in input()/bib/book sort by(./@year) return <book title="{$a/title}" year="{$a/@year}"/>
Due to the order preserving property of an FLWOR expression in XQuery, the ordering created by the sort by clause is maintained. With the newly introduced order by clause, the query can also be stated like this:
for $a in input()/bib/book sort by(./@year) order by $a/@year return <book title="{$a/title}" year="{$a/@year}"/>
Queries with complex search predicates involving "and" and "or" operations are difficult to evaluate efficiently. Most problematic are "or" operations, which may cause index access operations returning big result sets. The number of "or" operations can be reduced if they are applied on comparison predicates that read the same elements or attributes. For example, assuming that we want to get all the entries in a bibliography referencing a book written by Heinrich or Thomas Mann. The query can be stated like this:
for $b in input()/bib/book where $b/author[last = "Mann" and (first ="Heinrich" or first ="Thomas")] return $b
The search predicate of the query can be simplified in the following way:
for $b in input()/bib/book where $b/author[last = "Mann" and first =("Heinrich","Thomas")] return $b
The query compares the content of the "first" element with the sequence ("Heinrich","Thomas"). According to the definition of the general comparison in XQuery, this means that the comparison is successful if the content of the "first" element is either equal to "Heinrich" or "Thomas". The big advantage of using this kind of predicate is that only index access is performed during evaluation.
Tamino 4.4 introduces the index-based evaluation of negated predicates
in XQuery expressions. For negating expressions, XQuery provides the
fn:not()
function. For example, to find those books that do not
have any title with the value "Data on the Web" the
not()
function can be used as shown in the following query:
for $a in input()/bib/book where not($a/title = "Data on the Web") return $a
The following more complex example query retrieves all books except those that were not published in 2000 and written by Dan Suciu:
for $a in input()/bib/book where not($a/@year = 2000 and $a/author[first="Dan" and last="Suciu"]) return $a
Beside comparison predicates, the fn:not()
function is also
useful in combination with text-retrieval predicates. The following query that
retrieves all books that do not have the word "Web" in their title
illustrates this:
for $a in input()/bib/book where not(tf:containsText($a/title, "Web")) return $a
Also, the absence of elements or attributes can be checked via the not function:
for $a in input()/bib/book where not($a/author) return $a
The query finds all books with an empty author list. In summary, index-based evaluation for negated predicates is supported for the following predicate types:
Comparison predicates on elements and attributes
Text-retrieval predicates
Existence checks on elements and attributes
A value range query queries a value range by a conjunctive of two predicates affecting the same node. One predicate specifies a lower bound, and the other specifies an upper bound. For example, the following query finds all books published between 1996 and 2001.
for $a in input()/bib/book let $y := $a/@year where $y >= 1996 and $y<= 2001 return $a
Queries with value range predicates can be optimized by accessing a standard index. Therefore, the value range predicates have to be identified and propagated to a standard index access. For identifying a given conjunction of two comparisons as value range predicate, it has to be verified that both comparisons are acting on the same node. Therefore, the following restrictions hold:
Node expression must be a variable, variables must be equal for both comparisons.
If one comparison is a general one, the node delivered by the variable must not be multiple.
As a consequence of the given restrictions, the value range predicate of the example query can be optimized, if the year element is not multiple concerning the book element.
A position range predicate is a predicate qualifier that selects a range of items from a given input sequence. Therefore, it provides a lower and an upper bound for the position of the items that should be part of the result. The following query provides an example for a position range predicate:
(input()/bib/book) [position() >= 10 and position() < 20]
The query selects those "book" elements with a position greater than or equal to 10 and less than 20. Special range predicates result from implicit lower or upper bounds. The following query selects the first 10 books of the given doctype.
(input()/bib/book) [position() <= 10]
Here the lower bound is implicitly given. In the next query, the upper bound is implicitly given.
(input()/bib/book) [position() >= last() - 10]
The query selects the last ten books in the given doctype. The following list gives some examples of range predicates that can be optimized.
[position() < last() –10]
[position() > last() – 10]
[position () > 10 and position() < 21]
[position () < 21 and position() > 10]
Position range predicates are optimized by reducing the number of documents that have to be read during query execution. The prerequisite is that the cardinality of the document set that is read by scanning an index or a doctype is not changed during post-processing. This means that the query must not contain path expressions that deliver more or less than a single item for each document. Another point is that all search predicates must be completely processed using an index access. Assuming that the following assumptions are true:
Every document of the "bib" doctype holds exactly one
title
element.
A document of the "bib" doctype may hold more than a
single author
element.
There is a standard index on the title
element.
Examples of queries that can be optimized are given in the following list.
(input()/bib/book/title) [position() < last() –10]
(for $b in input()/bib/book where $b/title = "Data on the Web" return $b) [position() < last() –10]
(for $b in input()/bib/book where $b/author = "Dan Suciu" return $b) [position() < last() –10]
Examples for queries that cannot be optimized are:
(input()/bib/book/author) [position() < last() –10]
(let $b := input()/bib/book where $b/title = "Data on the Web" return $b) [position() < last() –10]
Due to XQuery’s ordered data model, the order of "for" clauses in a FLWOR expression is relevant. This means Tamino does not change the order of "for" clauses in a FLWOR expression. Thus you should reorder the "for" clauses manually, if the result order does not matter. A rule of thumb is to start with "for" clauses that iterate expressions that produce small results and that can be evaluated via an index access. For example:
Instead of
for $a in input()/customer, $b in input()/vendor where $b/@vno = $a/vno and $b/vname = "Schmidt" return <customers_of_vendor> { $b/vname } { $b/@vno } { $a/custname } </customers_of_vendor>
Use:
for $b in input()/vendor $a in input()/customer, where $b/@vno = $a/vno and $b/vname = "Schmidt" return <customers_of_vendor> { $b/vname } { $b/@vno } { $a/custname } </customers_of_vendor>
This is because the first variable $a in the first example has to be assigned to all customer documents, whereas the variable $b in the second example has to be assigned only to the vendors with a specific "vname". This is much more efficient assuming there is a standard index on "vname".
Join queries are evaluated by the Tamino XQuery in an index-based nested-loop manner. This works well if the input cardinality is small. But if the cardinality grows, the performance decreases. In order to provide an efficient way to join inputs with big cardinality, Tamino features the index-only join processing. This approach tries to join index entries instead of joining XML documents.
The following query shows a typical use case of the index-only join processing:
for $b in input()/vendor, $a in input()/customer where $b/@vno = $a/vno return <customers_of_vendor> { $b/vname } { $b/@vno } { $a/custname } </customers_of_vendor>
The query joins the two doctypes "vendor" and "customer". The documents of the "vendor" doctype are not filtered by any predicate that can be exploited for creating an index access predicate. Moreover, the join predicate is an equal comparison on elements and attributes that occurs exactly once per document. By just joining the index entries, the expensive reading of a big number of documents can be avoided. This is particular true if you do not retrieve all results, but the first n or last n results by applying a position range filter:
( for $b in input()/vendor, $a in input()/customer where $b/@vno = $a/vno return <customers_of_vendor> { $b/vname } { $b/@vno } { $a/custname } </customers_of_vendor> ) [position() > last() - 10]
The following list specifies the restrictions for join operations that can be processed index-only:
Join is applied on expressions that retrieve a single node per document.
Join predicates have to be specified in the where clause.
Join predicates must be equal comparisons.
If a query does not satisfy the restrictions, join processing falls back to index-based nested-loop approach. Albeit the index-only join processing provides a tremendous performance improvement for a lot of queries, it is not always better than doing index-based nested-loop join. For a certain query it cannot be decided which approach is the better one by the Tamino XQuery processor. Therefore, the user can specify whether or not index-only processing is the appropriate join method. The user-switch to activate the index-only join processing is provided by the "optimization" of the Tamino XQuery pragma via the "join" parameter. The parameter can be set to "index-only" or "default" and activates or de-activates index-only join processing. This means, to activate index-only joining for our example query, it has to be stated like this:
{?optimization join="index-only"?} ( for $b in input()/vendor, $a in input()/customer where $b/@vno = $a/vno return <customers_of_vendor> { $b/vname } { $b/@vno } { $a/custname } </customers_of_vendor> ) [position() > last() - 10]
Tamino provides index-based processing of aggregate functions like min(), max(), count() and distinct-values().
The minimum and the maximum value of an element or attribute can be determined from a defined standard index. For example assuming the following query that retrieves the latest published books from a bibliography:
let $m := max(input()/bib/book/@year) for $a in input()/bib/book where $a/@year = $m return $a
The latest publication year can be retrieved by accessing a standard index defined on the year attribute. The standard index on the year attribute is also used to retrieve all books that were published in the latest publication year.
Please note that the retrieval of the minimum and the maximum is not supported by a text index.
In certain cases, a fn:count()
expression can be optimized
accessing an index or by just counting the documents in a doctype or
collection. Assuming that in a bibliography each "bib" element
contains exactly one "book" element, the following expression can
be evaluated by just counting the number of documents in the "bib"
doctype:
count(input()/bib/book)
If there is a standard index on the "title" element, the following query can be evaluated by just counting the number of matching documents found by the search predicate:
count(for $b in input()/bib/book where $b/title = "Data on the Web" return $b)
A call of the distinct-values() function can be optimized by accessing an existing standard index. For example, assuming that there is a standard index defined on the "title" element, the following query can be evaluated by reading the values from the index:
distinct-values(input()/bib/book/title)
The index-based optimization is not possible for elements and attributes of type xs:string that have a collation defined.
In order to minimize the overhead of user-defined functions, inlining is needed. This means that a function call is replaced by the code of the called function. The advantages are:
Saving the overhead of function calls,
Enabling optimizations across function boundaries.
Function inlining is not possible for recursive function calls. But due to the fact that inlining blows up the code of the calling query or function, it also might be problematic for non recursive functions. Since it is hard to find an optimal inlining strategy, the user is able to control the inlining behavior via the "inline" parameter of the "optimization" XQuery pragma. The argument of the "inline" parameter specifies the inlining strategy. The following strategies are available:
None
Default
Full
The "default" strategy is in charge if no inlining is specified. The following query shows an example of how to apply "full" inlining to a query:
{?optimization inline="full"?} import module namespace math = "http://example.org/math-functions"; math:power(2,2)
If the value of the inlining parameter is "none", no inlining is performed at all.
If the value of the inlining parameter is "default", only those functions are inlined that have an "inline" hint. An "inline" hint is an XQuery processing instruction that is in front of a function declaration. The following module declaration provides an example of how to apply the "inline" hint.
module namespace math = "http://example.org/math-functions"; {?inline?} declare function math:power($b as xs:integer, $e as xs:integer) as xs:integer { if($e <= 0) then 1 else math:power($b,$e - 1) * $b }
If inlining is not possible, the "inline" hint is ignored.
When adhering to the full inlining strategy, all user-defined functions are inlined except for those which directly or indirectly reference themselves.