Advanced Indexes

In addition to the well-known standard and text indexes, Tamino offers the following advanced indexes:

  • unique keys

  • multipath indexes

  • computed indexes

  • compound indexes

  • reference indexes

This document discusses the performance impact of these indexes. You should be familiar with the syntax and concepts of these indexes as described in the Tamino XML Schema Reference Guide and the Tamino XML Schema User Guide.


General Considerations

The purpose of indexes is to improve query performance. However, this comes at the cost of higher disk space consumption and additional effort whenever documents are inserted, modified, or deleted. Before creating an index, consider carefully whether it is really necessary (that is, whether enough queries can benefit from it) and whether these costs can be tolerated.

Unique Keys

From a logical point of view, a unique key is just an assertion: Tamino guarantees that each value of a unique key appears only once within the doctype. Internally, Tamino uses an index for each unique key in order to easily keep track of the already existing values. In addition to their main task of duplicate detection, these indexes are also used during query evaluation. Hence, unique keys can improve query performance.
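The dual role of this internal index (duplicate detection and query support) can be pictured with a small Python sketch. It is only an illustrative model under simplifying assumptions, not Tamino's implementation; the class `UniqueKeyIndex` and its methods are hypothetical names.

```python
class UniqueKeyIndex:
    """Toy model of a unique key: one index entry per key value.

    Hypothetical illustration only, not Tamino code.
    """

    def __init__(self):
        self._entries = {}  # key value -> document id

    def insert(self, doc_id, value):
        # Duplicate detection: reject a value that is already indexed.
        if value in self._entries:
            raise ValueError(f"duplicate key value {value!r}")
        self._entries[value] = doc_id

    def lookup(self, value):
        # Query evaluation: the same structure answers equality predicates.
        return self._entries.get(value)


index = UniqueKeyIndex()
index.insert(1, "alpha")
index.insert(2, "beta")

# A third document that repeats an existing key value is rejected.
try:
    index.insert(3, "alpha")
    rejected = False
except ValueError:
    rejected = True
```

The same dictionary serves both purposes: `insert` uses it to enforce uniqueness, and `lookup` uses it to answer an equality query without scanning documents.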

If a unique key is defined with one component field, a standard index is created for that field. If there are several fields, a compound index is created at the root element. Hence, from a performance point of view, a unique key behaves either like a standard index or like a compound index. The example below shows a schema with unique key definitions, followed by a schema that shows the indexes created by Tamino. Note that Tamino does not modify the original schema; the second schema is shown only to illustrate the index creation.

Schema with unique key definitions:

<?xml version = "1.0" encoding = "UTF-8"?>
<xs:schema xmlns:tsd = "http://namespaces.softwareag.com/tamino/TaminoSchemaDefinition"
 xmlns:xs = "http://www.w3.org/2001/XMLSchema">
  <xs:annotation>
    <xs:appinfo>
      <tsd:schemaInfo name = "unique">
        <tsd:collection name = "MyCollection"></tsd:collection>
        <tsd:doctype name = "A">
          <tsd:logical>
            <tsd:content>closed</tsd:content>
            <tsd:unique name = "simple-key">
              <tsd:field xpath = "D"></tsd:field>
            </tsd:unique>
            <tsd:unique name = "compound-key">
              <tsd:field xpath = "B/@b"></tsd:field>
              <tsd:field xpath = "C"></tsd:field>
            </tsd:unique>
          </tsd:logical>
        </tsd:doctype>
      </tsd:schemaInfo>
    </xs:appinfo>
  </xs:annotation>
  <xs:element name = "A">
    <xs:complexType>
      <xs:sequence>
        <xs:element name = "B">
          <xs:complexType>
            <xs:simpleContent>
              <xs:extension base = "xs:string">
                <xs:attribute name = "b" type = "xs:string" use = "required">
                </xs:attribute>
              </xs:extension>
            </xs:simpleContent>
          </xs:complexType>
        </xs:element>
        <xs:element name = "C" type = "xs:string"></xs:element>
        <xs:element name = "D" type = "xs:string"></xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

Corresponding schema with indexes created by Tamino:

<?xml version = "1.0" encoding = "UTF-8"?>
<xs:schema xmlns:tsd = "http://namespaces.softwareag.com/tamino/TaminoSchemaDefinition"
 xmlns:xs = "http://www.w3.org/2001/XMLSchema">
  <xs:annotation>
    <xs:appinfo>
      <tsd:schemaInfo name = "unique">
        <tsd:collection name = "MyCollection"></tsd:collection>
        <tsd:doctype name = "A">
          <tsd:logical>
            <tsd:content>closed</tsd:content>
            <tsd:unique name = "simple-key">
              <tsd:field xpath = "D"></tsd:field>
            </tsd:unique>
            <tsd:unique name = "compound-key">
              <tsd:field xpath = "B/@b"></tsd:field>
              <tsd:field xpath = "C"></tsd:field>
            </tsd:unique>
          </tsd:logical>
        </tsd:doctype>
      </tsd:schemaInfo>
    </xs:appinfo>
  </xs:annotation>
  <xs:element name = "A">
    <xs:annotation>
      <xs:appinfo>
        <tsd:elementInfo>
          <tsd:physical>
            <tsd:native>
              <tsd:index>
                <tsd:standard>
                  <tsd:field xpath = "B/@b"></tsd:field>
                  <tsd:field xpath = "C"></tsd:field>
                </tsd:standard>
              </tsd:index>
            </tsd:native>
          </tsd:physical>
        </tsd:elementInfo>
      </xs:appinfo>
    </xs:annotation>
    <xs:complexType>
      <xs:sequence>
        <xs:element name = "B">
          <xs:complexType>
            <xs:simpleContent>
              <xs:extension base = "xs:string">
                <xs:attribute name = "b" type = "xs:string" use = "required">
               </xs:attribute>
              </xs:extension>
            </xs:simpleContent>
          </xs:complexType>
        </xs:element>
        <xs:element name = "C" type = "xs:string"></xs:element>
        <xs:element name = "D" type = "xs:string">
          <xs:annotation>
            <xs:appinfo>
              <tsd:elementInfo>
                <tsd:physical>
                  <tsd:native>
                    <tsd:index>
                      <tsd:standard></tsd:standard>
                    </tsd:index>
                  </tsd:native>
                </tsd:physical>
              </tsd:elementInfo>
            </xs:appinfo>
          </xs:annotation>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

Note:
Although Tamino automatically creates indexes in order to implement unique key constraints, it is recommended to explicitly define the corresponding index in the schema if you rely on the performance improvement. Tamino will detect when an index definition matches a unique key constraint, and only one index will be created. The benefit is that such an explicitly defined index will survive if the unique key constraint is modified or removed.

Multipath Index

A multipath index is an index that covers several paths: if each of those paths had its own index, the corresponding multipath index can be seen as the union of those indexes. As a feature, multipath is an add-on option for other indexes. It can be used with standard, compound, and text indexes. See the respective section in the Tamino XML Schema Reference Guide for detailed rules about creating a multipath index.

The multipath feature supports queries in the following scenarios:

  • Highly-connected structures: Global elements or attributes with index are referenced from many places in the schema (which might become a problem as the number of distinct indexes is limited).

  • Recursive structures: Each occurrence of an element or attribute in a recursive structure is to be indexed.

  • Arbitrary path sets: Arbitrary path sets can be combined into one multipath index, if the rules apply (paths have to have the same type of index and the same data types).

The following examples illustrate these scenarios.

Highly-connected Structures

This example schema has several types of chapters, each of which has a title which is defined in a global element. The title has a text index, and instead of defining a separate index for each possible path, one common multipath index is defined which is used for any possible path to the Title element.

Example: Highly-connected Schema

<?xml version = "1.0" encoding = "UTF-8"?>
<xs:schema xmlns:tsd = "http://namespaces.softwareag.com/tamino/TaminoSchemaDefinition"
 xmlns:xs = "http://www.w3.org/2001/XMLSchema">
  <xs:annotation>
    <xs:appinfo>
      <tsd:schemaInfo name = "highly-connected">
        <tsd:collection name = "MyCollection"></tsd:collection>
        <tsd:doctype name = "Document">
          <tsd:logical>
            <tsd:content>closed</tsd:content>
          </tsd:logical>
        </tsd:doctype>
      </tsd:schemaInfo>
    </xs:appinfo>
  </xs:annotation>
  <xs:element name = "Title" type = "xs:string">
    <xs:annotation>
      <xs:appinfo>
        <tsd:elementInfo>
          <tsd:physical>
            <tsd:native>
              <tsd:index>
                <tsd:text>
                  <tsd:multiPath>allTitlesIndex</tsd:multiPath>
                </tsd:text>
              </tsd:index>
            </tsd:native>
          </tsd:physical>
        </tsd:elementInfo>
      </xs:appinfo>
    </xs:annotation>
  </xs:element>
  <xs:element name = "Document">
    <xs:complexType>
      <xs:sequence>
        <xs:element name = "Chapter1">
          <xs:complexType>
            <xs:sequence>
              <xs:element ref = "Title"></xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element name = "Chapter2">
          <xs:complexType>
            <xs:sequence>
              <xs:element ref = "Title"></xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element name = "Chapter3">
          <xs:complexType>
            <xs:sequence>
              <xs:element ref = "Title"></xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

Queries like the following make optimal use of this multipath index (each example query is given first in XQuery syntax and then in X-Query syntax):

declare namespace tf="http://namespaces.softwareag.com/tamino/TaminoFunction"
for $d in input()/Document
where tf:containsText ($d//Title, "some text")
return $d
_XQL = /Document[.//Title ~= "some text"]

This query finds all documents where an arbitrary title, regardless of its path, fulfils the search criterion. The result is found by performing one index lookup. Without the multipath index, there would have to be a separate index for each path, and the results of several index lookups would have to be combined by an OR operation.

The next example evaluates the criterion against one particular path:

declare namespace tf="http://namespaces.softwareag.com/tamino/TaminoFunction"
for $d in input()/Document
where tf:containsText ($d/Chapter1/Title, "some text")
return $d
_XQL = /Document[Chapter1/Title ~= "some text"]

This query also makes use of the multipath index. But as the index has no knowledge about the path in which a particular value occurs, the index can only deliver a superset of the real result. From the viewpoint of the index, the criterion could be fulfilled by Chapter1 or Chapter2 or Chapter3. This superset has to be filtered by post-processing.
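The difference between the two queries can be sketched in Python. This is a simplified model, not Tamino's implementation: the sample documents are invented, text search is reduced to exact string equality, and the helper names `any_title` and `title_at` are hypothetical.

```python
from collections import defaultdict

# Toy documents: doc id -> {path: title text}. Paths follow the example schema.
documents = {
    1: {"Chapter1/Title": "some text", "Chapter2/Title": "other"},
    2: {"Chapter1/Title": "other", "Chapter3/Title": "some text"},
    3: {"Chapter2/Title": "unrelated"},
}

# A multipath index keeps the values of all participating paths in one
# structure and does not record which path a value came from.
multipath_index = defaultdict(set)
for doc_id, titles in documents.items():
    for text in titles.values():
        multipath_index[text].add(doc_id)


def any_title(text):
    """Query on .//Title: a single lookup already gives the exact result."""
    return set(multipath_index.get(text, set()))


def title_at(path, text):
    """Query on one path: the lookup yields a superset that must be filtered."""
    candidates = multipath_index.get(text, set())
    return {d for d in candidates if documents[d].get(path) == text}
```

In this model, `any_title("some text")` returns documents 1 and 2 directly, while `title_at("Chapter1/Title", "some text")` first receives the same two candidates from the index and then has to read both documents to discard document 2.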

Recursive Structures

This example schema defines a chapter that has a title and that may contain a nested chapter. The title has a text index. Without a multipath index, it is impossible to index every possible nesting level: using tsd:which, only a finite number of nesting levels can be explicitly indexed.

Example: Recursive Schema

<?xml version = "1.0" encoding = "UTF-8"?>
<xs:schema xmlns:tsd = "http://namespaces.softwareag.com/tamino/TaminoSchemaDefinition"
 xmlns:xs = "http://www.w3.org/2001/XMLSchema">
  <xs:annotation>
    <xs:appinfo>
      <tsd:schemaInfo name = "recursive">
        <tsd:collection name = "MyCollection"></tsd:collection>
        <tsd:doctype name = "Document">
          <tsd:logical>
            <tsd:content>closed</tsd:content>
          </tsd:logical>
        </tsd:doctype>
      </tsd:schemaInfo>
    </xs:appinfo>
  </xs:annotation>
  <xs:element name = "Document">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref = "Chapter"></xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name = "Chapter">
    <xs:complexType>
      <xs:sequence>
        <xs:element name = "Title" type = "xs:string">
          <xs:annotation>
            <xs:appinfo>
              <tsd:elementInfo>
                <tsd:physical>
                  <tsd:native>
                    <tsd:index>
                      <tsd:text>
                        <tsd:multiPath>nestedTitlesIndex</tsd:multiPath>
                      </tsd:text>
                    </tsd:index>
                  </tsd:native>
                </tsd:physical>
              </tsd:elementInfo>
            </xs:appinfo>
          </xs:annotation>
        </xs:element>
        <xs:element ref = "Chapter" minOccurs = "0"></xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

The queries supported by this multipath index are very similar to those in the highly-connected scenario:

declare namespace tf="http://namespaces.softwareag.com/tamino/TaminoFunction"
for $d in input()/Document
where tf:containsText ($d//Title, "some text")
return $d
_XQL = /Document[.//Title ~= "some text"]

This query finds all documents where an arbitrary title, regardless of its nesting level, fulfils the search criterion. The result is found by performing one index lookup. Without the multipath feature, this query can only be supported by indexes if every actually occurring nesting level of Title is explicitly indexed by a tsd:which statement.

The next example evaluates the criterion against one particular nesting level:

declare namespace tf="http://namespaces.softwareag.com/tamino/TaminoFunction"
for $d in input()/Document
where tf:containsText ($d/Chapter/Chapter/Title, "some text")
return $d
_XQL = /Document[Chapter/Chapter/Title ~= "some text"]

This query also makes use of the multipath index. But as the index has no knowledge about the nesting level at which a particular value occurs, the index can only deliver a superset of the real result: from the viewpoint of the index, the criterion could be fulfilled by Chapter/Title or Chapter/Chapter/Title, and so on. This superset has to be filtered by post-processing.
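The following Python sketch contrasts the two indexing strategies for a recursive structure. It is an illustration under stated assumptions: the sample document is invented, the set `explicit_levels` merely stands in for a finite list of explicitly indexed paths, and `titles_with_depth` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Invented sample document with three nesting levels of Chapter.
doc = ET.fromstring(
    "<Document><Chapter><Title>outer</Title>"
    "<Chapter><Title>middle</Title>"
    "<Chapter><Title>inner</Title></Chapter>"
    "</Chapter></Chapter></Document>"
)


def titles_with_depth(chapter, depth=1):
    """Yield (nesting level, title) for a Chapter and all nested Chapters."""
    yield depth, chapter.findtext("Title")
    nested = chapter.find("Chapter")
    if nested is not None:
        yield from titles_with_depth(nested, depth + 1)


all_titles = list(titles_with_depth(doc.find("Chapter")))

# A multipath index stores every title but forgets the nesting level.
multipath_values = {title for _, title in all_titles}

# Indexing a fixed list of explicit paths covers only those levels
# (here the first two, as a stand-in for a finite set of tsd:which entries).
explicit_levels = {1, 2}
explicit_values = {title for depth, title in all_titles if depth in explicit_levels}
```

The title at the third level is missing from `explicit_values` but present in `multipath_values`. Because `multipath_values` no longer carries the depth, a query on one particular level has to recover it by post-processing.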

Arbitrary Path Sets

The previous examples are based on the use of global elements (which is of course mandatory for recursion). The multipath feature, however, is not restricted to global elements. The following example shows a document that has an introduction with a subtitle, and two chapters with a title (where each title is modeled locally under its parent). Each of these three title definitions has its own multipath definition. As these definitions specify the same multipath label, the schema actually defines one multipath index, with three participating paths.

Example: Arbitrary Path Set

<?xml version = "1.0" encoding = "UTF-8"?>
<xs:schema xmlns:tsd = "http://namespaces.softwareag.com/tamino/TaminoSchemaDefinition"
 xmlns:xs = "http://www.w3.org/2001/XMLSchema">
  <xs:annotation>
    <xs:appinfo>
      <tsd:schemaInfo name = "path-set">
        <tsd:collection name = "MyCollection"></tsd:collection>
        <tsd:doctype name = "Document">
          <tsd:logical>
            <tsd:content>closed</tsd:content>
          </tsd:logical>
        </tsd:doctype>
      </tsd:schemaInfo>
    </xs:appinfo>
  </xs:annotation>
  <xs:element name = "Document">
    <xs:complexType>
      <xs:sequence>
        <xs:element name = "Introduction">
          <xs:complexType>
            <xs:sequence>
              <xs:element name = "Subtitle" type = "xs:string">
                <xs:annotation>
                  <xs:appinfo>
                    <tsd:elementInfo>
                      <tsd:physical>
                        <tsd:native>
                          <tsd:index>
                            <tsd:text>
                              <tsd:multiPath>allTitles</tsd:multiPath>
                            </tsd:text>
                          </tsd:index>
                        </tsd:native>
                      </tsd:physical>
                    </tsd:elementInfo>
                  </xs:appinfo>
                </xs:annotation>
              </xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element name = "Chapter1">
          <xs:complexType>
            <xs:sequence>
              <xs:element name = "Title" type = "xs:string">
                <xs:annotation>
                  <xs:appinfo>
                    <tsd:elementInfo>
                      <tsd:physical>
                        <tsd:native>
                          <tsd:index>
                            <tsd:text>
                              <tsd:multiPath>allTitles</tsd:multiPath>
                            </tsd:text>
                          </tsd:index>
                        </tsd:native>
                      </tsd:physical>
                    </tsd:elementInfo>
                  </xs:appinfo>
                </xs:annotation>
              </xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element name = "Chapter2">
          <xs:complexType>
            <xs:sequence>
              <xs:element name = "Title" type = "xs:string">
                <xs:annotation>
                  <xs:appinfo>
                    <tsd:elementInfo>
                      <tsd:physical>
                        <tsd:native>
                          <tsd:index>
                            <tsd:text>
                              <tsd:multiPath>allTitles</tsd:multiPath>
                            </tsd:text>
                          </tsd:index>
                        </tsd:native>
                      </tsd:physical>
                    </tsd:elementInfo>
                  </xs:appinfo>
                </xs:annotation>
              </xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

Queries similar to the following examples make use of the multipath index:

declare namespace tf="http://namespaces.softwareag.com/tamino/TaminoFunction"
for $d in input()/Document
where    tf:containsText ($d/Introduction/Subtitle, "some text")
      or tf:containsText ($d//Title, "some other text")
return $d
_XQL = /Document[Introduction/Subtitle ~= "some text"
                 or .//Title ~= "some other text"]

declare namespace tf="http://namespaces.softwareag.com/tamino/TaminoFunction"
for $d in input()/Document
where    tf:containsText ($d/Introduction/Subtitle, "some text")
return $d
_XQL = /Document[Introduction/Subtitle ~= "some text"]

declare namespace tf="http://namespaces.softwareag.com/tamino/TaminoFunction"
for $d in input()/Document
where    tf:containsText ($d/Chapter1/Title, "some text")
return $d
_XQL = /Document[Chapter1/Title ~= "some text"]

declare namespace tf="http://namespaces.softwareag.com/tamino/TaminoFunction"
for $d in input()/Document
where     tf:containsText ($d/Chapter1/Title, "some text")
      and tf:containsText ($d/Chapter2/Title, "some other text")
return $d
_XQL = /Document[Chapter1/Title ~= "some text"
                 and Chapter2/Title ~= "some other text"]

In all these cases, post-processing is required to filter the result of the index scan. The reason is again that the index has no knowledge about the path in which a particular value occurs.

Computed Index

A computed index is even more powerful than a multipath index, with the current restriction that a computed index may be neither a text index nor a compound index. Instead of adding the index definition to all nodes (or paths) to be included, as with a multipath index, the computed index refers by QName to an XQuery function that is defined in a module stored in Tamino. This XQuery function may compute one or more index entries based on arbitrary nodes and their values in the XML document being stored in the doctype.

A computed index consists of:

  • an XQuery module defining the indexing function(s)

  • the schema defining the computed indexes referring to the indexing functions

  • an XQuery query that takes advantage of the computed index by calling the indexing function, to which the root node of each document is passed as an argument. The indexing function must be used either in a comparison or in an "order by" clause.

An indexing function must have the following signature:

  • Exactly one parameter of type "node()";

  • The return type is the QName of a known simple type; at the moment it must be a type predefined by XML Schema. Hence, a QName such as "xs:integer" may be specified, optionally followed by an occurrence indicator such as "?" or "*". A return type such as "node()" or "item()", with or without an occurrence indicator, is not acceptable.

The type attribute of tsd:computed, which is typically the same as the declared return type of the indexing function, must specify a simple type that is predefined in XML Schema.
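The idea of an indexing function can be sketched as a Python analogy. In Tamino the indexing function is written in XQuery; the code below only mirrors its shape (one node in, zero or more typed values out) and is not Tamino syntax. The `Order`/`Item` document structure and the function name `total_price` are invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def total_price(root):
    """Hypothetical indexing function: one node in, zero or more integers out."""
    prices = [int(item.get("price")) for item in root.iter("Item")]
    return [sum(prices)] if prices else []


# Invented sample documents: doc id -> root node.
docs = {
    1: ET.fromstring('<Order><Item price="10"/><Item price="5"/></Order>'),
    2: ET.fromstring('<Order><Item price="7"/></Order>'),
    3: ET.fromstring("<Order/>"),
}

# Building the index: the function is evaluated once per stored document.
computed_index = defaultdict(set)
for doc_id, root in docs.items():
    for value in total_price(root):
        computed_index[value].add(doc_id)

# A comparison on the function result becomes a scan over index values
# instead of a re-evaluation of the function for every document.
expensive = {d for value, ids in computed_index.items() if value > 8 for d in ids}
```

The indexed value (a sum over several nodes) does not exist anywhere in the document itself, which is what distinguishes a computed index from indexes that are tied to a path. Document 3 produces no index entry at all, corresponding to an empty return sequence.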

For examples and additional aspects, please refer to the following documentation sections:

  • XML Schema User Guide > Appendix 5: Example Schemas for Indexing

  • XQuery User Guide > Advanced Usage > Defining and Using Modules

  • X-Machine Programming > Maintaining Tamino Indexes

  • X-Machine Programming > Requests using X-Machine Commands > _admin

Compound Index

A compound index combines values from different component fields into one index value. The following schema has a Name element with Firstname, Initial, and Lastname children. There is a compound index located at the Name element, having Firstname, Initial, and Lastname as components (in that sequence).

Example: Compound Index

<?xml version = "1.0" encoding = "UTF-8"?>
<xs:schema xmlns:tsd = "http://namespaces.softwareag.com/tamino/TaminoSchemaDefinition"
 xmlns:xs = "http://www.w3.org/2001/XMLSchema">
  <xs:annotation>
    <xs:appinfo>
      <tsd:schemaInfo name = "compound">
        <tsd:collection name = "MyCollection"></tsd:collection>
        <tsd:doctype name = "Document">
          <tsd:logical>
            <tsd:content>closed</tsd:content>
          </tsd:logical>
        </tsd:doctype>
      </tsd:schemaInfo>
    </xs:appinfo>
  </xs:annotation>
  <xs:element name = "Document">
    <xs:complexType>
      <xs:sequence>
        <xs:element name = "Name" maxOccurs = "unbounded">
          <xs:annotation>
            <xs:appinfo>
              <tsd:elementInfo>
                <tsd:physical>
                  <tsd:native>
                    <tsd:index>
                      <tsd:standard>
                        <tsd:field xpath = "Firstname"></tsd:field>
                        <tsd:field xpath = "Initial"></tsd:field>
                        <tsd:field xpath = "Lastname"></tsd:field>
                      </tsd:standard>
                    </tsd:index>
                  </tsd:native>
                </tsd:physical>
              </tsd:elementInfo>
            </xs:appinfo>
          </xs:annotation>
          <xs:complexType>
            <xs:sequence>
              <xs:element name = "Firstname" type = "xs:string"></xs:element>
              <xs:element name = "Initial" type = "xs:string"></xs:element>
              <xs:element name = "Lastname" type = "xs:string"></xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

Here are some example documents for this schema:

<Document>
  <Name>
    <Firstname>Paul</Firstname>
    <Initial>J</Initial>
    <Lastname>Bloggs</Lastname>
  </Name>
</Document>

<Document>
  <Name>
    <Firstname>Fred</Firstname>
    <Initial>M</Initial>
    <Lastname>Bloggs</Lastname>
  </Name>
  <Name>
    <Firstname>Paul</Firstname>
    <Initial>J</Initial>
    <Lastname>Atkins</Lastname>
  </Name>
</Document>

For the first document, the value (Paul,J,Bloggs) is added to the compound index; for the second document, the values (Fred,M,Bloggs) and (Paul,J,Atkins) are added. The tuple notation is used here only for readability; internally, Tamino uses a compact serialization format. The following query makes use of the compound index:

for $d in input()/Document
for $n in $d/Name
where     $n/Firstname = "Paul"
      and $n/Initial = "J"
      and $n/Lastname = "Bloggs"
return $d

_XQL = /Document[Name[    Firstname = "Paul"
                      and Initial = "J"
                      and Lastname = "Bloggs"] ]

This query finds the first document. The second one does not match because the values Paul and J appear under one Name element and the value Bloggs under another. The query optimizer detects the compound index and scans it for the value (Paul,J,Bloggs), which is composed from the parts given in the query. Thus, the query can be answered by one index lookup, although it consists of several criteria. Without a compound index, each component would need its own standard index (in order to have an index-supported query), and several separate index lookups would be necessary.

Moreover, in this example the performance improvement goes well beyond saving index lookups. The conditions for this are:

  • The compound index is hosted by the Name element, which means that the compound values are built relative to Name,

  • and the Name element has a multiplicity greater than 1.

In other words, the values of the example compound index are grouped by Name elements. Without the compound index, when each component has its own standard index, there is no such grouping, and the index does not know to which occurrence of the Name element a particular value belongs. Thus, when executing the given query against three separate indexes, the index lookup will also find the second document (because all requested values appear somewhere in that document), and a subsequent postprocessing step is needed to find the correct result. This unnecessary reading of the second document is avoided with the compound index.
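The grouping effect can be reproduced with a short Python sketch that uses the two example documents. It is a simplified model of the index contents, not Tamino's implementation; the variable names are illustrative.

```python
from collections import defaultdict

# The two example documents: doc id -> list of (Firstname, Initial, Lastname).
documents = {
    1: [("Paul", "J", "Bloggs")],
    2: [("Fred", "M", "Bloggs"), ("Paul", "J", "Atkins")],
}

compound = defaultdict(set)                       # (first, initial, last) -> doc ids
separate = [defaultdict(set) for _ in range(3)]   # one index per component

for doc_id, names in documents.items():
    for name in names:
        compound[name].add(doc_id)
        for position, value in enumerate(name):
            separate[position][value].add(doc_id)

query = ("Paul", "J", "Bloggs")

# Compound index: one lookup, exact result.
compound_hits = compound.get(query, set())

# Separate indexes: three lookups intersected at document level.
# The result is only a superset, because the grouping by Name is lost.
separate_hits = set.intersection(
    *(separate[position][value] for position, value in enumerate(query))
)
```

Here `compound_hits` contains only document 1, whereas `separate_hits` also contains document 2, which would then have to be read and discarded in a post-processing step.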

This first query example contains predicates for each component of the compound index. But the compound index can also be used if fewer predicates appear in the query. The rules are:

  • The set of predicates in the query has to refer to the components of the compound index from left to right (in definition sequence).

  • The predicates have to be connected by and.

  • The and operation must be in the scope of the location of the compound index (for example, with the compound index on the Name element, the and must combine paths relative to Name).

  • The predicates have to be "=" comparisons, with the exception of the last predicate in definition sequence which may be an arbitrary relational comparison operator.
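One possible reading of these rules can be written down as a small Python function. This is a sketch of the rule as stated above, not of the actual query optimizer; the function names are hypothetical, and all predicates are assumed to be and-connected relative to the Name element.

```python
COMPONENTS = ["Firstname", "Initial", "Lastname"]  # definition sequence


def usable_predicates(predicates):
    """Return the leading components the compound index can evaluate.

    `predicates` maps a component name to its comparison operator.
    """
    usable = []
    for component in COMPONENTS:
        operator = predicates.get(component)
        if operator is None:
            break  # gap in the sequence: later components cannot be used
        usable.append(component)
        if operator != "=":
            break  # a non-equality comparison ends the usable prefix
    return usable


def needs_postprocessing(predicates):
    """True if some predicate is not covered by the index lookup."""
    return len(usable_predicates(predicates)) < len(predicates)
```

For example, predicates on Firstname (`=`) and Lastname (`=`) yield only `["Firstname"]` as usable, so the index delivers a superset and post-processing is required. Predicates on Initial and Lastname alone yield an empty list, meaning the compound index cannot be used at all.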

The following query examples illustrate this rule. The first set of queries makes use of the compound index, and postprocessing is not necessary:

for $d in input()/Document
for $n in $d/Name
where     $n/Firstname = "Paul"
return $d
_XQL = /Document[Name[Firstname = "Paul"] ]

for $d in input()/Document
for $n in $d/Name
where     $n/Firstname > "Paul"
return $d
_XQL = /Document[Name[Firstname > "Paul"] ]

for $d in input()/Document
for $n in $d/Name
where     $n/Firstname = "Paul"
      and $n/Initial = "J"
return $d
_XQL = /Document[Name[    Firstname = "Paul"
                      and Initial = "J"] ]

for $d in input()/Document
for $n in $d/Name
where     $n/Initial = "J"
      and $n/Firstname = "Paul"
      and $n/Lastname < "Bloggs"
return $d

_XQL = /Document[Name[    Firstname = "Paul"
                      and Initial = "J"
                      and Lastname < "Bloggs"] ]

The next set of queries makes use of the compound index, but an additional postprocessing step is needed because the predicates do not fulfill the rule described above. The query optimizer selects those predicates that fulfill the rule in order to find a minimal superset of the final result using the compound index:

for $d in input()/Document
for $n in $d/Name
where     $n/Firstname = "Paul"
      and $n/Lastname = "Bloggs"
return $d
_XQL = /Document[Name[    Firstname = "Paul"
                      and Lastname = "Bloggs"] ]

for $d in input()/Document
for $n in $d/Name
where     $n/Firstname = "Paul"
      and $n/Initial > "J"
      and $n/Lastname > "Bloggs"
return $d
_XQL = /Document[Name[    Firstname = "Paul"
                      and Initial > "J"
                      and Lastname > "Bloggs"] ]

The following query cannot use the compound index because there is no predicate for the first component (Firstname):

for $d in input()/Document
for $n in $d/Name
where     $n/Initial = "J"
      and $n/Lastname = "Bloggs"
return $d
_XQL = /Document[Name[    Initial = "J"
                      and Lastname = "Bloggs"] ]

The following query cannot use the compound index because the and operation is not in the scope of the element hosting the compound index (the Name element):

for $d in input()/Document
where     $d/Name/Firstname = "Paul"
      and $d/Name/Initial = "J"
      and $d/Name/Lastname = "Bloggs"
return $d
_XQL = /Document[    Name/Firstname = "Paul"
                 and Name/Initial = "J"
                 and Name/Lastname = "Bloggs"]

Disk Space Considerations

Compound indexes should be used very carefully if one or more of the components can occur multiple times relative to the element hosting the compound index (in the example above, this would be the case if a Name could contain several Firstname elements). In this case, all possible value combinations (the cross-product) are built and added to the index, so that the index can become very large.
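The growth can be estimated with a few lines of Python. The sample values are invented and assume a hypothetical variant of the schema in which Firstname and Initial may repeat within one Name element.

```python
from itertools import product

# Hypothetical Name element with repeated components.
name = {
    "Firstname": ["Anna", "Maria", "Luisa"],
    "Initial": ["J", "K"],
    "Lastname": ["Bloggs"],
}

# Every combination of component values becomes one compound index entry.
entries = list(product(name["Firstname"], name["Initial"], name["Lastname"]))
```

A single Name element with 3 first names and 2 initials already produces 3 x 2 x 1 = 6 index entries; the number of entries is the product of the component multiplicities, not their sum.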

Reference Index

A reference index consists of two parts:

  • The actual reference index (denoted by tsd:reference) is specified at a particular path in the schema. All document occurrences of that path are then assigned a node ID which is unique across the doctype.

  • Other indexes (standard, text, compound) located below the reference index can refer to that reference node by specifying tsd:refers.

Specifying a reference index makes sense only if

  • the reference node has a multiplicity greater than 1,

  • and there are at least two referencing indexes.

The schema used for compound indexes (simplified by leaving out the Initial element) is now reformulated using a reference index. Firstname has a text index, and Lastname has a standard index, both referring to the Name element:

Example: Reference Index

<?xml version = "1.0" encoding = "UTF-8"?>
<xs:schema xmlns:tsd = "http://namespaces.softwareag.com/tamino/TaminoSchemaDefinition"
 xmlns:xs = "http://www.w3.org/2001/XMLSchema">
  <xs:annotation>
    <xs:appinfo>
      <tsd:schemaInfo name = "reference">
        <tsd:collection name = "MyCollection"></tsd:collection>
        <tsd:doctype name = "Document">
          <tsd:logical>
            <tsd:content>closed</tsd:content>
          </tsd:logical>
        </tsd:doctype>
      </tsd:schemaInfo>
    </xs:appinfo>
  </xs:annotation>
  <xs:element name = "Document">
    <xs:complexType>
      <xs:sequence>
        <xs:element name = "Name" maxOccurs = "unbounded">
          <xs:annotation>
            <xs:appinfo>
              <tsd:elementInfo>
                <tsd:physical>
                  <tsd:native>
                    <tsd:index>
                      <tsd:reference></tsd:reference>
                    </tsd:index>
                  </tsd:native>
                </tsd:physical>
              </tsd:elementInfo>
            </xs:appinfo>
          </xs:annotation>
          <xs:complexType>
            <xs:sequence>
              <xs:element name = "Firstname" type = "xs:string">
                <xs:annotation>
                  <xs:appinfo>
                    <tsd:elementInfo>
                      <tsd:physical>
                        <tsd:native>
                          <tsd:index>
                            <tsd:text>
                              <tsd:refers>/Document/Name</tsd:refers>
                            </tsd:text>
                          </tsd:index>
                        </tsd:native>
                      </tsd:physical>
                    </tsd:elementInfo>
                  </xs:appinfo>
                </xs:annotation>
              </xs:element>
              <xs:element name = "Lastname" type = "xs:string">
                <xs:annotation>
                  <xs:appinfo>
                    <tsd:elementInfo>
                      <tsd:physical>
                        <tsd:native>
                          <tsd:index>
                            <tsd:standard>
                              <tsd:refers>/Document/Name</tsd:refers>
                            </tsd:standard>
                          </tsd:index>
                        </tsd:native>
                      </tsd:physical>
                    </tsd:elementInfo>
                  </xs:appinfo>
                </xs:annotation>
              </xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

Here are two example documents:

<Document>
  <Name>
    <Firstname>Paul</Firstname>
    <Lastname>Bloggs</Lastname>
  </Name>
</Document>

<Document>
  <Name>
    <Firstname>Fred</Firstname>
    <Lastname>Bloggs</Lastname>
  </Name>
  <Name>
    <Firstname>Paul</Firstname>
    <Lastname>Atkins</Lastname>
  </Name>
</Document>

When these documents are stored, each Name element is assigned a unique ID, and the values for the other indexes are built as usual. The semantics of a referencing index, however, are different: while a classic index contains the information "the value 'Bloggs' appears in the document with ino:id 17", a referencing index says "the value 'Bloggs' appears in the Name node with ID 5". Thus, a reference index achieves a grouping effect similar to the one described for compound indexes: in the second document, the values Fred and Bloggs are grouped under the first Name node, and the values Paul and Atkins are grouped under the second Name node.
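The difference between the two kinds of index entries can be sketched in a few lines of Python. This is only an illustrative model, not Tamino's internal implementation: the document IDs 1 and 2, the node IDs starting at 101, and all variable names are assumptions made for this sketch.

```python
from collections import defaultdict

# The two example documents. Each tuple is one Name element
# (Firstname, Lastname). The document IDs 1 and 2 are made up.
documents = {
    1: [("Paul", "Bloggs")],
    2: [("Fred", "Bloggs"), ("Paul", "Atkins")],
}

classic_lastname = defaultdict(set)      # value -> document IDs
referencing_lastname = defaultdict(set)  # value -> Name node IDs
node_to_document = {}                    # Name node ID -> document ID

next_node_id = 101  # arbitrary start, chosen to differ from the document IDs
for doc_id, names in documents.items():
    for firstname, lastname in names:
        node_id = next_node_id
        next_node_id += 1
        node_to_document[node_id] = doc_id
        # Classic entry: "the value appears in document doc_id".
        classic_lastname[lastname].add(doc_id)
        # Referencing entry: "the value appears in Name node node_id".
        referencing_lastname[lastname].add(node_id)

print(dict(classic_lastname))
print(dict(referencing_lastname))
print(node_to_document)
```

In this model, the classic entry for Bloggs holds two document IDs, while the referencing entry holds two Name node IDs that must be mapped back to documents through node_to_document.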

Queries can make use of this scenario if

  • there are predicates on the referencing index that are combined by an and,

  • and the and operator is in the scope of the reference index, as in the following query:

declare namespace tf="http://namespaces.softwareag.com/tamino/TaminoFunction"
for $d in input()/Document
for $n in $d/Name
where     tf:containsText ($n/Firstname, "Paul")
      and $n/Lastname = "Bloggs"
return $d
_XQL = /Document[Name[    Firstname ~= "Paul"
                      and Lastname = "Bloggs"] ]

The index lookups on Firstname and Lastname and the subsequent intersection find the only Name node that fulfills both criteria, so postprocessing is avoided. Without a reference index, the index lookups would find both documents (because the values Paul and Bloggs each appear somewhere in both documents), and only postprocessing would find the correct result.

In such a scenario, query performance improves significantly because the and operation can be performed at the level of the Name element instead of at the document level. At the Name element level, the intersection already delivers the final result; no document is read from disk only to be rejected by the postprocessor (which would happen without a reference index).
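The effect of intersecting at the Name level rather than at the document level can be modelled as follows. This is a rough Python sketch under simplifying assumptions: the document IDs are made up, a (document ID, position) pair stands in for a Name node ID, text matching is reduced to equality, and the function name candidate_documents is hypothetical.

```python
def candidate_documents(documents, firstname, lastname):
    """Return (document-level candidates, Name-level candidates)."""
    doc_first, doc_last = set(), set()    # classic lookups: document IDs
    node_first, node_last = set(), set()  # referencing lookups: node IDs
    node_to_doc = {}
    for doc_id, names in documents.items():
        for position, (first, last) in enumerate(names):
            node = (doc_id, position)  # stand-in for a Name node ID
            node_to_doc[node] = doc_id
            if first == firstname:
                doc_first.add(doc_id)
                node_first.add(node)
            if last == lastname:
                doc_last.add(doc_id)
                node_last.add(node)
    # Classic indexes: intersect document IDs, postprocessing still needed.
    classic = doc_first & doc_last
    # Reference index: intersect node IDs, then map them to documents.
    grouped = {node_to_doc[n] for n in node_first & node_last}
    return classic, grouped


documents = {
    1: [("Paul", "Bloggs")],
    2: [("Fred", "Bloggs"), ("Paul", "Atkins")],
}
classic, grouped = candidate_documents(documents, "Paul", "Bloggs")
print(classic, grouped)  # classic == {1, 2}, grouped == {1}
```

The document-level intersection keeps document 2 as a candidate although no single Name element in it matches both predicates; the Name-level intersection drops it before any document is read.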

The following example queries make use of the reference index, but there is no performance benefit compared to classic indexes.

declare namespace tf="http://namespaces.softwareag.com/tamino/TaminoFunction"
for $d in input()/Document
for $n in $d/Name
where     tf:containsText ($n/Firstname, "Paul")
return $d
_XQL = /Document[Name[Firstname ~= "Paul"]]

This query has only one predicate, thus there is no improvement because there is no intersection on the Name element level. Similarly, there would be no improvement if the query had several predicates combined with or.

The next query uses an and which is not in the scope of the Name element. The intersection is on the document level, and the correct result could also be found by classic indexes without postprocessing.

for $d in input()/Document
where     $d/Name/Firstname = "Paul"
      and $d/Name/Lastname = "Bloggs"
return $d
_XQL = /Document[    Name/Firstname = "Paul"
                 and Name/Lastname = "Bloggs"]

Actually, queries like the latter two should be avoided with a reference index. An index lookup on a referencing index (e.g. Firstname) delivers node IDs of the reference index (Name in this example). These node IDs have to be transformed into document IDs, which is unnecessary overhead if the same result can be achieved with classic indexes. In a "good" reference index scenario, this overhead also exists, but it is far outweighed by the document reads that are saved.

Reference Index versus Compound Index

Both reference indexes and compound indexes improve performance in more or less the same scenario: index values can be grouped relative to a particular node that has a multiplicity greater than 1.

This raises the question of which one should be preferred if both can be applied. The general recommendation is to use a compound index if it satisfies the query requirements, because a reference index incurs more overhead, as described above.

But a compound index is not always feasible. A reference index is more flexible: it can work with all index types (while a compound index is always a standard index), and it can be nested (there may be several levels with tsd:reference).

Selectivity of Compound and Reference Index

As pointed out in the previous sections, the performance improvement that can be achieved with compound and reference indexes depends heavily on the grouping of values relative to particular nodes (the tsd:reference node or the node at which the compound index is defined). The selectivity of a compound or reference index is much higher than that of classic standard indexes if these value groups identify a much smaller result set than would be found without grouping.

To determine the selectivity improvement, two different count queries can be issued. The first one counts the number of documents that represent the query result:

{-- query based on the compound index example --}

count
(
  for $d in input()/Document
  for $n in $d/Name
  where     $n/Firstname = "Paul"
        and $n/Initial = "J"
        and $n/Lastname = "Bloggs"
  return $d
)
_XQL = count (/Document[Name[    Firstname = "Paul"
                             and Initial = "J"
                             and Lastname = "Bloggs"] ] )

{-- query based on the reference index example --}

declare namespace tf="http://namespaces.softwareag.com/tamino/TaminoFunction"
count
(
  for $d in input()/Document
  for $n in $d/Name
  where     tf:containsText ($n/Firstname, "Paul")
        and $n/Lastname = "Bloggs"
  return $d
)

_XQL = count(/Document[Name[    Firstname ~= "Paul"
                            and Lastname = "Bloggs"] ])

The second one counts the number of documents that would have to be read if there were no reference or compound index, and that would then have to be passed to the postprocessor:

{-- query based on the compound index example --}

count
(
  for $d in input()/Document
  where     $d/Name/Firstname = "Paul"
        and $d/Name/Initial = "J"
        and $d/Name/Lastname = "Bloggs"
  return $d
)

_XQL = count (/Document[    Name/Firstname = "Paul"
                        and Name/Initial = "J"
                        and Name/Lastname = "Bloggs"] )

{-- query based on the reference index example --}

declare namespace tf="http://namespaces.softwareag.com/tamino/TaminoFunction"
count
(
  for $d in input()/Document
  where     tf:containsText ($d/Name/Firstname, "Paul")
        and $d/Name/Lastname = "Bloggs"
  return $d
)
_XQL = count(/Document[    Name/Firstname ~= "Paul"
                       and Name/Lastname = "Bloggs"])

If these numbers differ significantly for a representative set of values, this is a good indication that a compound or reference index should be defined (depending on which one is feasible).
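Once both counts are known, the comparison can be reduced to a single figure: the share of candidate documents that would be read only to be rejected by the postprocessor. The helper below is a hypothetical convenience written in Python, and the counts in the usage line are invented for illustration. The text above gives no threshold for what counts as "significantly" different, so none is assumed here.

```python
def rejected_read_share(result_count, candidate_count):
    """Share of document reads that the postprocessor would reject.

    result_count:    value returned by the first count query
                     (predicates grouped under the Name node)
    candidate_count: value returned by the second count query
                     (predicates evaluated at the document level)
    """
    if result_count < 0 or candidate_count < result_count:
        # The grouped query can never match more documents than the
        # document-level query, so this indicates swapped arguments.
        raise ValueError("expected 0 <= result_count <= candidate_count")
    if candidate_count == 0:
        return 0.0  # nothing would be read, so nothing is wasted
    return (candidate_count - result_count) / candidate_count


# Invented counts: 40 true results among 1000 document-level candidates.
share = rejected_read_share(40, 1000)
print(f"{share:.0%} of the candidate documents would be read in vain")
```

A share close to 1 for typical query values suggests that grouping would save most document reads; a share close to 0 suggests that classic indexes are already selective enough.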