Version 8.2.2
 —  Advanced Concepts  —

Indexing

Indexing is one of the principal features that distinguish a database from a simple file system. In a file system, searching for a specific element value usually requires scanning through the whole document set and parsing all the documents. A database system, in contrast, can organize much more efficient access paths by means of indexes.

Tamino can index attributes and simple elements as well as complex elements, and it allows two different categories of index: standard index and text index. Within these categories, various mechanisms can be used, which give rise to simple indexes, multi-path indexes, compound indexes and reference indexes.


Declaring an Index

Tamino's annotations (see Tamino Annotations in XMLSchema::Node Level Definitions) also indicate whether or not document nodes should be indexed, and which type of indexing is to be used:

Standard Indexing

Standard indexes are the classical database indexes. The complete node content is used as the index value. When a document is stored or updated, Tamino inserts standard indexes into a specific index, enabling efficient queries. Tamino can index all XML Schema datatypes.

Text Indexing

Here, Tamino analyzes the content of an indexed node word by word and stores the individual words in an index. We discuss this technique in more detail in Text Retrieval.

Standard and Text Indexing

This may be required in special cases but should, in general, be avoided because of the high overhead.

Indexes can be declared for attributes and for leaf elements, but also for elements that contain child elements. In the latter case the index is constructed from the concatenated text of the element and its child elements, i.e. from the result of the text() function applied to that element. For example, if we had:

<name><first>Louis</first><last>Armstrong</last></name>

and declared an index for <name>, the string "LouisArmstrong" would be used as index value.

The way in which the markup is treated in a given database can be controlled by setting the database property markup as delimiter. This can be done using the Tamino Manager. If the property is set to "yes", markup is treated as a delimiter for text (i.e. it is handled like white space). If it is set to "mixed", markup is not treated as a delimiter in mixed-content elements. If the property is set to the default value "no", markup is not treated as a delimiter.

Indexes are not necessarily required to perform queries. Tamino can interpret any valid query expression, even if an index is not defined. However, this can be costly. For example, the query

jazzMusician[@ID="ColtraneJohn"]

would require Tamino to read all jazzMusician instances, parse them to extract the attribute ID, and compare the ID attribute value of each document with the search string. This may be practicable if we have only a small document base with half a dozen documents, but is out of the question if our database contains thousands of jazzMusician instances.

In contrast, if the attribute ID is declared as an index, Tamino would only read and parse those jazzMusician documents that have "ColtraneJohn" as the value of the attribute ID.

So far, we have only discussed indexes for document nodes that are defined in the schema. But what about markup that is not defined in the schema but only appears in individual document instances, for example in a wildcard? Can Tamino apply indexing to these document extensions as well?

The answer is yes. The optional element tsd:physical/tsd:structureIndex in the tsd:doctype declaration allows Tamino to add nodes found in document instances to its internal repository, and thus to execute queries using such nodes much faster.

graphics/structindex.png

For example, the document instance:

<?xml version="1.0"?>
<jazzMusician
    xmlns="http://www.softwareag.com/tamino/doc/examples/models/jazz/encyclopedia"
    type="instrumentalist"
    ID="GillespieDizzy">
  <name>
    <first>Dizzy</first>
    <last>Gillespie</last>
  </name>
  <birthDate>1917-10-21</birthDate>
  <deathDate>1993-01-06</deathDate>
</jazzMusician>

contains an element that was not declared in the schema: deathDate. Without the structureIndex property, a query on deathDate would take a long time, but setting structureIndex to "FULL" or "CONDENSED" speeds up the query considerably. The difference between "FULL" and "CONDENSED" is that "CONDENSED" only remembers that an element deathDate is used in at least one instance of this document type. In contrast, "FULL" remembers in which documents such an element exists. This requires more memory but may make the query even faster when only a few documents contain this element.

The query jazzMusician[deathDate="1959-07-17"], for example, would have the following effect:

structureIndex="NO"

A full scan of all jazzMusician instances is required.

structureIndex="CONDENSED"

If no jazzMusician instance has a node jazzMusician/deathDate then the query can be answered quickly (no object returned). If at least one instance has a node jazzMusician/deathDate then it is still necessary to scan all jazzMusician instances in order to answer the query.

structureIndex="FULL"

Since many jazz musicians are still alive, the query can be answered faster because Tamino knows which jazzMusician instances have a node jazzMusician/deathDate. Only those instances are scanned.

structureIndex should be set to "NO" when we expect elements that contain random markup (e.g. XHTML). This would be the case for the description element in our style document type.

Note that after an initial set of indexes has been defined, it is possible to add or remove indexes subsequently by using update schema operations, based on the X-Machine command _define.

Top of page

Candidates for Indexes

Generally, we should declare the primary keys of all business objects as standard indexes. In our case, these are the attributes @ID for the business objects jazzMusician, critic, and collaboration, @name for business object style, and @albumNo for business object album. The only business object that does not require an explicit index is review, because we access review objects via URL.

In addition, foreign keys are also prime candidates for indexes. For example, jazzMusician/influence/influences/@ID can be a useful index when we want to find out by whom a given jazz musician was influenced. For example:

jazzMusician[influence/influences/@ID="ParkerCharlie"]

would certainly result in an impressive list of jazzMusician documents, including, for example, those of John Coltrane and Miles Davis. Similarly, we could index jazzMusician/belongsTo/style/@name, too, in order to be able to quickly find all musicians belonging to a particular style.

The general rule is to index all those nodes which are frequently used as access paths for queries, and not to index nodes that are only rarely used in queries. Indexing speeds up queries but it also slows down insert, update and delete operations and consumes disk space. The conceptual model gives us a first hint for the possible access paths. However, later fine tuning (see also Efficient Querying) requires us to study the actual usage pattern of the database before making a final decision about which nodes to index.

Top of page

Composite Keys

Finally, we have to consider the case when a primary key is a composite, i.e. it consists of several nodes. For example, this would be the case if we use the node name as a primary key for business object jazzMusician. This node comprises the child elements first, middle, last. We can choose to create indexes for all of them or only for one or two. We could also use a Tamino compound index to integrate all three elements in a single index. In our case, it might be sufficient to create an index only for the element last, because this element can narrow down a search sufficiently. Querying for Pat Metheny with an XPath expression would then look like this:

jazzMusician[name/last="Metheny"
             and name/first="Pat"
             and not(name/middle)]

(Note that the expression name/middle="" would not work because it would require that the element middle exists.)

Top of page

Object Identity

Keys derived from document elements or attributes are only one of several methods to locate a document:

Top of page

Text Retrieval

Text retrieval is one of the facilities where Tamino's query languages X-Query and XQuery extend the functionality of the W3C standards XPath and XQuery. Text retrieval means that the content of a node is broken up into words or even word fragments. Each of these words or word fragments is treated as an individual key value and included in the index.

In an X-Query expression we can use the operator ~= (contains) to search for a word contained in a specific node.

A typical example for the use of text retrieval is the description element in the document type style. The query:

style[description~="question"]

finds all jazz styles that contain the word "question" in the element description. This query is always possible, regardless of whether or not we index description. However, if we specify description as a text index, the query is processed much more efficiently. Here is how a definition of the description element in the schema might look:

<xs:element name = "description">
  <xs:annotation>
    <xs:appinfo>
      <tsd:elementInfo>
        <tsd:physical>
          <tsd:native>
            <tsd:index>
              <tsd:text/>
            </tsd:index>
          </tsd:native>
        </tsd:physical>
      </tsd:elementInfo>
    </xs:appinfo>
  </xs:annotation>
  <xs:complexType>
    <xs:sequence>
      <xs:any processContents="skip"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

A corresponding document instance may contain the following element:

<description>
  <XHTML>
    To be-bop or not to be-bop, there is no question
  </XHTML>
</description>

The element was defined as an any element. The property processContents="skip" allows the element to contain any markup which is not checked against the schema definition (provided the document content model was defined as "open"). Here we have stored a simple XHTML text. The index property was defined as "text". Consequently, any word contained in the description would be added to the index and queries for words contained in the description can be answered efficiently with the help of the index.

The contains operator (~=) can be used in conjunction with the proximity operators adj and near. These operators can be used to search for adjacent words: adj requires the words in the specified sequence; near, in contrast, does not care about the specified sequence. For example, both:

style[description~="not" adj "to"]
style[description~="not" near "to"]

are successful;

style[description~= "to" near "not"]

is successful, whereas:

style[description~= "to" adj "not"]

fails.

In XQuery 4 we have similar possibilities. The functions tf:containsText, tf:containsAdjacentText and tf:containsNearText support text retrieval operations.

The contains operator is case insensitive, and it can also use wildcards to search for word fragments. For example, the search strings "*-bop", "BE-*", "B*p", "*e-bo*" would find all descriptions that contain "be-bop". However, the last search string "*e-bo*" only uses the index if the database parameter Word Fragment Index is set to "yes" (Tamino Manager). This option is set to "no" by default, because using the word fragment index causes substantial overhead (all possible word fragments must be extracted from words and stored as index values!).

Even without using the word fragment index, text indexes require more overhead than standard indexes. We should therefore use the text index facility only for those nodes that we really plan to access with the X-Query contains (~=) operator or the XQuery text retrieval operators.

Words that are very likely to be used as indexes should be defined in load lists. This can be done by adding a load list document to our database into collection ino:vocabulary. For example:

<?xml version="1.0" encoding="iso-8859-1" ?>
<ino:loadlist ino:loadlistname="myloadlist"
              xmlns:ino="http://namespaces.softwareag.com/tamino/response2">
  <ino:word>jazz</ino:word>
  <ino:word>blues</ino:word>
  <ino:word>swing</ino:word>
  <ino:word>ragtime</ino:word>
</ino:loadlist>

The required schema is already defined in Tamino. It is possible to define several load lists (with different names). When a database is started, Tamino concatenates all load lists stored in the database and pre-loads the words contained in them for the indexing to speed up the loading of documents.

Top of page