Indexing

Indexing is one of the principal features that distinguish a database from a simple file system. In a file system, searching for a specific element value usually requires scanning through the whole document set and parsing all the documents. A database system, in contrast, can organize much more efficient access paths by means of indexes.

Tamino can index attributes and simple elements as well as complex elements, and it allows two different categories of index: standard index and text index. Within these categories, various mechanisms can be used, which give rise to simple indexes, multi-path indexes, compound indexes and reference indexes.


Declaring an Index

Tamino's annotations (see Tamino Annotations in XMLSchema::Node Level Definitions) also indicate whether or not document nodes should be indexed, and which type of indexing is to be used:

Standard Indexing

Standard indexes are the classical database indexes. The complete node content is used as the index value. When a document is stored or updated, Tamino inserts standard indexes into a specific index, enabling efficient queries. Tamino can index all XML Schema datatypes.

Text Indexing

Here, Tamino analyzes the content of an indexed node word by word and stores the individual words in an index. We discuss this technique in more detail in Text Retrieval.

Standard and Text Indexing

This may be required in special cases but should, in general, be avoided because of the high overhead.

Indexes can be declared for attributes and for leaf elements, but also for elements that contain child elements. In the latter case the index is constructed from the concatenated text of the element and its child elements, i.e. from the result of the text() function applied to that element. For example, if we had:

<name><first>Louis</first><last>Armstrong</last></name>

and declared an index for <name>, the string "LouisArmstrong" would be used as index value.

The way in which the markup is treated in a given database can be controlled by setting the database property markup as delimiter. This can be done using the Tamino Manager. If the property is set to "yes", markup is treated as a delimiter for text (i.e. it is handled like white space). If it is set to "mixed", markup is not treated as a delimiter in mixed-content elements. If the property is set to the default value "no", markup is not treated as a delimiter.

Indexes are not necessarily required to perform queries. Tamino can interpret any valid query expression, even if an index is not defined. However, this can be costly. For example, the query

jazzMusician[@ID="ColtraneJohn"]

would require Tamino to read all jazzMusician instances, parse them to extract the attribute ID, and compare the ID attribute value of each document with the search string. This may be practicable if we have only a small document base with half a dozen documents, but is out of the question if our database contains thousands of jazzMusician instances.

In contrast, if the attribute ID is declared as an index, Tamino would only read and parse those jazzMusician documents that have "ColtraneJohn" as the value of the attribute ID.

So far, we have only discussed indexes for document nodes that are defined in the schema. But what about markup that is not defined in the schema but only appears in individual document instances, for example in a wildcard? Can Tamino apply indexing to these document extensions as well?

The answer is yes. The optional element tsd:physical/tsd:structureIndex in the tsd:doctype declaration allows Tamino to add nodes found in document instances to its internal repository, and thus to execute queries using such nodes much faster.

graphics/structindex.png

For example, the document instance:

<?xml version="1.0"?>
<jazzMusician
    xmlns="http://www.softwareag.com/tamino/doc/examples/models/jazz/encyclopedia"
    type="instrumentalist"
    ID="GillespieDizzy">
  <name>
    <first>Dizzy</first>
    <last>Gillespie</last>
  </name>
  <birthDate>1917-10-21</birthDate>
  <deathDate>1993-01-06</deathDate>
</jazzMusician>

contains an element that was not declared in the schema: deathDate. Without the structureIndex property, a query on deathDate would take a long time, but setting structureIndex to "FULL" or "CONDENSED" speeds up the query considerably. The difference between "FULL" and "CONDENSED" is that "CONDENSED" only remembers that an element deathDate is used in at least one instance of this document type. In contrast, "FULL" remembers in which documents such an element exists. This requires more memory but may make the query even faster when only a few documents contain this element.

The query jazzMusician[deathDate="1959-07-17"], for example, would have the following effect:

structureIndex="NO"

A full scan of all jazzMusician instances is required.

structureIndex="CONDENSED"

If no jazzMusician instance has a node jazzMusician/deathDate then the query can be answered quickly (no object returned). If at least one instance has a node jazzMusician/deathDate then it is still necessary to scan all jazzMusician instances in order to answer the query.

structureIndex="FULL"

Since many jazz musicians are still alive, the query can be answered faster because Tamino knows which jazzMusician instances have a node jazzMusician/deathDate. Only those instances are scanned.

structureIndex should be set to "NO" when we expect elements that contain random markup (e.g. XHTML). This would be the case for the description element in our style document type.

Note that after an initial set of indexes has been defined, it is possible to add or remove indexes subsequently by using update schema operations, based on the X-Machine command _define.

Candidates for Indexes

Generally, we should declare the primary keys of all business objects as standard indexes. In our case, these are the attributes @ID for the business objects jazzMusician, critic, and collaboration, @name for business object style, and @albumNo for business object album. The only business object that does not require an explicit index is review, because we access review objects via URL.

In addition, foreign keys are also prime candidates for indexes. For example, jazzMusician/influence/influences/@ID can be a useful index when we want to find out by whom a given jazz musician was influenced. For example:

jazzMusician[influence/influences/@ID="ParkerCharlie"]

would certainly result in an impressive list of jazzMusician documents, including, for example, those of John Coltrane and Miles Davis. Similarly, we could index jazzMusician/belongsTo/style/@name, too, in order to be able to quickly find all musicians belonging to a particular style.

The general rule is to index all those nodes which are frequently used as access paths for queries, and not to index nodes that are only rarely used in queries. Indexing speeds up queries but it also slows down insert, update and delete operations and consumes disk space. The conceptual model gives us a first hint for the possible access paths. However, later fine tuning (see also Efficient Querying) requires us to study the actual usage pattern of the database before making a final decision about which nodes to index.

Composite Keys

Finally, we have to consider the case when a primary key is a composite, i.e. it consists of several nodes. For example, this would be the case if we use the node name as a primary key for business object jazzMusician. This node comprises the child elements first, middle, last. We can choose to create indexes for all of them or only for one or two. We could also use a Tamino compound index to integrate all three elements in a single index. In our case, it might be sufficient to create an index only for the element last, because this element can narrow down a search sufficiently. Querying for Pat Metheny with an XPath expression would then look like this:

jazzMusician[name/last="Metheny"
             and name/first="Pat"
             and not(name/middle)]

(Note that the expression name/middle="" would not work because it would require that the element middle exists.)

Object Identity

Keys derived from document elements or attributes are only one of several methods to locate a document:

  • URLs can be used to access any resource on the Internet. However, we should note that a URL does not establish an object's identity. A URL specifies a location and not an object. When a document is moved to another location, its URL changes.

    When storing documents and other data in Tamino, Tamino allows a document name to be assigned to such an object. This name is stored in the internal attribute ino:docname. This method of object identification is mostly used for non-XML documents stored in Tamino, but it can, of course, also be used for XML documents. For example, if we store a JPEG image in database jazz in collection encyclopedia under the name dizzy.jpg, we can retrieve it with the following URL: "http://localhost/tamino/jazz/encyclopedia/dizzy.jpg".

    We are going to use this addressing mechanism for review documents. Remember that we specified the review business object with a key URL of type xs:anyURI. Instead of implementing an explicit child element URL, we use the ino:docname feature. This allows us to access review documents directly via URL instead of using an XPath or XQuery 4 expression.

  • Generated Identifiers like the attribute ino:id can be used to identify uniquely an object within a certain context. ino:id, for example, is generated by Tamino when a new document is stored. It is unique for the specific document type. It is, for example, used when an existing document is replaced: in this case the ino:id of the existing document is specified in the new document instance. This causes Tamino not to store the document under a new ino:id but instead to overwrite the document with the specified ino:id.

    In general, the ino:id should only be used for programmatic access to a document but not as a "permanent" reference to other documents; that is, the ino:id of a document should not be stored in other documents. Since the ino:id is an internally generated identifier, it may, for example, change when a document subset is moved to another database. This could render the references invalid. The same is true for internally generated identifiers of other database systems.

    Tamino allows documents to be located via URLs consisting of database location, collection name, document type name and @ino:id: "http://localhost/tamino/jazz/encyclopedia/jazzMusician/@33"

  • Primary Keys are derived from object properties, i.e. from the values of document elements and attributes. In the simplest case they are derived from a single value, either a simple content type element or an attribute, provided this value is unique.

    Composite keys are unique values that are composed from several non-unique elements and attributes. We recommend avoiding composite keys derived from complex elements: as pointed out earlier (From Conceptual Model to Schema::Normalization), such keys are not robust against transformations. As shown above, they are also awkward to query. As we show later (Efficient Queries), composite keys are also detrimental to performance when they contain optional elements.

    Keys are unique only in a given context, for example, in the context of a certain document type in a given Tamino database (see section Schema-Level Definitions). Keys do not establish a global object identity.

    Tamino can locate documents by key via a URL consisting of database location, collection name, and query string: "http://localhost/tamino/jazz/encyclopedia? _XQL=jazzMusician[@ID="ColemanOrnette"]"

  • Globally Unique Identifiers are not derived from the element and attribute values of a document but are generated with a suitable algorithm. They can establish a unique object identity over the boundaries of a given database or server. Typical examples for such identifiers are the UUID (Universal Unique Identifier) and URI based identifiers.

    UUIDs are used by several XML-based standards as global identifiers.

    Each UUID is 128 bits long and consists usually of a 60-bit time stamp and a 48-bit network address (IEEE 802). Good generators can cater for cases when the hardware clock does not have the required resolution (by counting UUIDs with the same clock value). If a network address is not available, a random number is used instead. In this case it cannot be guaranteed that UUIDs are unique, although a conflict is very unlikely.

    The advantage of UUIDs is that they are easy to generate and are quite short. The disadvantage is that they are not friendly to the human eye: <jazzMusician ID="0076B468-EB27-42E5-AC09-9955CFF462A3">.

    Other XML-based standards such as XML Namespaces or XTM (XML Topic Maps) use domain name based identifiers. These identifiers consist of a registered domain name, an additional path expression that identifies the document type within the domain, and the primary key of that document type. Although they look syntactically very much like URLs, it is important to realize that their path specification does not define the actual location of an object but a virtual position in a semantic space. Therefore, these identifiers must also be stored as a key in the identified object:

    <jazzMusician ID="http://www.jazzServer.org/people/jazzMusician/EllingtonDuke">.

    The advantage of these identifiers is that they usually give the human reader some context information about the object. The disadvantage is that they can be quite long, and that the generation of these identifiers can require some bookkeeping.

Text Retrieval

Text retrieval is one of the facilities where Tamino's query languages X-Query and XQuery extend the functionality of the W3C standards XPath and XQuery. Text retrieval means that the content of a node is broken up into words or even word fragments. Each of these words or word fragments is treated as an individual key value and included in the index.

In an X-Query expression we can use the operator ~= (contains) to search for a word contained in a specific node.

A typical example for the use of text retrieval is the description element in the document type style. The query:

style[description~="question"]

finds all jazz styles that contain the word "question" in the element description. This query is always possible, regardless of whether or not we index description. However, if we specify description as a text index, the query is processed much more efficiently. Here is how a definition of the description element in the schema might look:

<xs:element name = "description">
  <xs:annotation>
    <xs:appinfo>
      <tsd:elementInfo>
        <tsd:physical>
          <tsd:native>
            <tsd:index>
              <tsd:text/>
            </tsd:index>
          </tsd:native>
        </tsd:physical>
      </tsd:elementInfo>
    </xs:appinfo>
  </xs:annotation>
  <xs:complexType>
    <xs:sequence>
      <xs:any processContents="skip"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

A corresponding document instance may contain the following element:

<description>
  <XHTML>
    To be-bop or not to be-bop, there is no question
  </XHTML>
</description>

The element was defined as an any element. The property processContents="skip" allows the element to contain any markup which is not checked against the schema definition (provided the document content model was defined as "open"). Here we have stored a simple XHTML text. The index property was defined as "text". Consequently, any word contained in the description would be added to the index and queries for words contained in the description can be answered efficiently with the help of the index.

The contains operator (~=) can be used in conjunction with the proximity operators adj and near. These operators can be used to search for adjacent words: adj requires the words in the specified sequence; near, in contrast, does not care about the specified sequence. For example, both:

style[description~="not" adj "to"]
style[description~="not" near "to"]

are successful;

style[description~= "to" near "not"]

is successful, whereas:

style[description~= "to" adj "not"]

fails.

In XQuery 4 we have similar possibilities. The functions tf:containsText, tf:containsAdjacentText and tf:containsNearText support text retrieval operations.

The contains operator is case insensitive, and it can also use wildcards to search for word fragments. For example, the search strings "*-bop", "BE-*", "B*p", "*e-bo*" would find all descriptions that contain "be-bop". However, the last search string "*e-bo*" only uses the index if the database parameter Word Fragment Index is set to "yes" (Tamino Manager). This option is set to "no" by default, because using the word fragment index causes substantial overhead (all possible word fragments must be extracted from words and stored as index values!).

Even without using the word fragment index, text indexes require more overhead than standard indexes. We should therefore use the text index facility only for those nodes that we really plan to access with the X-Query contains (~=) operator or the XQuery text retrieval operators.

Words that are very likely to be used as indexes should be defined in load lists. This can be done by adding a load list document to our database into collection ino:vocabulary. For example:

<?xml version="1.0" encoding="iso-8859-1" ?>
<ino:loadlist ino:loadlistname="myloadlist"
              xmlns:ino="http://namespaces.softwareag.com/tamino/response2">
  <ino:word>jazz</ino:word>
  <ino:word>blues</ino:word>
  <ino:word>swing</ino:word>
  <ino:word>ragtime</ino:word>
</ino:loadlist>

The required schema is already defined in Tamino. It is possible to define several load lists (with different names). When a database is started, Tamino concatenates all load lists stored in the database and pre-loads the words contained in them for the indexing to speed up the loading of documents.