Indexing is one of the principal features that distinguish a database from a simple file system. In a file system, searching for a specific element value usually requires scanning through the whole document set and parsing all the documents. A database system, in contrast, can organize much more efficient access paths by means of indexes.
Tamino can index attributes and simple elements as well as complex elements, and it allows two different categories of index: standard index and text index. Within these categories, various mechanisms can be used, which give rise to simple indexes, multi-path indexes, compound indexes and reference indexes.
Tamino's annotations (see Tamino Annotations in XMLSchema::Node Level Definitions) also indicate whether or not document nodes should be indexed, and which type of indexing is to be used:
- Standard Indexing
Standard indexes are the classical database indexes. The complete node content is used as the index value. When a document is stored or updated, Tamino inserts standard indexes into a specific index, enabling efficient queries. Tamino can index all XML Schema datatypes.
- Text Indexing
Here, Tamino analyzes the content of an indexed node word by word and stores the individual words in an index. We discuss this technique in more detail in Text Retrieval.
- Standard and Text Indexing
This may be required in special cases but should, in general, be avoided because of the high overhead.
Indexes can be declared for attributes and for leaf elements, but also
for elements that contain child elements. In the latter case the index is
constructed from the concatenated text of the element and its child elements,
i.e. from the result of the text()
function applied to that
element. For example, if we had:
<name><first>Louis</first><last>Armstrong</last></name>
and declared an index for <name>
, the string
"LouisArmstrong" would be used as index value.
The way in which the markup is treated in a given database
can be controlled by setting the database property markup as
delimiter
. This can be done using the Tamino Manager. If the
property is set to "yes", markup is treated as a
delimiter for text (i.e. it is handled like white space). If it is set to
"mixed", markup is not treated as a delimiter in
mixed-content elements. If the property is set to the default value
"no", markup is not treated as a delimiter.
Indexes are not necessarily required to perform queries. Tamino can interpret any valid query expression, even if an index is not defined. However, this can be costly. For example, the query
jazzMusician[@ID="ColtraneJohn"]
would require Tamino to read all
jazzMusician
instances, parse them to extract the attribute
ID
, and compare the ID
attribute value of each
document with the search string. This may be practicable if we have only a
small document base with half a dozen documents, but is out of the question if
our database contains thousands of jazzMusician
instances.
In contrast, if the attribute ID
is declared as an index,
Tamino would only read and parse those
jazzMusician
documents that have
"ColtraneJohn" as the value of the attribute
ID
.
So far, we have only discussed indexes for document nodes that are defined in the schema. But what about markup that is not defined in the schema but only appears in individual document instances, for example in a wildcard? Can Tamino apply indexing to these document extensions as well?
The answer is yes. The optional element
tsd:physical/tsd:structureIndex
in the tsd:doctype
declaration allows Tamino to add nodes found in
document instances to its internal repository, and thus to execute queries
using such nodes much faster.
For example, the document instance:
<?xml version="1.0"?> <jazzMusician xmlns="http://www.softwareag.com/tamino/doc/examples/models/jazz/encyclopedia" type="instrumentalist" ID="GillespieDizzy"> <name> <first>Dizzy</first> <last>Gillespie</last> </name> <birthDate>1917-10-21</birthDate> <deathDate>1993-01-06</deathDate> </jazzMusician>
contains an element that was not declared in the schema:
deathDate
. Without the structureIndex
property, a
query on deathDate
would take a long time, but setting
structureIndex
to "FULL" or
"CONDENSED" speeds up the query considerably. The
difference between "FULL" and
"CONDENSED" is that
"CONDENSED" only remembers that an element
deathDate
is used in at least one instance of this document type.
In contrast, "FULL" remembers in which
documents such an element exists. This requires more memory but may make the
query even faster when only a few documents contain this element.
The query jazzMusician[deathDate="1959-07-17"]
, for
example, would have the following effect:
structureIndex="NO"
A full scan of all
jazzMusician
instances is required.structureIndex="CONDENSED"
If no
jazzMusician
instance has a nodejazzMusician/deathDate
then the query can be answered quickly (no object returned). If at least one instance has a nodejazzMusician/deathDate
then it is still necessary to scan alljazzMusician
instances in order to answer the query.structureIndex="FULL"
Since many jazz musicians are still alive, the query can be answered faster because Tamino knows which
jazzMusician
instances have a nodejazzMusician/deathDate
. Only those instances are scanned.
structureIndex
should be set to
"NO" when we expect elements that contain random
markup (e.g. XHTML). This would be the case for the description
element in our style
document type.
Note that after an initial set of indexes has been
defined, it is possible to add or remove indexes subsequently by using update
schema operations, based on the X-Machine command
_define
.
Generally, we should declare the primary keys of all business objects as
standard indexes. In our case, these are the attributes @ID
for
the business objects jazzMusician
, critic
, and
collaboration
, @name
for business object
style
, and @albumNo
for business object
album
. The only business object that does not require an explicit
index is review
, because we access review
objects via
URL.
In addition, foreign keys are also prime candidates for indexes. For
example, jazzMusician/influence/influences/@ID
can be a useful
index when we want to find out by whom a given jazz musician was influenced.
For example:
jazzMusician[influence/influences/@ID="ParkerCharlie"]
would certainly result in an impressive list of
jazzMusician
documents, including, for example, those of John
Coltrane and Miles Davis. Similarly, we could index
jazzMusician/belongsTo/style/@name
, too, in order to be able to
quickly find all musicians belonging to a particular style.
The general rule is to index all those nodes which are frequently used as access paths for queries, and not to index nodes that are only rarely used in queries. Indexing speeds up queries but it also slows down insert, update and delete operations and consumes disk space. The conceptual model gives us a first hint for the possible access paths. However, later fine tuning (see also Efficient Querying) requires us to study the actual usage pattern of the database before making a final decision about which nodes to index.
Finally, we have to consider the case when a primary key
is a composite, i.e. it consists of several nodes. For example, this would be
the case if we use the node name
as a primary key for business
object jazzMusician
. This node comprises the child elements
first
, middle
, last
. We can choose to
create indexes for all of them or only for one or two. We could also use a
Tamino compound index to integrate all three elements in a single index. In our
case, it might be sufficient to create an index only for the element
last
, because this element can narrow down a search sufficiently.
Querying for Pat Metheny with an XPath expression would then look like
this:
jazzMusician[name/last="Metheny" and name/first="Pat" and not(name/middle)]
(Note that the expression name/middle=""
would not work
because it would require that the element middle
exists.)
Keys derived from document elements or attributes are only one of several methods to locate a document:
URLs can be used to access any resource on the Internet. However, we should note that a URL does not establish an object's identity. A URL specifies a location and not an object. When a document is moved to another location, its URL changes.
When storing documents and other data in
Tamino, Tamino allows a
document name to be assigned to such an object. This name is stored in the
internal attribute ino:docname
. This method of object
identification is mostly used for non-XML documents stored in
Tamino, but it can, of course, also be used for XML
documents. For example, if we store a JPEG image in database jazz
in collection encyclopedia
under the name
dizzy.jpg, we can retrieve it with the following URL:
"http://localhost/tamino/jazz/encyclopedia/dizzy.jpg".
We are going to use this addressing mechanism for review
documents. Remember that we specified the review business object with a key
URL
of type xs:anyURI
. Instead of implementing an
explicit child element URL
, we use the ino:docname
feature. This allows us to access review
documents directly via
URL instead of using an XPath or XQuery 4 expression.
Generated Identifiers like the attribute ino:id
can be used to identify uniquely an object within a certain context.
ino:id
, for example, is generated by
Tamino when a new document is stored. It is unique
for the specific document type. It is, for example, used when an existing
document is replaced: in this case the ino:id
of the existing
document is specified in the new document instance. This causes
Tamino not to store the document under a new
ino:id
but instead to overwrite the document with the specified
ino:id
.
In general, the ino:id
should only be used for
programmatic access to a document but not as a "permanent"
reference to other documents; that is, the ino:id
of a document
should not be stored in other documents. Since the ino:id
is an
internally generated identifier, it may, for example, change when a document
subset is moved to another database. This could render the references invalid.
The same is true for internally generated identifiers of other database
systems.
Tamino allows documents to be located via
URLs consisting of database location, collection name, document type name and
@ino:id
:
"http://localhost/tamino/jazz/encyclopedia/jazzMusician/@33"
Primary Keys are derived from object properties, i.e. from the values of document elements and attributes. In the simplest case they are derived from a single value, either a simple content type element or an attribute, provided this value is unique.
Composite keys are unique values that are composed from several non-unique elements and attributes. We recommend avoiding composite keys derived from complex elements: as pointed out earlier (From Conceptual Model to Schema::Normalization), such keys are not robust against transformations. As shown above, they are also awkward to query. As we show later (Efficient Queries), composite keys are also detrimental to performance when they contain optional elements.
Keys are unique only in a given context, for example, in the context of a certain document type in a given Tamino database (see section Schema-Level Definitions). Keys do not establish a global object identity.
Tamino can locate documents by key via a URL consisting of database location, collection name, and query string: "http://localhost/tamino/jazz/encyclopedia? _XQL=jazzMusician[@ID="ColemanOrnette"]"
Globally Unique Identifiers are not derived from the element and attribute values of a document but are generated with a suitable algorithm. They can establish a unique object identity over the boundaries of a given database or server. Typical examples for such identifiers are the UUID (Universal Unique Identifier) and URI based identifiers.
UUIDs are used by several XML-based standards as global identifiers.
Each UUID is 128 bits long and consists usually of a 60-bit time stamp and a 48-bit network address (IEEE 802). Good generators can cater for cases when the hardware clock does not have the required resolution (by counting UUIDs with the same clock value). If a network address is not available, a random number is used instead. In this case it cannot be guaranteed that UUIDs are unique, although a conflict is very unlikely.
The advantage of UUIDs is that they are easy to generate and are quite
short. The disadvantage is that they are not friendly to the human eye:
<jazzMusician
ID="0076B468-EB27-42E5-AC09-9955CFF462A3">
.
Other XML-based standards such as XML Namespaces or XTM (XML Topic Maps) use domain name based identifiers. These identifiers consist of a registered domain name, an additional path expression that identifies the document type within the domain, and the primary key of that document type. Although they look syntactically very much like URLs, it is important to realize that their path specification does not define the actual location of an object but a virtual position in a semantic space. Therefore, these identifiers must also be stored as a key in the identified object:
<jazzMusician
ID="http://www.jazzServer.org/people/jazzMusician/EllingtonDuke">
.
The advantage of these identifiers is that they usually give the human reader some context information about the object. The disadvantage is that they can be quite long, and that the generation of these identifiers can require some bookkeeping.
Text retrieval is one of the facilities where Tamino's query languages X-Query and XQuery extend the functionality of the W3C standards XPath and XQuery. Text retrieval means that the content of a node is broken up into words or even word fragments. Each of these words or word fragments is treated as an individual key value and included in the index.
In an X-Query expression we can use the operator
~=
(contains) to search for a word contained in a specific
node.
A typical example for the use of text retrieval is the
description
element in the document type style
. The
query:
style[description~="question"]
finds all jazz styles that contain the word
"question" in the element description
.
This query is always possible, regardless of whether or not we index
description
. However, if we specify description
as a
text index, the query is processed much more efficiently. Here is how a
definition of the description
element in the schema might
look:
<xs:element name = "description"> <xs:annotation> <xs:appinfo> <tsd:elementInfo> <tsd:physical> <tsd:native> <tsd:index> <tsd:text/> </tsd:index> </tsd:native> </tsd:physical> </tsd:elementInfo> </xs:appinfo> </xs:annotation> <xs:complexType> <xs:sequence> <xs:any processContents="skip"/> </xs:sequence> </xs:complexType> </xs:element>
A corresponding document instance may contain the following element:
<description> <XHTML> To be-bop or not to be-bop, there is no question </XHTML> </description>
The element was defined as an any element. The property
processContents="skip"
allows the element to contain any markup
which is not checked against the schema definition (provided the document
content model was defined as "open"). Here we have
stored a simple XHTML text. The index property was defined as
"text". Consequently, any word contained in the
description would be added to the index and queries for words contained in the
description can be answered efficiently with the help of the index.
The contains
operator (~=
) can
be used in conjunction with the proximity operators adj
and
near
. These operators can be used to search for adjacent words:
adj
requires the words in the specified sequence;
near
, in contrast, does not care about the specified sequence. For
example, both:
style[description~="not" adj "to"]
style[description~="not" near "to"]
are successful;
style[description~= "to" near "not"]
is successful, whereas:
style[description~= "to" adj "not"]
fails.
In XQuery 4 we have similar possibilities. The functions
tf:containsText
, tf:containsAdjacentText
and
tf:containsNearText
support text retrieval operations.
The contains
operator is case insensitive, and it can also
use wildcards to search for word fragments. For example, the search strings
"*-bop", "BE-*",
"B*p", "*e-bo*" would
find all descriptions that contain "be-bop".
However, the last search string "*e-bo*" only uses
the index if the database parameter Word Fragment Index
is set to
"yes" (Tamino Manager).
This option is set to "no" by default, because using
the word fragment index causes substantial overhead (all possible word
fragments must be extracted from words and stored as index values!).
Even without using the word fragment index, text indexes
require more overhead than standard indexes. We should therefore use the text
index facility only for those nodes that we really plan to access with the
X-Query contains
(~=
) operator or the XQuery text
retrieval operators.
Words that are very likely to be used as indexes should be defined in
load lists. This can be done by adding a load list document to our
database into collection ino:vocabulary
. For example:
<?xml version="1.0" encoding="iso-8859-1" ?> <ino:loadlist ino:loadlistname="myloadlist" xmlns:ino="http://namespaces.softwareag.com/tamino/response2"> <ino:word>jazz</ino:word> <ino:word>blues</ino:word> <ino:word>swing</ino:word> <ino:word>ragtime</ino:word> </ino:loadlist>
The required schema is already defined in Tamino. It is possible to define several load lists (with different names). When a database is started, Tamino concatenates all load lists stored in the database and pre-loads the words contained in them for the indexing to speed up the loading of documents.