Integrity

In classical relational databases, integrity rules and triggers are used to maintain the integrity of the information stored in the database. Integrity means that the constraints defined in the conceptual model are not violated and that the data structures defined in the conceptual model are kept intact. This is possible by applying integrity rules and triggers within the same transactional context as the operations that modify the stored information.

This last condition, the transactional context, becomes impossible to satisfy once we extend our data model beyond the boundaries of traditional enterprise databases. When a model includes data from sources somewhere on the World Wide Web, the database system can no longer guarantee the integrity of data structures that reach beyond its transactional environment. For example, a database cannot "lock" foreign web resources during a transaction, and thus cannot stop other users from interfering with that transaction.

In addition, web resources may be temporarily unavailable. Hardware is also becoming increasingly mobile, whether as PDAs carried by travelers or as devices on wireless LANs. In these cases, it is not always possible to satisfy integrity constraints immediately, and instead of transactional integrity techniques we need synchronization techniques to keep the data model consistent in the long term.

In general, the resource manager (i.e. the database) is the wrong place to enforce data integrity. In many cases this task is better left to the application logic, or to appropriate middleware.

In the following sections we indicate how constraints can be defined for XML documents. In Tamino, triggers are the method of choice for implementing constraints. For details, see the description of trigger functions in the chapter Tamino Server Extension Functions in the documentation for server extensions.


Simple Constraints

Constraints are used to add more meaning to a model. During the definition of the XML schemas we have already added a considerable set of constraints to our model: datatypes. Each datatype such as string, float or integer constrains the value domain of an element or attribute. Additional constraints are enumerations or type parameters (facets) such as totalDigits, maxLength, minExclusive, etc.
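As an illustration of such value-domain constraints, the following sketch shows an enumeration and a facet in plain W3C XML Schema syntax. The three enumeration values are the musician types mentioned in this chapter; the type names musicianKind and shortName and the length limit of 40 are hypothetical, and any Tamino-specific schema annotations are left out.

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Enumeration: restricts the value domain to three musician kinds.
       The type name "musicianKind" is hypothetical. -->
  <xs:simpleType name="musicianKind">
    <xs:restriction base="xs:string">
      <xs:enumeration value="instrumentalist"/>
      <xs:enumeration value="jazzComposer"/>
      <xs:enumeration value="jazzSinger"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- Facet: a string of at most 40 characters. The limit is an
       arbitrary illustration, not a value taken from the model. -->
  <xs:simpleType name="shortName">
    <xs:restriction base="xs:string">
      <xs:maxLength value="40"/>
    </xs:restriction>
  </xs:simpleType>

</xs:schema>
```

A validating parser would reject, for example, a value of "drummer" for an attribute declared with the type musicianKind.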

Another type of constraint is the cardinality constraint, which can be defined in schemas using minOccurs and maxOccurs. For example, by decorating the element

<xs:element name = "jazzMusician" type = "xs:string"
            minOccurs = "2" maxOccurs = "unbounded" />

in collaboration, we set up a constraint that a collaboration must consist of at least two jazz musicians. Actually, an element with no minOccurs/maxOccurs decoration at all has the strictest constraint: both attributes default to 1, so it requires a cardinality of 1..1. The weakest cardinality constraint is minOccurs = "0" maxOccurs = "unbounded", which leaves all possibilities open.

All these constraints can be checked by a validating parser. This happens, for example, when a document is inserted into or updated in Tamino.

Cross Field Constraints

What interests us in this context are constraints that affect more than one element or attribute. For example, we want to make sure that a jazz musician of type instrumentalist plays at least one instrument, whereas other types of jazz musicians (jazzComposer, jazzSinger) are not required to play an instrument. Here, the standard trigger functions of Tamino can be used to perform the constraint checking.
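The condition itself, independent of how a trigger enforces it, can be stated as an XPath assertion. The following fragment is only a sketch in the same rule notation that this chapter uses for cross-document constraints. It assumes that instruments are recorded as plays/instrument child elements of jazzMusician; these element names are hypothetical and do not come from the schema.

```xml
<!-- Applies only to musicians of type instrumentalist; jazzComposer
     and jazzSinger documents are not checked by this rule. -->
<rule context = "jazzMusician[@type='instrumentalist']">
  <!-- plays/instrument is an assumed structure, see the text above -->
  <assert test = "count(plays/instrument) &gt;= 1">
    Instrumentalist <value-of select="name/last"/> must play at least one instrument!
  </assert>
</rule>
```

Because the restriction to instrumentalists sits in the context expression, the assertion fires only where the cross-field condition is meant to hold.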

Constraints Across Documents

The document() function, which XSLT adds to the XPath 1.0 function library, can be used to access multiple documents in a single query. This allows us to formulate constraints that span multiple documents. Let us assume that we have the following collaboration and jazzMusician documents stored in a Tamino database http://localhost/tamino/jazz/ in collection encyclopedia:

<?xml version="1.0"?>
<collaboration type="jamSession">
  <name>post-election jam</name>
  <jazzMusician>http://localhost/tamino/jazz/encyclopedia/dizzy.xml</jazzMusician>
  <jazzMusician>http://localhost/tamino/jazz/encyclopedia/parker.xml</jazzMusician>
  <performedAt>
    <location>Blues House</location>
    <time>1965-10-21T20:00:00</time>
  </performedAt>
</collaboration>

<?xml version="1.0"?>
<jazzMusician ID="ParkerCharlie" type="instrumentalist">
  <name>
    <first>Charlie</first>
    <last>Parker</last>
  </name>
  <birthDate>1920-08-19</birthDate>
</jazzMusician>

<?xml version="1.0"?>
<jazzMusician ID="GillespieDizzy" type="instrumentalist">
  <name>
    <first>Dizzy</first>
    <last>Gillespie</last>
  </name>
  <birthDate>1917-10-21</birthDate>
</jazzMusician>

We want to check that the performance date of the jam session is not earlier than the birth dates of its participants. We can achieve this with the following rule:

<rule context = "collaboration[@type='jamSession']/jazzMusician">
  <assert test = "number(translate(document(.)/*/birthDate,'1234567890-','1234567890')) &lt;
                  number(translate(substring(../performedAt/time,1,10),'1234567890-','1234567890'))">
    No jam for unborn child <value-of select="document(.)/*/name/last"/>!
  </assert>
</rule>

As we can see, the rule is executed in the context collaboration[@type='jamSession']/jazzMusician. The filter expression restricts the application of the rule to collaborations of type jamSession. The content of the element jazzMusician is used as a URL to locate the appropriate document (document(.)). From this document we fetch the element birthDate.

The translate() function removes the dashes from the ISO date string before the string is converted into a number. The same is done with the date part of the element performedAt/time of the current document. Both numbers are then compared using the < operator, which must be written as &lt; inside an attribute value. This rather clumsy process of stripping and conversion is necessary because XPath 1.0 does not support order relations between strings (strings can only be compared for equality) and, of course, XPath 1.0 does not support XML Schema datatypes. XPath 2.0 should improve this situation substantially.
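To see what the two operands evaluate to, the conversion can be tried in isolation. The following XSLT 1.0 stylesheet is a minimal sketch that applies the same translate() and number() calls to Charlie Parker's birth date and the session time from the sample documents, here written as literal strings. The variable names born and played are hypothetical, and the stylesheet ignores its input, so it can be run against any XML document.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Runs once for the root of the input document; the input is ignored. -->
  <xsl:template match="/">
    <!-- Birth date with the dashes removed, converted to a number -->
    <xsl:variable name="born"
      select="number(translate('1920-08-19','1234567890-','1234567890'))"/>
    <!-- Date part (first 10 characters) of the time, treated the same way -->
    <xsl:variable name="played"
      select="number(translate(substring('1965-10-21T20:00:00',1,10),'1234567890-','1234567890'))"/>

    <xsl:value-of select="$born"/>
    <xsl:text>&#10;</xsl:text>
    <xsl:value-of select="$played"/>
    <xsl:text>&#10;</xsl:text>
    <!-- The comparison performed by the assert -->
    <xsl:value-of select="$born &lt; $played"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

The output is the three lines 19200819, 19651021 and true. The comparison works because the year-month-day order of ISO dates makes the resulting numbers sort in the same order as the dates.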

To make the resulting report more informative, we also include the name of the offending musician in the error message. This is done with the value-of element.

Let us now assume that the collaboration document does not contain pointers (URLs) to the jazzMusician documents but instead identifies jazz musicians by their ID. This is what we actually want because usually URLs do not make good keys: they specify a location but do not identify a document.

<jazzMusician>GillespieDizzy</jazzMusician>
<jazzMusician>ParkerCharlie</jazzMusician>

We also assume that the documents are stored in Tamino. In this case we must replace all

document(.)/*

expressions with

document(concat('http://localhost/tamino/jazz/encyclopedia?_XQL=jazzMusician[@ID=&quot;',
              .,'&quot;]'))//jazzMusician

i.e., we construct an HTTP query to Tamino, such as:

http://localhost/tamino/jazz/encyclopedia?_XQL=jazzMusician[@ID="ParkerCharlie"]

and then extract the jazzMusician element, i.e. the root node of the matching document, from the result document returned.
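Applying this substitution to both places where the earlier rule uses document(.)/* gives the following sketch. It is simply the birth-date rule with the replacement expression written out in full. Line breaks have been added only between XPath tokens, and the query string is left unescaped inside the URL exactly as in the expression above.

```xml
<!-- Birth-date rule with musicians identified by ID instead of by URL -->
<rule context = "collaboration[@type='jamSession']/jazzMusician">
  <assert test = "number(translate(
                    document(concat('http://localhost/tamino/jazz/encyclopedia?_XQL=jazzMusician[@ID=&quot;',
                                    ., '&quot;]'))//jazzMusician/birthDate,
                    '1234567890-','1234567890')) &lt;
                  number(translate(substring(../performedAt/time,1,10),
                    '1234567890-','1234567890'))">
    No jam for unborn child <value-of
      select="document(concat('http://localhost/tamino/jazz/encyclopedia?_XQL=jazzMusician[@ID=&quot;',
                              ., '&quot;]'))//jazzMusician/name/last"/>!
  </assert>
</rule>
```

Note that the content of each jazzMusician element is concatenated into the query as is, so the ID must not be surrounded by whitespace.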

Data Integrity

Documents should only be written into the database after we have made sure that they do not violate the constraints imposed on them, i.e. that they comply with the application's business rules.

When a document is stored, Tamino checks the structural constraints and the datatype constraints defined in the document schema. This can be influenced by the content model definition for the document type. If the content model is set to "closed", Tamino only allows nodes that are defined in the document schema. Otherwise, Tamino allows additional nodes within a document instance.

Apart from that, as outlined above, other constraints may exist that cannot be appropriately described with XML Schema. Examples are cross-field constraints and cross-document constraints.

It is the application's responsibility to check for such constraints. In particular, the validation of cross-document constraints requires extra consideration for the transaction logic. To make the validation bulletproof, the validation and the following update must be performed in a single transaction with the isolation level set to "_shared" or "_protected". When doing so, we must apply the same guidelines for accessing multiple documents in one transaction as we outlined above in order to avoid deadlocks.

Unique Keys

Tamino's unique document key mechanism prevents users from storing multiple documents with the same key in a single document container (doctype). A key may be composed of one or more values of elements or attributes contained in the document. Incoming documents are checked against the specified uniqueness constraints and are rejected if a duplicate document key is identified. This is especially useful for the administration of user IDs and other IDs that have to be unique. Uniqueness can be set in the XML Schema for the document type.