Conceptual Modeling

We now transform this informal description into a more formal conceptual model.

In traditional conceptual modeling (such as Entity Relationship Diagrams or Object Role Modeling), nouns would end up as entities (or attributes) and verbs would end up as relationships. However, in the context of modern information systems this approach can create more problems than it solves. For example, what should we do with pairs such as collaboration/collaborate or review/reviews? Should we use the verb or the noun? Which represents the concept better: an entity or a relationship?


Introducing Asset-Oriented Modeling

This ambiguity is one reason for taking a different approach and removing the artificial separation between entity and relationship: in our model, both nouns and verbs become Assets. What is new here is that the classic relationships are treated on the same level as entities: as things. This syntactical reification (reify = to make into a thing) leads to a considerable simplification of the conceptual model, and has some other advantages, too. In particular, it results in models that are easy to transform into XML.

When modeling nouns and verbs as assets, there are two notable exceptions:

  • The verb "has" indicates either that an asset is attributed with a property, as in:

    A person has a name and a birth date.

    or that an asset aggregates other assets, as in:

    A saxophone has a mouthpiece.
  • The expression "is a" indicates that an asset acquires properties from another asset, as in:

    A jazz musician is a person.

The noun on the left-hand side (jazz musician) is usually the more specific term; the noun on the right-hand side (person) is usually the more general term.

Let us see how we can model a rather complex sentence like:

A jam session is performed at a location and at a particular time.

A possible resolution is to model jam_session, performance, and location as assets. The asset performance has a qualifying property: time. Usually we use the noun form (performance) of a verb (performed) to name the corresponding asset.

graphics/performance.png

Asset or Property?

In the most general sense, an asset is anything we can talk about. But for the purpose of modeling we want to categorize the things we can talk about into assets and properties.

  • In our informal description, properties are usually indicated by the verb "has": A saxophone has a color.

  • Anything which plays a certain role in the context of our business is definitely an asset.

Tip:
In many cases the distinction between a property and an asset can be made using a simple rule: A property can belong to an asset, but an asset cannot belong to a property. For example: a color cannot have a saxophone.

However, the distinction between both is not always so easy, and depends in some cases on the context. Take for example:

An instrument has a maker.
A saxophone has a mouthpiece.

In the context of our small knowledge base, maker and mouthpiece do not play any particular role. So it is acceptable to model them as properties of instrument or saxophone. However, in the context of a supply chain for musical instruments, maker and mouthpiece would certainly play a role (as manufacturer and product), so we would have to model them as assets.

An item that is only connected to a single asset (like maker) is always a candidate for becoming a property. In contrast, an item that also has other relationships must be modeled as an asset. Take for example:

A project can result in an album.

Here, album could be modeled as a property of project, if we had not specified:

An album is reviewed by critics.

Composite Properties

In contrast to classic Entity Relationship Diagrams, we allow composite properties. Above, we modeled performance and location (for example, Cotton Club, Savoy Ballroom, or Centralstation) as separate assets. But if these locations play no particular role in our business case, we can just use a composite property to represent these items.

For

A jam session is performed at a location and at a particular time.

we might define a property performed_at that includes the sub-properties location and time.

Similarly, the property name of asset person may include the sub-properties first, middle, and last. A mouthpiece may have sub-properties body and reed. This technique of composite (or nested) properties allows us to arrive at very compact models and – not surprisingly – results in very appropriate XML representations.
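The correspondence between nested properties and XML can be illustrated with a minimal Python sketch. It assumes that a composite property is represented as a dictionary and an atomic property as a plain value; the function name to_xml and the instance data (including the date) are invented for illustration, only the property names come from the text.

```python
import xml.etree.ElementTree as ET

def to_xml(name, value):
    """Turn a (possibly nested) property into an XML element.

    A dict stands for a composite property; any other value is atomic.
    """
    element = ET.Element(name)
    if isinstance(value, dict):
        for sub_name, sub_value in value.items():
            element.append(to_xml(sub_name, sub_value))
    else:
        element.text = str(value)
    return element

# Hypothetical instance data; only the property names come from the text.
jam_session = {
    "performed_at": {"location": "Cotton Club", "time": "1938-05-01T22:00"},
}

xml_text = ET.tostring(to_xml("jam_session", jam_session), encoding="unicode")
print(xml_text)
```

Each level of nesting in the composite property becomes one level of nested elements, which is why such models translate into XML with little friction.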

A Notation for Properties

We are now ready to introduce a more formal notation for assets.

The figure below shows the graphic representation of an asset. The first line contains the asset name. This is followed by a key definition (we discuss this later), a list of properties, and a list of constraints (also discussed later). The optional display label at the top of the asset is used when the names of asset instances should differ from the asset name. In this case the display label shows the possible names of asset instances. When an asset does not have instances (i.e. when the asset is abstract) the display label is grayed out.

graphics/jam.png

We introduce the following notation for properties:

Syntax: prop
Description: An atomic component without further structure.
Example: birthDate

Syntax: (...)
Description: A property particle, i.e. a structure consisting of several sub-properties. The parentheses contain nested expressions consisting of the structures described in the following entries.

Syntax: (sub,...,sub)
Description: Sequence (ordered list).
Example: name(first,middle?,last)
Here, we require that the sub-properties of name are always specified in the defined order. Queries can later rely on this order.

Syntax: (sub&...&sub)
Description: Bag (unordered list).
Example: reed(maker&grade)
Here, we do not prescribe a particular sequence in which maker and grade must be specified. Queries cannot rely on an order relation between both.

Syntax: (sub|...|sub)
Description: Choice (alternative).
Example: (period(from,to) | performedAt(location&time))
A property is either a period or it is a performedAt property.

Both properties and particles can be suffixed with one of the following modifiers:

Syntax: (no modifier)
Description: mandatory [1..1]
Example: last
A last name is always required.

Syntax: prop?
Description: optional [0..1]
Example: middle?
Not everybody has a middle name, so we make this property optional.

Syntax: prop+
Description: repeated [1..n]
Example: track+
An album has one or several tracks.

Syntax: prop*
Description: optional and repeated [0..n]
Example: album*
An arbitrary number of albums.

Syntax: prop[n..m]
Description: a minimum of n occurrences and a maximum of m occurrences with 0 <= n <= m
Example: track[1..25]
The number of tracks is restricted to 25 at most.
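The meaning of the modifiers can be sketched as a small Python function that maps a suffixed property to its minimum and maximum number of occurrences. This is only an illustration: the function name cardinality is hypothetical, it handles simple property names only (not particles in parentheses), and it uses None to stand for the unbounded "n".

```python
import re

def cardinality(prop):
    """Return (name, min, max) for a property with an optional modifier.

    max is None when the number of occurrences is unbounded.
    """
    match = re.fullmatch(r"(\w+)(\?|\+|\*|\[(\d+)\.\.(\d+)\])?", prop)
    if match is None:
        raise ValueError(f"not a simple property: {prop!r}")
    name, modifier, low, high = match.groups()
    if modifier is None:
        return name, 1, 1          # mandatory
    if modifier == "?":
        return name, 0, 1          # optional
    if modifier == "+":
        return name, 1, None       # repeated
    if modifier == "*":
        return name, 0, None       # optional and repeated
    low, high = int(low), int(high)
    if low > high:
        raise ValueError(f"n must not exceed m in {prop!r}")
    return name, low, high         # explicit [n..m] range

print(cardinality("middle?"))       # ('middle', 0, 1)
print(cardinality("track[1..25]"))  # ('track', 1, 25)
```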

The following notation allows recursively structured properties to be defined:

Syntax: label{...label...}
Description: A label defines a reference point to the expression within the curly braces. Later occurrences of the label are substituted with this expression. In particular, labels allow recursive structures to be defined.
Example: r{part(partNo,r*)}
This specifies a tree-like structure of parts. Note that the *-modifier allows the recursion to terminate: because r* may occur zero times, a part need not contain further parts, so the structure can be finite.

Given this notation, we now define our complete model. We have added a few more properties.

graphics/jazz.png

A Notation for Arcs

Assets are connected via directed arcs. You should not misinterpret these arcs as classic relationships in the sense of Entity Relationship Diagrams. (Remember that relationships are assets too.) For this reason, arcs do not have names; however, the origin of an arc may be decorated with a role name.

Similar to the notation used for properties, we use XML syntax to denote the cardinality of each arc:

+ 1..n
* 0..n
? 0..1
[n..m] n..m (0 <= n <= m)

Caution:
Avoid situations in which the constraints of the model can never be satisfied, for example:

graphics/cyclic.png

Here, each instance of asset type C requires the existence of at least two instances of asset type A and at most one instance of asset type B. This is in contradiction with the implicit constraint between B and A which dictates a 1:1 relation between both.

Important:
Therefore, we should always make sure that the intersection of all constraint cardinalities used within a cyclic structure is not empty. This is always the case when we use only the first three constraint types (+, *, and ?): their intersection always contains 1..1.
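The intersection rule can be sketched in a few lines of Python. The sketch assumes that each cardinality is written as a (min, max) pair with None for an unbounded maximum; the function name intersect is hypothetical, and the check covers only the intersection test stated above, not a full satisfiability analysis of a model.

```python
def intersect(ranges):
    """Intersect cardinality ranges given as (min, max) pairs.

    max is None for "unbounded". Returns None if the intersection is empty.
    """
    low = max(r[0] for r in ranges)
    bounded = [r[1] for r in ranges if r[1] is not None]
    high = min(bounded) if bounded else None
    if high is not None and low > high:
        return None
    return low, high

PLUS, STAR, OPTIONAL = (1, None), (0, None), (0, 1)

print(intersect([PLUS, STAR, OPTIONAL]))  # (1, 1)
print(intersect([(2, None), OPTIONAL]))   # None: "at least two" vs. "at most one"
```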

In addition to constraints, we may attribute the origin of each arc with a role name. Take for example the asset influence. This asset has two arcs that connect it with jazzMusician – one in the role of the influenced musician and one in the role of the musician who has influenced the first.

Inheritance

You will also have noticed that we have represented the is_a relationships not as assets but merely as arcs, the origin of the arc being decorated with the role name is_a. This makes it easier to identify such inheritance and classification relationships. In contrast to normal arcs, inheritance arcs may only be attributed with the "?" modifier indicating optional inheritance.

Special Cases

The has relationship (if it does not result in a property) results in a simple arc, too, pointing from the asset that "has" to the asset which is "had". This is, for example, the case for the relationship between jazzMusician and collaboration. By using the noun form (collaboration) of "collaborates", the sentence

A jazz musician collaborates with other jazz musicians.

is interpreted as

A collaboration has jazz musicians.

a relationship which is modeled with a simple arc.

A further design decision is to downgrade "plays" and "results_in" in

An instrumentalist plays one or several instruments.

and

A project can result in an album.

We replace these relationships with the simple "has" relationship, too (i.e. "plays" and "can result" do not become assets). In order to retain the semantic information, we use "plays" and "result" as role names. Whether a verb should be downgraded in this way is not always easy to decide. If, for example, we had the relationship

An instrumentalist owns one or several instruments.

then we would probably model "owns" as a separate asset ownership because this asset could become the subject of a new business relation:

After four weeks, ownership is transferred to the pawnbroker.

Clusters

The asset review shows another interesting construct: a cluster. A cluster is used to denote alternatives, and is represented by a circle containing the choice operator. In our case the cluster says that a review relates either to an album or to a jazz musician. A cluster is a union of disjoint asset types. Clusters are possible for normal arcs and inheritance arcs.

Normalization

After we have obtained a first draft of our model, we should normalize it. Unlike relational technology, XML allows a physical data format that very closely follows the structures of the actual business data – there is no need to break complex information items into a multitude of "flat" tables. We shall find that an XML document can represent a conceptual entity almost unmodified.

This does not mean that no normalization is required. We still must make sure that our information model does not have redundancies, and that we end up with an implementation that is easy to maintain and consistently matches the "real world" relationships between information items. We make sure that:

  • Asset types are primitive, i.e. their properties do not contain assets that could be modeled as independent asset types. For example, the asset type album must not embed data from jazzMusician.

  • Asset types are minimal, i.e. they do not contain redundant properties, meaning none of their properties can be derived from other properties. For example, the asset type person must not contain a property age as this can be derived from birthDate.

  • Asset types must be complete, i.e. other assets contained in the real world scenario can be derived from the defined asset types. Our example is not complete, as we made no provision for solo albums. Our model contains only albums that are the result of a collaboration. Also, albums usually contain information about which musician played which instrument. This is not covered by our model.

  • Asset types must not be redundant, i.e. it must not be possible to derive any of the defined asset types from other asset types. In our example, we have a redundant asset. A band is a kind of project - the main difference is that it exists over a longer period of time and probably produces more albums. We could reflect this situation in our model by removing the properties name and period(from,to) from asset band, and by routing the is_a arc from band to project instead of collaboration.

  • All asset types must have a unique meaning.

  • Assets should have a key. Keys must be minimal, i.e. they must consist of the smallest set of properties that can uniquely identify an instance. In our example, not every asset has a key (for example, belongsTo, influence, collaboration and review don't have a key). We should introduce suitable keys for these assets. jazzMusician, instrumentalist, jazzSinger, jazzComposer, and critic do not need their own key, because they inherit one from person. If an asset type does not have suitable properties that can act as a key, we can easily equip it with some kind of unique property (for example by generating a UUID for each instance).
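Two of the rules above lend themselves to a short Python illustration: a derivable property such as age is computed from birthDate rather than stored, and an asset type without a natural key receives a generated one. The function names age_on and new_key are hypothetical, and the dates are arbitrary sample values.

```python
import uuid
from datetime import date

def age_on(birth_date, today):
    """Derive the age from birthDate instead of storing it as a property."""
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    return today.year - birth_date.year - (0 if had_birthday else 1)

def new_key():
    """Generate a surrogate key for an asset type without a natural key."""
    return str(uuid.uuid4())

print(age_on(date(2000, 6, 15), date(2020, 6, 14)))  # 19
print(age_on(date(2000, 6, 15), date(2020, 6, 15)))  # 20
```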

Partitioned Normal Form

While the steps discussed above already result in a pretty robust model, there is one more thing we can do. Assets finally result in XML elements or documents, and can thus be subject to transformations (for example, via an XSLT stylesheet). To make the keys robust against such transformations, we should make sure that each asset is in Partitioned Normal Form (PNF).

An asset type or property is in Partitioned Normal Form (PNF) if the atomic properties of an asset constitute a key of the asset and all non-atomic properties and sub-properties are in Partitioned Normal Form themselves.

Or, in other words: All complex structures in the model (assets and complex properties) must have atomic child nodes that can act as a key.

In our example, the following asset types are not in PNF:

  • person, because the key name(first,middle?,last) is a composite. A solution would be to introduce a personal ID. Here, we opt to introduce an atomic ID composed from last name, middle name and first name, such as MingusCharles.

  • jamSession, because the key performedAt(time, location) is a composite. Here we opt for a different solution. We resolve the property performedAt into two independent properties: time and location. These two properties are atomic and can thus constitute a multi-field primary key that conforms to PNF.

  • saxophone, because the composite property mouthpiece is not in PNF (it has no atomic property which could act as key). Here we should rather remove the property mouthpiece and use mouthpiece_body and reed directly as parts of saxophone (because mouthpiece is in fact not a single physical entity).

In particular, if we plan to store assets in relational databases, PNF is essential. Relational technology requires fragmenting complex structures into flat relational tables. Keys that span complex structures would be lost during such a transformation to First Normal Form (1NF).
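The PNF condition can be checked recursively. The following Python sketch assumes a simple representation invented for this illustration: every asset or composite property is a dictionary with a "key" (the names of its key properties) and "children" (a mapping from property name to None for an atomic property, or to a nested dictionary for a composite one). The choice of first and last as the key of name is likewise an assumption.

```python
def in_pnf(node):
    """Check Partitioned Normal Form for an asset or composite property."""
    children = node["children"]
    key = node["key"]
    # The key must be non-empty and consist of atomic child properties only.
    key_is_atomic = bool(key) and all(
        name in children and children[name] is None for name in key
    )
    # Every composite child must be in PNF itself.
    return key_is_atomic and all(
        in_pnf(child) for child in children.values() if child is not None
    )

# Composite property name(first,middle?,last); its key is an assumption.
name = {"key": ["first", "last"],
        "children": {"first": None, "middle": None, "last": None}}

person_before = {"key": ["name"],
                 "children": {"name": name, "birthDate": None}}
person_after = {"key": ["ID"],
                "children": {"ID": None, "name": name, "birthDate": None}}

print(in_pnf(person_before))  # False: the key is a composite property
print(in_pnf(person_after))   # True: the atomic ID acts as key
```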

Determining Business Objects

Business objects are assets that play a prominent role in our scenario. In order to be able to identify a business object, we must not only have an idea about the structure of the information, but also what it will be used for.

In our example, all jazzMusician asset types, style, all collaboration asset types, album, review, and critic could be business object classes. Jazz musicians are clearly the most important topic in our knowledge base, but similarly important are style and the various collaborations. album could play a separate role when we connect our knowledge base with a mail order system. And review is probably an external resource to which we have to link via URL.

On the other hand, we made the decision not to model instrument as a separate business object. We are only interested here in the instrument that a given musician plays; we do not plan to set up a knowledge base about musical instruments as such. Consequently, we incorporate instrument and its subtypes into the jazzMusician business object.

We then group the remaining assets around the assets designated as business objects. Here, we have shown this by demarcating each business object with a labeled box. We use a bold outline for the identifying asset of each business object.

graphics/jazz2.png

However, there is one constraint that we must enforce when constructing business objects from assets:

Important:
Starting from the identifying asset of a business object, we must be able to reach any asset belonging to that business object by following the arcs in the indicated direction.

This constraint allows us to interpret each business object as an aggregation, and later allows us to easily implement the business objects in hierarchical XML documents.

When we check this constraint for our model, we encounter two problems: From the assets belongsTo and influence, both arrows lead to asset jazzMusician. This is bad, because when starting at jazzMusician we cannot reach belongsTo and influence.

In order to solve these problems, we simply reverse one arc for each of the assets belongsTo and influence. This results in a slightly different interpretation; we are now saying:

A jazzMusician has a "belonging" to a style.

and

A jazzMusician has influence.

By doing so, we have completed the former syntactic reification of relationships with a semantic reification – "belonging" and "influence" have become true assets of jazzMusician.
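The reachability constraint and the effect of reversing an arc can be checked mechanically. The Python sketch below uses a simplified arc list that is only an assumption about the relevant fragment of the model (the full model is given in the figure); the function name unreachable is hypothetical.

```python
from collections import deque

def unreachable(identifying_asset, members, arcs):
    """Return the members that cannot be reached from the identifying asset.

    arcs is a list of (origin, target) pairs. Arcs are followed only in
    their indicated direction and only to members of the business object.
    """
    seen = {identifying_asset}
    queue = deque([identifying_asset])
    while queue:
        current = queue.popleft()
        for origin, target in arcs:
            if origin == current and target in members and target not in seen:
                seen.add(target)
                queue.append(target)
    return set(members) - seen

members = {"jazzMusician", "belongsTo", "influence"}

# Assumed arcs before the reversal: everything points to jazzMusician.
before = [
    ("belongsTo", "jazzMusician"), ("belongsTo", "style"),
    ("influence", "jazzMusician"), ("influence", "jazzMusician"),
]
# One arc reversed for each of belongsTo and influence.
after = [
    ("jazzMusician", "belongsTo"), ("belongsTo", "style"),
    ("jazzMusician", "influence"), ("influence", "jazzMusician"),
]

print(sorted(unreachable("jazzMusician", members, before)))  # ['belongsTo', 'influence']
print(sorted(unreachable("jazzMusician", members, after)))   # []
```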

Caution:
When we reverse an arc, any cardinality constraint of that arc becomes invalid. Therefore, we always decorate reversed arcs with an asterisk (*) to indicate that there are no cardinality constraints for that arc. By doing so, however, we may lose some structural information.

In addition, we have taken the opportunity to fix some problems with keys. collaboration and review definitely need keys, because they are identifying assets of business objects. The identifying asset of a business object must always have a key, because otherwise instances of business object classes could become inaccessible. We have equipped collaboration with a new property, namely ID, which we use as a key. The reason is that the property name may not be unique.

Resolving is_a Relations

To prepare the model for implementation with XML, we resolve all is_a relations. Because DTDs and XML Schema do not really support inheritance, we have to find solutions for the various is_a relations. (DTDs do not have an inheritance mechanism at all; XML Schema cannot handle multiple inheritance.) We have the following options:

  • Separate implementation of parent and child. For example, we could implement separate person documents which would constitute a generic person database. jazzMusician and critic instances would have to refer to these person instances.

  • Inclusion of parent properties in the child class. For example, we could include the properties of person into the asset types jazzMusician and critic.

  • Inclusion of child properties in the parent class. The child type would be stored in an extra property in the parent instance. For example, we could represent instrumentalist, jazzSinger, and jazzComposer in a generic document type jazzMusician and indicate the type of musician in a special property. However, we would suffer some information loss: because jazz singers and composers do not necessarily play an instrument, we would have to use the *-cardinality for the connection to instrument, and not the +-cardinality. The constraint that an instrumentalist must play at least one instrument would be lost. We would have to represent this through an extra constraint that depends on the type of musician. We shall later see how to formulate this sort of constraint.

  • As a further possibility, the is_a relations could be implemented similarly to an aggregation that would, for example, allow a jazz musician to be a composer, a singer, an instrumentalist or any combination of these. The problem of cardinality (* or +) would not appear if the instruments are only allowed in the context of the instrumentalist. This would be possible using the xs:all element of XML Schema, whereby jazzMusician could be an element whose schema definition contains an xs:all element that in turn contains elements Instrumentalist, Composer and Singer.

After applying these operations, our model could look like this:

graphics/jazz3.png

Here we have resolved the generic instrument asset into single instrument types such as saxophone, guitar, trombone, etc. The different instruments are just too different to be represented in one generic type. The consequence is that the asset type instrumentalist has a connection to all of these types. This is done with a cluster, a construct already discussed above.

Structurally, our conceptual model is now complete. In later chapters we discuss how additional constraints and operations can be defined. But before we do so, we discuss how to derive XML schemas from the conceptual model. We give a short introduction to XML Schema in the next section (Introduction to XML Schema).

Reverse Engineering of Relational Schemas

In some cases it is necessary to reengineer existing relational schemas. This is especially the case if we plan to convert existing relational data into XML or to map relational structures onto XML structures. If the original conceptual model is not available, we should try to reconstruct such a model from the relational schemas. This usually results in XML data structures of higher quality than the naive approach of mapping relational data directly onto XML.

Transforming relational schemas into an asset-oriented model is almost trivial:

  • Each table is mapped to an asset.

  • Each table column (except foreign keys) is mapped to an asset property.

  • Each corresponding foreign/primary key pair is represented as an arc pointing from the owner of the foreign key to the owner of the primary key.
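The three rules can be sketched as a short Python function. The input format, the function name tables_to_model, and the sample schema (including the column name customerNo) are assumptions made for this illustration and are not taken from the figures.

```python
def tables_to_model(tables):
    """Map relational tables to assets and arcs.

    tables maps a table name to a dict with "columns" (a list of column
    names) and "foreign_keys" (a mapping from column to referenced table).
    """
    assets, arcs = {}, []
    for table, spec in tables.items():
        foreign_keys = spec.get("foreign_keys", {})
        # Rules 1 and 2: one asset per table, one property per non-FK column.
        assets[table] = [c for c in spec["columns"] if c not in foreign_keys]
        # Rule 3: an arc from the owner of the foreign key to the owner
        # of the primary key.
        for referenced in foreign_keys.values():
            arcs.append((table, referenced))
    return assets, arcs

# Hypothetical two-table schema.
schema = {
    "Customer": {"columns": ["customerNo", "lastName"]},
    "Order": {"columns": ["orderNo", "customerNo"],
              "foreign_keys": {"customerNo": "Customer"}},
}

assets, arcs = tables_to_model(schema)
print(assets)  # {'Customer': ['customerNo', 'lastName'], 'Order': ['orderNo']}
print(arcs)    # [('Order', 'Customer')]
```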

The following is a classic example of a relational schema:

graphics/reverse.png

By applying the above rules, we arrive at the following asset-oriented model:

graphics/reverse2.png

Note that we cheated a bit here. We regrouped the three columns lastName, middleName, and firstName into a complex property called name. Relational schemas flatten complex data structures such as name (to achieve First Normal Form) and thereby lose structural information. Regrouping of such columns, however, needs an understanding of the semantics of the model and cannot be prescribed by simple rules.

In the next step, we determine our business objects and group the assets around them, as discussed above in the section Determining Business Objects.

graphics/reverse3.png

We have determined three business objects or business documents: Customer, Order and Product, and have grouped Item together with Order. Because the asset Order is the identifying asset of the business object Order, we have reversed the arc between Order and Item to indicate that Item belongs to Order. Because we reversed this arc, we decorated it with an asterisk (see the section Determining Business Objects above). This makes sense: an order can have several items.

Using this technique, we finally arrive at well-structured XML documents representing not flat tables but complex business objects. One interesting detail is that with an implementation in XML the property position of asset Item becomes redundant. Sequences of XML elements are well ordered; rows in a relational table, in contrast, are not, and therefore require an attribute such as position in order to establish an ordered sequence. And, of course, orderNo in asset Item also becomes redundant because it is already contained in asset Order.

graphics/reverse4.png

Models and Namespaces

So far, we have not discussed how models are identified. A simple model name would not be a good choice because it probably would not be unique within a global context. A better idea is to use a URI as model identification, for example, a URI based on a domain name. In our case we could choose http://www.softwareag.com/tamino/doc/examples/models/jazz/encyclopedia for the jazz model, and http://www.softwareag.com/tamino/doc/examples/models/order/reengineered for the order model. This technique allows us to identify models uniquely. At the same time, such an identifier defines a default namespace for the respective model. Because asset names are unique within a model, the combination of namespace and local name identifies each asset uniquely within a global context.

Let us assume that we want to separate our jazz encyclopedia model into two models: one for the core jazz encyclopedia, and another model in which we define musical instruments. The idea behind this is that we could reuse the model for musical instruments in other contexts, too, for example in a knowledge base about classical music. We would establish a model for musical instruments under a separate namespace, for example, under http://www.softwareag.com/tamino/doc/examples/models/instruments. Our original jazz encyclopedia model could be reconstructed by merging the now instrument-less jazz encyclopedia model with the new instrument model.

And this is where the concept of namespaces really becomes important: when we start to merge models. Let us assume that we want to create a new model for a record shop where we want to sell jazz CDs on the Web. Instead of defining everything from scratch, we import the jazz encyclopedia model (which already imports the instrument model) and the order model. We then create a new asset, namely CD, which inherits its properties from both the album asset in the jazz model and the product asset in the order model.

graphics/jazz4.png

Model name: Record Shop
   
Namespaces: http://www.softwareag.com/tamino/doc/examples/models/jazz/shop
  e=http://www.softwareag.com/tamino/doc/examples/models/jazz/encyclopedia
  i=http://www.softwareag.com/tamino/doc/examples/models/instruments
  o=http://www.softwareag.com/tamino/doc/examples/models/order/reengineered

What happened here? The new model was defined with the default namespace http://www.softwareag.com/tamino/doc/examples/models/jazz/shop, which also identifies it uniquely. In addition, the model declares three namespace prefixes. The prefix "e" is assigned to the namespace of the jazz model, the prefix "i" is assigned to the namespace of the instrument model, and the prefix "o" is assigned to the namespace of the order model. All names of the assets imported from the jazz model are prefixed with "e:", all names of the assets imported from the instrument model are prefixed with "i:", and all names of the assets imported from the order model are prefixed with "o:".
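How a prefixed asset name resolves to a globally unique name can be sketched in Python. The function name qualified_name is hypothetical, and the {namespace}local output form is simply one convenient way to write the combination of namespace and local name; the namespace URIs are those of the Record Shop model.

```python
def qualified_name(name, prefixes, default_namespace):
    """Expand a possibly prefixed asset name to {namespace}local form."""
    prefix, sep, local = name.partition(":")
    if not sep:
        # No prefix: the asset belongs to the model's default namespace.
        return f"{{{default_namespace}}}{name}"
    return f"{{{prefixes[prefix]}}}{local}"

BASE = "http://www.softwareag.com/tamino/doc/examples/models"
SHOP = BASE + "/jazz/shop"
PREFIXES = {
    "e": BASE + "/jazz/encyclopedia",
    "i": BASE + "/instruments",
    "o": BASE + "/order/reengineered",
}

print(qualified_name("e:album", PREFIXES, SHOP))
print(qualified_name("CD", PREFIXES, SHOP))
```

Two assets with the same local name in different models therefore expand to different qualified names and cannot clash.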

What remains to do is to resolve the is_a relationships for the asset CD. This results in the following asset definition:

graphics/CD.png

Here, we have inherited properties across namespaces. The properties from both e:album and o:Product are incorporated into asset CD and belong now to the namespace of the record shop model.

What if we have name clashes between the inherited properties? Well, this problem is not specific to namespaces but can also occur during multiple inheritance in a single namespace. The conflict is resolved by combining the conflicting properties by intersection. If there are incompatible property definitions, the intersection is empty, and the property is discarded. Of course, it is always possible to override inherited properties locally.