Clusters with Sites

Although our recommended approach to deploying a cluster is a minimum of three locations and an odd number of nodes across the cluster, not all organizations have three physical locations with the required hardware. In terms of BCP (Business Continuity Planning), or DR (Disaster Recovery), organizations may follow a standard approach with just two locations: a primary and backup:

Two-realm cluster over two sites, using Universal Messaging Clusters with Sites.

With only two physical sites available, the quorum rule of 51% or more of cluster nodes being available is not reliably achievable, either with an odd or even number of realms split across these sites. For example, if you deploy a two-realm cluster, and locate one realm in each available location, then as soon as either location is lost, the entire cluster cannot function because of the 51% quorum rule:

Two-realm cluster over two locations: a 51% quorum is unachievable if one location/realm fails.

Similarly, if you deployed a three-node cluster with one realm in Location 1 and two in Location 2, and then lost access to Location 1, the cluster would still be available; if, however, you lost Location 2, the cluster would not be available since only 33% of the cluster's realms would be available.

This problem has been addressed through the introduction of Universal Messaging Clusters with Sites. The basic concept of Sites is that if only two physical locations are available, and Universal Messaging Clustering is used to provide High Availability and Disaster Recovery, it should be possible for one of those sites to function with less than 51% of the cluster available. This is achieved by allowing an additional vote to be allocated to either of the sites in order to achieve cluster quorum.

If, for example, you have two sites and each site contains just one realm, making a total of 2 realms in the cluster, the additional vote for one of the sites raises the availability from "one out of two" to "two out of three", i.e. 67%, so the required quorum of 51% is achieved.

As a further example, if you have two sites and each site contains two realms, making a total of 4 realms in the cluster, the additional vote for one of the sites raises the availability from "two out of four" to "three out of five", i.e. 60%, so the required quorum of 51% is achieved.

In a cluster with a production site and a disaster recovery site, you can make either site the prime site, but we recommend you to make the production site the prime site due to the following considerations (assuming a setup with equal numbers of realms on each site):

Prime site	Effect
production site	If the disaster recovery site fails, the production site continues to provide 50% of the available realms in the cluster. Since the production site is the prime site, it gets an additional vote, so the cluster can continue to run. If the production site fails, the disaster recovery site cannot take over automatically, since it cannot achieve the quorum of 51%. This situation requires manual intervention to set the disaster recovery site to be the prime site, so that the cluster can be restarted on the disaster recovery machine.
disaster recovery site	The idea behind setting the disaster recovery site to be the prime site is that if the production site fails, the disaster recovery site can achieve quorum and immediately take over from the production site. This may appear at first to be a good setup, but has a big disadvantage: If the production site is not the prime site, a failure on the disaster recovery site would cause the production site to halt, since the production site by itself cannot achieve a quorum of 51%. This is clearly not what a disaster recovery setup is intended for - a failure in the recovery machine shouldn't halt the production machine.

The general rule regarding the effect of sites when forming a cluster is as follows: If exactly 50% of all realms in the cluster are online, and if all prime site realms are contactable, a cluster can be formed. In all other cases, a cluster can only form if at least 51% of the cluster's realms are contactable. All other quorum and voting rules are identical, with or without sites.

Within the Universal Messaging Admin API, and specifically within a cluster node, you can define individual Site objects and allocate each realm within the cluster to one of these physical sites. Each defined site contains a list of its members, and a flag to indicate whether the site as a whole can cast an additional vote. This flag is known as the isPrime flag. When the isPrime flag is activated for a site, the site becomes the prime site.

Consider an example scenario where we have a cluster across two physical locations: the default production site and a disaster recovery site, and each site has one realm. Without using sites, this configuration wouldn't be able to satisfy the 51% quorum rule in the event of the loss of one location/realm. The same technique can be used for sites with as many realms as required.

In a disaster recovery situation, where the production site is lost, the disaster recovery site will achieve quorum with only one of the two nodes available because the isPrime flag provides an additional vote for the disaster recovery site.