Terracotta 10.5 | Terracotta Server Administration Guide | Failover Tuning
 
Failover Tuning
Overview
In a clustered environment, network, hardware, or other failures can cause an active server to get disconnected from the rest of the servers in its stripe. When your cluster needs to remain tolerant to such failures, you have a choice to make: choose either consistency or availability but not both (CAP theorem). If consistency is chosen over availability, then the cluster will halt processing client requests as consistent reads/writes can't be guaranteed when the cluster is partitioned. But when availability is chosen over consistency, the cluster will respond to client requests even when the cluster is partitioned but the response is not guaranteed to be consistent. In the absence of such failures, the cluster can provide both consistency and availability.
If the cluster is tuned to favor availability over consistency, then when such failures happen, the remaining passive servers in a stripe run an election and, if they are unable to find the old active server, the passive server that wins the election becomes the new active server. While this configuration ensures high availability of the data, the risk of experiencing a so-called split-brain situation during such elections is increased. In the case of a TSA, split-brain is a situation in which multiple servers in a stripe are acting as active servers. For example, if an active server gets partitioned from its peers in that stripe, the active server will remain active and the passive servers on the other side of the partition will elect a new active server as well. Any further operations performed on the data are likely to result in inconsistencies.
When tuned for consistency, a stripe would need at least a majority of servers connected with each other to elect an active server. Thus, even if the stripe gets partitioned into two sets of servers due to some network failure, the set with the majority of servers will elect an active server among them and proceed. In the absence of a majority, an active server will not be elected and hence the clients will be prevented from performing any operations, thereby preserving data consistency by sacrificing availability.
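The majority rule above is simple arithmetic: an active server can only be elected by a set of servers that is strictly larger than half the stripe. The following sketch is purely illustrative (it is not part of the Terracotta API) and shows how a five-server stripe split 3/2 by a partition behaves:

```java
public class MajorityQuorum {
    // Smallest number of connected servers needed to elect an active server:
    // strictly more than half of the stripe.
    static int majority(int stripeSize) {
        return stripeSize / 2 + 1;
    }

    // True if the servers on one side of a partition can elect an active server.
    static boolean canElectActive(int serversOnThisSide, int stripeSize) {
        return serversOnThisSide >= majority(stripeSize);
    }

    public static void main(String[] args) {
        // A five-server stripe partitioned into sets of 3 and 2:
        System.out.println(canElectActive(3, 5)); // true: majority side elects an active and proceeds
        System.out.println(canElectActive(2, 5)); // false: minority side stalls client operations
    }
}
```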
Server configuration
When configuring the stripe, the user needs to choose between availability and consistency as the failover priority of the stripe. To prevent split-brain scenarios and thereby preserve data consistency, failover priority must be set to consistency. However, if availability is preferred, failover-priority can be set to availability at the risk of running into split-brain scenarios.
The following xml snippet shows how to configure a stripe for consistency:
<tc-config>
...
<servers>
...
</servers>

<failover-priority> <!-- 1 -->
<consistency/> <!-- 2 -->
</failover-priority>

</tc-config>
1. Failover priority is tuned to favor…
2. …consistency over availability for this stripe.
Similarly, the stripe can be tuned for availability as follows:
<failover-priority>
<availability/>
</failover-priority>
Note: Even though the availability vs. consistency configuration is done at the stripe level, it must be consistent across all the stripes in the cluster.
Warning: This is a mandatory configuration and if not provided in the tc-config, the server will refuse to start.
Choosing Consistency versus Availability
A Terracotta Server Array (TSA), being a distributed system, is subject to the constraints of the CAP Theorem. The CAP Theorem states that it is impossible for a distributed system to simultaneously provide guarantees for Consistency, Availability, and Partition tolerance. A TSA always seeks to be tolerant of network partitions so a choice must be made between data consistency and service availability.
For a TSA supporting only Ehcache users, choosing availability over consistency can be an effective and proper choice. If a network partition occurs leaving two servers unable to communicate with each other (to coordinate data) but each able to serve clients, choosing availability allows continuing operations using both servers running independently. Data residing on the separated servers will no longer be coordinated (consistency is abandoned) and will drift apart. Once the network partition is resolved and the servers are able to communicate with each other again, consistency rules are re-applied and one server is chosen as the holder of the current data -  the data on the other servers is discarded. For a caching application, this may result in some cache misses and delays in some processing but, since a cache is not the system of record for the data being processed, no data is lost.
When using a TSA configured for availability over consistency to support TCStore users, data loss is a real possibility - if a network partition occurs and clients are actively updating TSA servers on both sides of the partition, data loss will occur once the network partition is resolved. If the applications using TCStore are tolerant of missing/inconsistent data - not an easy task - then configuring for availability over consistency is appropriate for the TSA. However, if a TCStore dataset is used as the system of record, this data loss or other inconsistency is generally undesirable if not catastrophic. If the TSA is used for a TCStore dataset which is either a system of record or for which loss/inconsistency is an undesirable outcome, then the TSA must be configured for consistency over availability.
You may use a single TSA supporting both Ehcache and TCStore but the choice of availability versus consistency is a cluster-level configuration - both Ehcache and TCStore users in a TSA are subject to that configuration. If your Ehcache users need high availability and your TCStore users need data consistency, you must use a separate TSA for each user base.
External voter for two-server stripes
Mandating a majority for active server election introduces additional availability issues in certain topologies. For example, in a two-server stripe the majority quorum is also two. This means that if the servers get disconnected from each other due to a network partition or a server failure, the surviving server will not promote itself to active, since it needs two votes to win the election but the other voting server is unreachable. In the absence of an active server, the stripe is unavailable.
Adding a third server is the best option, so that even if one fails, there is a majority (2 out of 3) surviving to elect an active. A three-server stripe can provide data redundancy and high availability at the same time even when one server fails. If adding a third server is not feasible, the alternate option is to get high availability without risking data consistency (via split-brain scenarios) using an external voter. But this configuration cannot offer data redundancy (like a three-server stripe) if a server fails.
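The trade-off between adding a third server and adding an external voter comes down to how many lost votes the stripe can tolerate. The sketch below is illustrative arithmetic only (not a product API): voting members are the stripe's servers plus any configured external voters, and the tolerance is the number of votes that can be lost while a majority remains reachable.

```java
public class FailoverTolerance {
    // Majority of the combined voting membership (servers + external voters).
    static int majority(int votingMembers) {
        return votingMembers / 2 + 1;
    }

    // How many voting members can become unreachable while an active
    // server can still be elected by the remainder.
    static int tolerableFailures(int servers, int externalVoters) {
        int total = servers + externalVoters;
        return total - majority(total);
    }

    public static void main(String[] args) {
        System.out.println(tolerableFailures(2, 0)); // 0: a two-server stripe stalls on any single failure
        System.out.println(tolerableFailures(3, 0)); // 1: a three-server stripe survives one failure
        System.out.println(tolerableFailures(2, 1)); // 1: two servers plus one voter also survive one
    }
}
```

Note that while a two-server stripe plus one voter matches a three-server stripe in vote tolerance, the voter holds no data, so only the three-server stripe also preserves data redundancy after a server failure.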
An external voter is a client that is allowed to cast a vote in the election of a new active server, in cases where a majority of servers in a stripe are unable to reach a consensus on electing a new active server.
External voter configuration
The number of external voters needs to be described in the server configuration. It is recommended that the total number of servers and external voters be kept as an odd number.
External voters need to get registered with the servers to get added as voting members in their elections. If there are n voters configured in the server, then the first n voting clients requesting to get registered will be added as voters. Registration requests of other clients will be declined and put on hold until one of the registered voters gets de-registered.
Voters can de-register themselves from the cluster so that the voting rights can be transferred to other clients waiting to get registered, if there are any. A voting client can de-register itself by using APIs or by getting disconnected from the cluster.
When a voting client gets disconnected from the server, it will automatically get de-registered by the server. When the client reconnects, it will only get registered again as a voter if another voter has not taken its place while this client was disconnected.
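The registration lifecycle described above is first-come, first-served: the first n clients become voters, later ones wait, and a freed slot is handed to the next waiting client. The following sketch illustrates that policy; the class and method names are hypothetical and do not reflect the server's actual implementation.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashSet;
import java.util.Queue;
import java.util.Set;

// Illustrative model of the first-come, first-served voter registration
// policy; not the Terracotta server's real code.
public class VoterRegistry {
    private final int voterLimit;                        // the configured voter count
    private final Set<String> registered = new LinkedHashSet<>();
    private final Queue<String> waiting = new ArrayDeque<>();

    VoterRegistry(int voterLimit) {
        this.voterLimit = voterLimit;
    }

    // The first voterLimit clients become voters; later ones are put on hold.
    boolean register(String clientId) {
        if (registered.size() < voterLimit) {
            registered.add(clientId);
            return true;           // accepted as a voting member
        }
        waiting.add(clientId);
        return false;              // declined for now, kept waiting
    }

    // Called on explicit de-registration or when a voter disconnects;
    // the next waiting client, if any, takes over the freed slot.
    void deregister(String clientId) {
        if (registered.remove(clientId) && !waiting.isEmpty()) {
            registered.add(waiting.poll());
        }
    }

    boolean isVoter(String clientId) {
        return registered.contains(clientId);
    }
}
```

A reconnecting client simply calls register again: it regains voter status only if a slot is still free, which mirrors the behavior described above.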
Server configuration
A maximum count for the number of external voters allowed can optionally be added to the failover-priority configuration if the stripe is tuned for consistency, as follows:
<failover-priority>
<consistency>
<voter count="3"/> <!-- 1 -->
</consistency>
</failover-priority>
1. Here you are restricting the total number of voting clients to three.
The failover priority setting and the specified maximum number of external voters across the stripes must be consistent and will be validated during the cluster configuration step. For more information on how to configure a cluster, see the section Cluster Tool.
Client configuration
External voters can be of two variants:
1. Standalone voter
2. Clients using the voter library (client voter)
Standalone voter
An external voter can be run as a standalone process using a script provided with the kit. The script takes the tc-config files of the stripes in the cluster as arguments, one -f option per stripe. A variant that takes <host>:<port> combinations instead of the server configuration files is also supported: each -s option argument must be a comma-separated list of the <host>:<port> combinations of the servers in a single stripe. To register with a multi-stripe cluster, provide one -f or -s option for each stripe.
Usage:
start-tc-voter.(sh|bat) -f TC-CONFIG [-f TC-CONFIG]...
or
start-tc-voter.(sh|bat) -s HOST:PORT[,HOST:PORT]... [-s HOST:PORT[,HOST:PORT]...]...
To connect the voter to a secure cluster, the path to the security root directory will also have to be provided using the -srd option. For more details on setting up security in a Terracotta cluster, see SSL / TLS Security Configuration in Terracotta.
Client voter
Any TCStore or Ehcache client can act as an external voter as well by using a voter library distributed with the kit. A client can join the cluster as a voter by creating a TCVoter instance and registering itself with the cluster.
Note: A cluster must be configured using the cluster tool before a client voter can be registered with it.
When the voter is no longer required, it can be de-registered from the cluster either by disconnecting that client, or by using the deregister API.
TCVoter voter = new EnterpriseTCVoterImpl(); // 1
voter.register("my-cluster-0", // 2
    "<host>:<port>", "<host>:<port>"); // 3
...
voter.deregister("my-cluster-0"); // 4
1. Instantiate a TCVoter instance.
2. Register the voter with a cluster by providing a cluster name…
3. …and the host:port combinations of all servers in the cluster.
4. De-register from the cluster using the same cluster name that was used to register it.
To connect to a secure cluster, the voter must be instantiated using the overloaded constructor of EnterpriseTCVoterImpl that takes in the security root directory path.
Manual promotion with override voter
Since an external voter is just another process, there is no guarantee that it will always be up and available. This is especially true of client voters: the moment the client leaves, the external voter leaves too. In the rare event that a failure occurs (a partition splitting the active and passive servers, or the active server crashing) while the external voter is also absent, none of the surviving servers will act as the active server. The servers will be stuck in a suspended state in which operations from the regular clients are all stalled. Manual intervention is required to get the cluster out of this state, either by fixing the cause of the partition or by restarting the crashed server. If neither is feasible, the third option is to promote a server manually using an override vote from an external voter.
The voter process can be started in an override mode to promote a single server stuck in that intermediate state to be an active server. When the voter process is started in this special mode, it will connect to the server that you want to promote, give it an override vote and exit. The voter process can be started in override mode as follows:
start-tc-voter.(sh|bat) -o HOST:PORT
Running this command will forcibly promote the server at HOST:PORT to be an active server, if it is stuck in that intermediate state.
Note: This override voting will work even if external voters are not configured in the server configuration.
Warning: Never start two separate override voters, one on each side of a partition, as both sides could then win their elections and cause a split-brain.
Server startup
When the failover priority of the stripes is tuned for consistency, this also has an impact on server startup. In a multi-server stripe, when the servers are started up fresh, a server will not get elected as an active server until it gets votes from all of its peers, which requires all the servers of that stripe to be brought up. Bringing up regular voters will not help here, since they need to communicate with all the active servers in the cluster to get registered. If bringing up the other servers is not feasible for some reason, an override voter can be used to forcibly promote the started server.

Copyright © 2010-2019 | Software AG, Darmstadt, Germany and/or Software AG USA, Inc., Reston, VA, USA, and/or its subsidiaries and/or its affiliates and/or their licensors.