Failover Tuning
Overview
A Terracotta Server Array (TSA), being a distributed system, is subject to the constraints of the CAP Theorem. The CAP Theorem states that it is impossible for a distributed system to simultaneously provide guarantees for consistency, availability, and partition tolerance. A TSA always seeks to be tolerant of network partitions so a choice must be made between data consistency and service availability. This choice is declared with the failover-priority setting which instructs the cluster to favor either consistency or availability. When a network or hardware failure occurs resulting in disconnected servers, the cluster's behavior is governed by this setting.
If a cluster is defined to favor consistency, then failures resulting in disconnected servers will result in all client requests being halted. This is due to an inability to guarantee consistent reads and writes when the cluster is partitioned.
If a cluster is defined to favor availability, then failures resulting in disconnected servers will still permit the cluster to respond to client requests. However, client responses are not guaranteed to be consistent because there can potentially be multiple active servers working on the same data set.
In the absence of any error-induced cluster partitioning, the cluster will provide both consistency and availability.
Elections and Split-Brains
When a failure disconnects servers within a cluster configured to favor availability over consistency, each set of partitioned servers performs an election. Servers that can communicate with each other elect an active server among themselves and continue operations. If the failure isolates a single server, that server elects itself as active. Favoring availability over consistency therefore carries the risk of a so-called split-brain situation.
For a TSA, a split-brain occurs when multiple servers within a stripe are simultaneously acting as active servers. For example, if an active server becomes partitioned from its peers in the stripe, the active server will remain active and the passive servers on the other side of the partition, unable to reach the active server, will elect a new active server from their own isolated group. Any further operations performed on the data are likely to result in inconsistencies because there can be multiple active servers working on the same data set.
When configured for consistency, a stripe requires a majority of servers connected with each other to elect an active server. Thus, even if the stripe gets partitioned into two sets of servers due to some network failure, the set with the majority of servers will elect an active server among them and proceed. In the absence of a majority, an active server will not be elected and hence the clients will be prevented from performing any operations, thereby preserving data consistency by sacrificing availability.
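The election rules described above can be summarized in a small model. This is a minimal sketch, not Terracotta code: the FailoverPriority enum and the ElectionSketch class are hypothetical names introduced only for illustration, and the model ignores external voters.

```java
import java.util.List;

// Illustrative model only; these types are not part of the Terracotta API.
enum FailoverPriority { AVAILABILITY, CONSISTENCY }

final class ElectionSketch {
    // Returns true if a group of mutually reachable servers may elect an active server.
    static boolean canElectActive(FailoverPriority priority, int reachableServers, int stripeSize) {
        if (reachableServers <= 0) {
            return false;
        }
        if (priority == FailoverPriority.AVAILABILITY) {
            return true; // any isolated group, even a single server, elects its own active
        }
        return 2 * reachableServers > stripeSize; // consistency: strict majority of the stripe
    }

    // Counts how many partitions of one stripe end up with an active server.
    static long activeCount(FailoverPriority priority, List<Integer> partitionSizes, int stripeSize) {
        return partitionSizes.stream()
                .filter(size -> canElectActive(priority, size, stripeSize))
                .count();
    }

    public static void main(String[] args) {
        List<Integer> partitions = List.of(2, 1); // a three-server stripe split 2 + 1
        System.out.println(activeCount(FailoverPriority.AVAILABILITY, partitions, 3)); // prints 2
        System.out.println(activeCount(FailoverPriority.CONSISTENCY, partitions, 3));  // prints 1
    }
}
```

Two active servers in the availability case is exactly the split-brain condition. In the consistency case, at most one partition can hold a strict majority, so at most one active server exists.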
Server configuration
When configuring the cluster, you must choose either availability or consistency as the failover priority of the cluster. To prevent split-brain scenarios and thereby preserve data consistency, failover-priority must be set to consistency. However, if availability is preferred, failover-priority can be set to availability at the risk of running into split-brain scenarios. The following snippet shows how to configure a stripe for consistency:
Configured via config file:
failover-priority=consistency
Configured via command line during server startup (use -failover-priority option):
start-tc-server.sh -failover-priority consistency
Similarly, the cluster can be tuned for availability as follows:
failover-priority=availability
Note:
failover-priority is a mandatory parameter that must be provided during server startup if the cluster configuration contains more than one node.
Choosing Consistency versus Availability
For a TSA supporting only Ehcache users, choosing availability over consistency can be an effective and proper choice. If a network partition occurs leaving two servers unable to communicate with each other (to coordinate data) but each able to serve clients, choosing availability allows continuing operations using both servers running independently. Data residing on the separated servers will no longer be coordinated (consistency is abandoned) and will drift apart. Once the network partition is resolved and the servers are able to communicate with each other again, consistency rules are re-applied and one server is chosen as the holder of the current data - the data on the other servers is discarded. For a caching application, this may result in some cache misses and delays in some processing but, since a cache is not the system of record for the data being processed, no data is lost.
When using a TSA configured for availability over consistency to support TCStore users, data loss is a real possibility - if a network partition occurs and clients are actively updating TSA servers on both sides of the partition, data loss will occur once the network partition is resolved. If the applications using TCStore are tolerant of missing/inconsistent data - not an easy task - then configuring for availability over consistency is appropriate for the TSA. However, if a TCStore dataset is used as the system of record, this data loss or other inconsistency is generally undesirable if not catastrophic. If the TSA is used for a TCStore dataset which is either a system of record or for which loss/inconsistency is an undesirable outcome, then the TSA must be configured for consistency over availability.
You may use a single TSA supporting both Ehcache and TCStore but the choice of availability versus consistency is a cluster-level configuration - both Ehcache and TCStore users in a TSA are subject to that configuration. If your Ehcache users need high availability and your TCStore users need data consistency, you must use a separate TSA for each user base.
External voter for two-server stripes
In certain topologies, requiring a majority of votes to elect an active server can introduce availability issues. For example, in a two-server stripe the majority quorum is two votes. If these servers become disconnected from each other because of a network partition or a server failure, the surviving server cannot promote itself to active, because it needs two votes to win the election. Since the other server is unavailable or unreachable, the missing second vote renders the stripe unavailable.
Adding a third server is the best option. A three-server stripe can provide data redundancy and high availability at the same time even when one server fails. In this topology, if one server fails, there is still a majority of surviving servers (2 out of 3) to elect an active server. If adding a third server is not feasible, an option to get high availability without risking data consistency (via split-brain scenarios) is to use an external voter. This configuration cannot offer data redundancy (like a three-server stripe) if a server fails.
An external voter is a client that is able to cast a vote in an election for a new active server in cases where a majority of servers in a stripe are unable to reach a consensus on electing a new active server.
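The effect of an external voter on a two-server stripe can be checked with simple vote arithmetic. The sketch below assumes that the configured external voters count as voting members alongside the servers and that a strict majority of all voting members is needed. VoterQuorumSketch is a hypothetical class, not Terracotta's election code.

```java
// Illustrative arithmetic only; not Terracotta's internal election code.
final class VoterQuorumSketch {
    // Votes needed to win: a strict majority of all voting members
    // (servers plus configured external voters). This is an assumption of the sketch.
    static int votesNeeded(int servers, int externalVoters) {
        return (servers + externalVoters) / 2 + 1;
    }

    // True if the reachable servers and voters together hold enough votes.
    static boolean winsElection(int reachableServers, int reachableVoters,
                                int servers, int externalVoters) {
        return reachableServers + reachableVoters >= votesNeeded(servers, externalVoters);
    }

    public static void main(String[] args) {
        // Two-server stripe, no voter: the survivor holds 1 of the 2 votes it needs.
        System.out.println(winsElection(1, 0, 2, 0)); // prints false
        // Two-server stripe with one external voter: survivor + voter = 2 of 3 votes.
        System.out.println(winsElection(1, 1, 2, 1)); // prints true
    }
}
```

The same arithmetic shows why a three-server stripe tolerates one failure: two surviving servers hold 2 of 3 votes without any voter.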
External voter configuration
The number of external voters must be defined in the server configuration. It is recommended that the total number of servers and external voters be kept as an odd number.
External voters must be registered with the servers in order to get added as voting members in their elections. If there are n voters configured in the server, then the first n voting clients requesting registration will be added as voters. Registration requests of other clients will be declined and put on hold until one of the registered voters becomes deregistered.
Voters can deregister themselves from the cluster so that their voting rights can be transferred to other clients waiting to be registered, if there are any. A voting client can deregister itself by using the deregister API or by disconnecting from the cluster.
When a voting client gets disconnected from the server it will automatically get deregistered by the server. When the client reconnects, it will only get re-registered as a voter if another voter has not taken its place while this client was disconnected.
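The registration rules above can be pictured as a bounded set of voter slots with a waiting line. The following is a hypothetical model, not a Terracotta class. In particular, promoting the longest-waiting client when a slot frees up is an assumption of this sketch; the actual order in which waiting clients obtain a slot is not specified here.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical model of server-side voter bookkeeping; not a Terracotta class.
final class VoterRegistrySketch {
    private final int maxVoters;
    private final Set<String> registered = new LinkedHashSet<>();
    private final Deque<String> waiting = new ArrayDeque<>();

    VoterRegistrySketch(int maxVoters) {
        this.maxVoters = maxVoters;
    }

    // Returns true if the client became a voter, false if it was put on hold.
    boolean register(String clientId) {
        if (registered.contains(clientId)) {
            return true;
        }
        if (registered.size() < maxVoters) {
            waiting.remove(clientId);
            registered.add(clientId);
            return true;
        }
        if (!waiting.contains(clientId)) {
            waiting.addLast(clientId);
        }
        return false;
    }

    // Used both for an explicit deregister call and for a client disconnect.
    void deregister(String clientId) {
        waiting.remove(clientId);
        if (registered.remove(clientId) && !waiting.isEmpty()) {
            // Assumption: the longest-waiting client takes over the freed slot.
            registered.add(waiting.pollFirst());
        }
    }

    boolean isVoter(String clientId) {
        return registered.contains(clientId);
    }
}
```

With one configured voter, a client A that registers first becomes the voter and a client B is put on hold. If A disconnects, B takes the slot, and A is put on hold when it reconnects.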
Server configuration
A maximum count for the number of external voters allowed can optionally be added to the failover-priority configuration if the cluster is tuned for consistency, as follows:
failover-priority=consistency:3
1 | Here, the total number of voting clients is restricted to three. |
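To make the shape of the setting concrete, the sketch below parses the three forms shown in this section: availability, consistency, and consistency:<n>. FailoverPrioritySetting is a hypothetical helper written for illustration. It is not the parser used by the server, whose validation may differ.

```java
// Hypothetical parser for illustration; Terracotta's own validation may differ.
final class FailoverPrioritySetting {
    final boolean consistency; // true for consistency, false for availability
    final int maxVoters;       // maximum number of external voters, 0 if not given

    private FailoverPrioritySetting(boolean consistency, int maxVoters) {
        this.consistency = consistency;
        this.maxVoters = maxVoters;
    }

    static FailoverPrioritySetting parse(String value) {
        String[] parts = value.trim().split(":", -1);
        String mode = parts[0];
        if (mode.equals("availability") && parts.length == 1) {
            return new FailoverPrioritySetting(false, 0);
        }
        if (mode.equals("consistency") && parts.length <= 2) {
            // A voter count is only accepted together with consistency.
            int voters = parts.length == 2 ? Integer.parseInt(parts[1]) : 0;
            if (voters < 0) {
                throw new IllegalArgumentException("voter count must not be negative: " + value);
            }
            return new FailoverPrioritySetting(true, voters);
        }
        throw new IllegalArgumentException("unsupported failover-priority: " + value);
    }
}
```

In this sketch, parse("consistency:3") yields a consistency setting with a maximum of three voters, while parse("availability:3") is rejected because the voter count applies only to clusters tuned for consistency.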
The failover priority setting and the specified maximum number of external voters across the stripes must be consistent and will be validated during the cluster configuration step. For more information on how to activate a cluster, see the Settings section.
Client configuration
External voters can be of two variants:
1. Standalone voter
2. Clients using the voter library (client voter)
Standalone voter
An external voter can be run as a standalone process using the start-tc-voter script located in the tools/voter/bin/ folder under the product installation directory. The script takes the <hostname>:<port> addresses of the cluster's servers through its -connect-to option.
Each -connect-to option argument must be a comma-separated list of <hostname>:<port> combinations of servers within a single stripe. To register with a multi-stripe cluster, provide one -connect-to option per stripe.
Usage:
start-tc-voter.(sh|bat) -connect-to hostname:port[,hostname:port]... [-connect-to hostname:port[,hostname:port]...]...
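As an illustration of the usage line, the following sketch assembles the argument list for a two-stripe cluster, one -connect-to option per stripe. VoterCommandSketch is a hypothetical helper, and the host names and ports are placeholders rather than values taken from a real deployment.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that only builds the argument list; it does not start the voter.
final class VoterCommandSketch {
    // stripes: one inner list of "<hostname>:<port>" strings per stripe.
    static List<String> buildCommand(String script, List<List<String>> stripes) {
        List<String> command = new ArrayList<>();
        command.add(script);
        for (List<String> stripe : stripes) {
            command.add("-connect-to");
            command.add(String.join(",", stripe)); // servers of one stripe, comma-separated
        }
        return command;
    }

    public static void main(String[] args) {
        // Placeholder addresses for a cluster with two stripes of two servers each.
        List<String> command = buildCommand("start-tc-voter.sh", List.of(
                List.of("host1:9410", "host2:9410"),
                List.of("host3:9410", "host4:9410")));
        System.out.println(String.join(" ", command));
        // prints: start-tc-voter.sh -connect-to host1:9410,host2:9410 -connect-to host3:9410,host4:9410
    }
}
```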
To connect the voter to a secure cluster, the path to the security root directory must also be specified using the -security-dir option. For more details on setting up security in a Terracotta cluster, see SSL / TLS Security Configuration in Terracotta.
Client voter
Any TCStore or Ehcache client can act as an external voter as well by using a voter library distributed with the kit. A client can join the cluster as a voter by creating a TCVoter instance and registering itself with the cluster.
Note:
The cluster must be activated using the config tool before a client voter can be registered with it.
When the voter is no longer required, it can be deregistered from the cluster either by disconnecting that client, or by using the deregister API.
import com.terracottatech.voter.EnterpriseTCVoterImpl;
import org.terracotta.voter.TCVoter;
TCVoter voter = new EnterpriseTCVoterImpl(); //1
voter.register("my-cluster-0", //2
    "<hostname>:<port>", "<hostname>:<port>"); //3
...
voter.deregister("my-cluster-0"); //4
1 | Instantiate a TCVoter instance |
2 | Register the voter with a cluster by providing a cluster name … |
3 | and <hostname>:<port> combinations of all servers in the cluster. |
4 | Deregister from the cluster using the same cluster name that was used to register it. |
To connect to a secure cluster, the voter must be instantiated using the overloaded constructor of EnterpriseTCVoterImpl that accepts the security root directory path.
Manual promotion with override voter
Since an external voter is itself a running process, there is no guarantee that it will always be up and available. A client voter in particular lives only as long as its client: the moment the client leaves, the voter leaves with it.
In the rare event that a failure occurs (i.e. a partition splitting the active and passive servers or the active server crashing/stopping) with the external voter no longer being present, none of the surviving servers can automatically become an active server. The servers will be stuck in a suspended state whereby operations from regular clients will be stalled.
In this scenario, manual intervention is required to move the cluster out of this partitioned state by either (1) fixing the cause of the partition or (2) restarting the crashed server. If neither option is feasible, the remaining option is to manually promote a server to the active state by casting an override vote from an external voter.
The voter process can be started in override mode to promote a single server to become an active server, when that server is stuck in an intermediate state. When the voter process is started in this mode it will connect to the specified server to be promoted and give it an override vote. The voter process will then exit. The voter process is started in override mode as follows:
start-tc-voter.(sh|bat) -vote-for <hostname>:<port>
Running this command will forcibly promote the server at <hostname>:<port> to be an active server (if it is stuck in that intermediate state).
Note:
This override voting will work even if external voters are not configured in the server configuration.
Warning:
Take care not to start override voters on both sides of a partition. If each side receives an override vote, both sides will promote an active server, resulting in a split-brain.
Server startup
Tuning the failover priority of the stripes for consistency also affects server startup. In a multi-server stripe, when the servers are started up fresh, a server will not be elected as the active server until it gets votes from all of its peers. This requires all the servers of that stripe to be brought up. Bringing up regular voters does not help, because a voter must communicate with all the active servers in the cluster to get registered, and no server is active yet. If bringing up the other servers is not feasible for some reason, an override voter can be used to forcibly promote that server.