Failover Tuning for Guaranteed Consistency
In a clustered environment, any network, hardware, or other issue can cause an active node to become partitioned from the cluster.
The detection of such a situation results in an Active-left event.
As described in the previous sections, the default behavior of a TSA is that the remaining passive node then runs an election and, if it does not find the active node (5 seconds by default), takes over as Active.
While this configuration ensures high availability of the data, it increases the risk of a so-called split-brain situation during such elections.
In the case of a TSA, split brain would be a situation in which both nodes are acting as Active. Any further operations performed on the data are likely to result in inconsistencies.
So, depending on mission priorities, a cluster may also be configured to emphasize consistency.
Note:
Using Failover Tuning requires the feature to be part of the license key. Contact your Software AG representative if you have any questions.
AVAILABILITY versus CONSISTENCY
AVAILABILITY
*default election window of 5 seconds
*Passives automatically become Actives
CONSISTENCY
*explicit setting of <failover-priority>
*Passives move to the WAITING-FOR-PROMOTION state and request that the operator issue a fail-over-action command
Note:
Even in the absence of any configuration value, the default behavior is Availability.
Supported Configuration
Tuning the failover priority to CONSISTENCY can be applied to clusters consisting of up to two data centers.
Each mirror group in the cluster has one active and one mirror node. Mirror groups that have more than one mirror node are not supported.
A very common example scenario would be Integration Server instances (one per data center) that require a shared (Active-Active replicated) database.
Figure: IS Active-Active
How to Switch to failover-priority CONSISTENCY:
Since the default setting is AVAILABILITY, switching to CONSISTENCY must be done explicitly by setting the following property:

<tc-config>
  <servers>
    ...
    <failover-priority>CONSISTENCY</failover-priority>
    ...
  </servers>
</tc-config>
Note: 
In the CONSISTENCY scenario, no mirror group can have more than two servers configured; otherwise, startup will fail.
The CONSISTENCY setting has no effect in a single-node stripe, but it will take effect for the modified stripe once a new server is added.
How Failover Works in Various Scenarios
If the Active Node Fails?
If the active node fails, the mirror node automatically stops processing, but all data is preserved and there are no lost transactions. At this point, the mirror node can only determine that the connection to the active node has been lost, but cannot determine whether the active node has failed, or whether there is just a break in the network connection. The mirror node sets its own status to WAITING_FOR_PROMOTION and waits for human interaction to determine why the connection has been lost.
If the Mirror Node Fails?
If the mirror node fails, the active node continues operation without interruption. Human intervention is required to restart the mirror node.
If the Network Connection Fails?
If the network connection fails, the active node cannot determine if the mirror node is still operating. However, the active node will continue without interruption. The mirror node cannot determine if the active node is still operating, so the mirror node will proceed as if the active node is not operating.
See above for details of how the mirror node reacts in this case.
What Happens to Transactions in Transit During Failover?
Provided the client is running, transactions will not be lost as they will be replayed to the new Active.
Monitoring Using Server Stats Script
The server-stat.[sh/bat] script packaged with this software is the ideal tool for monitoring a cluster configured for failover tuning. The script delivers information about the states of the clustered servers. The following block shows part of the output of server-stat.[sh/bat] in a live 2-node cluster:

g1s1.health: OK
g1s1.role: ACTIVE
g1s1.state: ACTIVE-COORDINATOR
g1s1.port: 9540
g1s1.group name: group1

g1s2.health: OK
g1s2.role: PASSIVE
g1s2.state: PASSIVE-STANDBY
g1s2.port: 9640
g1s2.group name: group1
If the Active server crashes, server-stat.[sh/bat] delivers output such as the following:

localhost.health: unknown
localhost.role: unknown
localhost.state: unknown
localhost.port: 9540
localhost.group name: unknown
localhost.error: Connection refused to localhost:9540. Is the TSA running?

g1s2.health: OK
g1s2.role: WAITING-FOR-PROMOTION
g1s2.state: PASSIVE-STANDBY
g1s2.port: 9640
g1s2.group name: group1
The "role" field of the Passive server indicates that is waiting for promotion.
Monitoring Using REST Endpoints
The server-stat utility internally uses a REST endpoint on the servers to fetch the information shown in the output. The same REST endpoint can be addressed directly to retrieve this information, using the following URL:
<server:mgmt-port>/tc-management-api/v2/local/stat
Pointing this URL at a Passive node that is waiting for promotion gives a response such as the following:
{"health":"OK","role":"WAITING-FOR-PROMOTION","state":"PASSIVE-STANDBY",
"managementPort":"<port number>","serverGroupName":"group1","name":"g1s2"}
The "role" attribute indicates that the Passive server is waiting for promotion.
Monitoring Using TMC
In the case of a partition of Active and Passive, TMC will receive operator events indicating that the Passive is waiting for promotion.
However, if the Active node acting as the active-coordinator of the cluster fails, the TMC will be unable to deliver any useful information about the cluster.
In all other cases, TMC provides accurate operator events.
CONSISTENCY: How to Start Up and What to Consider
The command for starting up a server can be extended with the --active flag:
$KIT/server/bin/start-server.sh[bat] -f /path/to/tc-config.xml -n <server-name> --active
*If this flag is set, the node will run an election and, if it wins, will become the Active.
*If an Active is found during the election, this flag is ignored and the node will join the cluster as a Passive.
*If this flag is not set at all, the node will look for an Active until one is found and, once an Active responds, will join as a Passive.
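The following sketch shows a possible start-up sequence for a two-server stripe configured with CONSISTENCY; the server names are taken from the server-stat example above, and the config path is a placeholder.

# On the first host: ask g1s1 to run an election and, if it wins, become the Active.
$KIT/server/bin/start-server.sh -f /path/to/tc-config.xml -n g1s1 --active

# On the second host: start g1s2 without the flag; it looks for an Active and joins as a Passive.
$KIT/server/bin/start-server.sh -f /path/to/tc-config.xml -n g1s2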
CONSISTENCY: The fail-over-action Command
When a Passive standby running with failover-priority "CONSISTENCY" detects that the Active node has left, it moves to the WAITING-FOR-PROMOTION state.
This state ...
*raises operator alerts, which appear in the TMC as well as in the log files
Note:
The TMC may not be accessible if the Active of stripe 0 is out of operation. Consequently, it may be necessary to integrate the alert into third-party software or to access the logs directly.
*initiates continuous logging
*waits for an external trigger
This trigger is provided by the fail-over-action command, which must be executed by an operator:
$KIT/server/bin/fail-over-action.sh[bat] -f /path/to/tc-config.xml -n <server-name> --promote|--restart|--failFast
This command calls a REST endpoint to pass the fail-over action to the node specified by <server-name>.
*--promote - The node will move to the ACTIVE_COORDINATOR state, provided the node is currently in the WAITING-FOR-PROMOTION state.
*--restart - The node will log appropriately, mark the DB as dirty, and shut down. The server will restart automatically.
*--failFast - The node will log appropriately and shut down without any changes to the database. The server will not restart automatically.
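For example, once the operator has confirmed that the old Active is genuinely down (and not merely unreachable due to a network partition), the waiting Passive from the examples above could be promoted; the server name and config path are placeholders.

# Promote the Passive g1s2, which is in the WAITING-FOR-PROMOTION state, to Active.
$KIT/server/bin/fail-over-action.sh -f /path/to/tc-config.xml -n g1s2 --promote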