Automatic Recovery Scenarios
Orchestrator Failover
We recommend running multiple Orchestrators in a region, either on the same machine or rack or on different racks, to ensure availability. Only one master Orchestrator is required, but you can start one or more standby Orchestrators at any time while the application is running to provide failover protection.
Standby Orchestrators are passive, so running extra Orchestrator processes has minimal runtime impact during normal operation. If the active Orchestrator fails, the other regions look for the next available Orchestrator and resume replication (replica caches undergo a full sync after a master failover).
Note:
The failover process for a Replica Orchestrator is not done in the same fault-tolerant manner as for failover of a Master Orchestrator. Once the standby Replica Orchestrator takes over, it re-syncs all Replica caches from scratch, then switches to normal replication mode. This behavior can cause some slowness if the Master cache contains a very large number of entries.
Master-Replica Disconnection
When a Master cache fails, control is given to the failover Master (if any) listed in wan-config.xml. A Replica cache will not take over as a Master. If there is no failover Master, the Replicas will continue to operate in isolation (i.e., no replication will take place). When the Master restarts, the behavior of its Replicas is governed by the replicaDisconnectBehavior property in wan-config.xml. By default, the Replicas will attempt to reconnect to the Master and to the failover Master listed in wan-config.xml. Upon reconnection to a Master, the Replicas are deactivated, cleared, resynchronized to the Master cache, and then reactivated. All local changes on the Replica region will be dropped in favor of whatever is in the Master region. (Similarly, even if the Master does not fail but its Replicas become disconnected from their Master, their behavior is also controlled by the replicaDisconnectBehavior property.) For more information, refer to Orchestrator Configuration Parameters.
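The reconnection sequence described above (deactivate, clear, resynchronize, reactivate) can be pictured with a small in-memory model. This is only an illustrative sketch: ReplicaCacheModel and its methods are hypothetical names invented for this example and are not part of the product API. It shows why local changes made on a Replica while it was disconnected do not survive the resync.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory model of a Replica cache; not a product class. */
final class ReplicaCacheModel {
    private final Map<String, String> entries = new HashMap<>();
    private boolean active = true;

    /** Local write; rejected while the cache is deactivated. */
    void put(String key, String value) {
        if (!active) {
            throw new IllegalStateException("cache is deactivated");
        }
        entries.put(key, value);
    }

    String get(String key) {
        return entries.get(key);
    }

    boolean isActive() {
        return active;
    }

    /** Models the four steps applied when the Replica reconnects to a Master. */
    void resyncFrom(Map<String, String> masterSnapshot) {
        active = false;                  // 1. deactivate: cache operations are blocked
        entries.clear();                 // 2. clear: all local changes are dropped
        entries.putAll(masterSnapshot);  // 3. resynchronize from the Master cache
        active = true;                   // 4. reactivate
    }
}
```

In this model, a key written only on the Replica before the resync is absent afterwards, and a key present on both sides takes the Master's value.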
If no failover Master is listed in wan-config.xml and the Master is lost, the operator has the option to restart a Replica to act as a Master, as described below.
In a bi-directional configuration without a failover Master, you should choose your most important Replica region to take over as the Master because any changes that occurred since the Master failed will be lost in all other regions. To do this, change your wan-config.xml to reflect that the Replica is now the Master, and then restart the Replica. It would be a good idea to remove or comment out the old Master in case it comes back.
In a uni-directional configuration without a failover Master, since the "writes" are only performed in one region, you will not lose any changes. You may simply designate that Replica region as the new Master.
TSA Disconnection
Upon any disconnection from its TSA, the Orchestrator deactivates and waits the amount of time specified by the l2.l1reconnect.timeout.millis property. This property is described in "Automatic Client Reconnect" in the BigMemory Max High-Availability Guide.
When communication between the TSA and the Orchestrator is resumed:
If the downtime was shorter than the configured reconnect timeout, then WAN replication will resume immediately, without the need for any resynchronization.
If the downtime exceeded the configured reconnect timeout, then the Orchestrator will resynchronize all of its Master caches, and subsequently those Master caches will resynchronize all of their Replica caches.
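The two outcomes above amount to a single comparison between the observed downtime and the configured reconnect timeout. The following sketch models that decision only; TsaReconnectModel and Outcome are hypothetical names made up for illustration, and the handling of a downtime exactly equal to the timeout (treated here as a resync) is an assumption, since the text above does not specify that case.

```java
import java.time.Duration;

/** Hypothetical model of the Orchestrator's behavior after a TSA reconnect. */
final class TsaReconnectModel {

    /** Illustrative outcomes; these names are not part of the product API. */
    enum Outcome { RESUME_REPLICATION, FULL_RESYNC }

    /**
     * @param downtime         how long the Orchestrator was disconnected from the TSA
     * @param reconnectTimeout value of l2.l1reconnect.timeout.millis, as a Duration
     */
    static Outcome onReconnect(Duration downtime, Duration reconnectTimeout) {
        // Shorter than the timeout: replication resumes without resynchronization.
        if (downtime.compareTo(reconnectTimeout) < 0) {
            return Outcome.RESUME_REPLICATION;
        }
        // Otherwise: Master caches are resynchronized, then their Replica caches.
        // Assumption: a downtime exactly equal to the timeout is treated as exceeded.
        return Outcome.FULL_RESYNC;
    }
}
```

For example, with a timeout of 5 seconds, a 2-second outage maps to RESUME_REPLICATION and a 10-second outage maps to FULL_RESYNC.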
Cache Recovery Operations
Unidirectional cache
In the master region, cache operations remain available even if the local Orchestrator goes down.
In the replica region, cache operations remain available even if the local Orchestrator goes down, but the replica region does not receive any updates from the master region. Once the Orchestrator comes back up, it connects to the master Orchestrator, deactivates the cache (cache operations are blocked), and performs a full sync.
This applies only if replicaDisconnectBehavior = reconnectResync (the default).
Bidirectional cache
In the master region, cache operations remain available even if the Orchestrator goes down.
In the replica region, cache operations wait for the local Orchestrator to come back up. Once it is up, it connects to the master region and performs a full sync.
This applies only if replicaDisconnectBehavior = reconnectResync (the default).
With a standby Orchestrator, the standby automatically becomes the active Orchestrator, and:
In the replica region, both unidirectional and bidirectional cache operations wait until the full sync completes and the cache is reactivated.
This applies only if replicaDisconnectBehavior = reconnectResync (the default).
In the master region, both unidirectional and bidirectional cache operations continue, and the new active Orchestrator syncs all replica Orchestrators.
This applies only if replicaDisconnectBehavior = reconnectResync (the default).
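The cases listed in this section can be condensed into two small lookups, assuming the default replicaDisconnectBehavior of reconnectResync. The sketch below is a reading aid rather than product code: CacheAvailabilityModel, Direction, and Region are hypothetical names introduced for this example.

```java
/** Hypothetical summary of the cache recovery cases; not a product class. */
final class CacheAvailabilityModel {

    enum Direction { UNIDIRECTIONAL, BIDIRECTIONAL }
    enum Region { MASTER, REPLICA }

    /**
     * Whether cache operations proceed while the local Orchestrator is down
     * and no standby has taken over. Only a bidirectional cache in a replica
     * region waits for the Orchestrator to return.
     */
    static boolean operationsProceedWhileOrchestratorDown(Direction direction, Region region) {
        return !(direction == Direction.BIDIRECTIONAL && region == Region.REPLICA);
    }

    /**
     * Whether cache operations are blocked during the full sync that follows
     * Orchestrator recovery or standby takeover. Replica caches are deactivated
     * for the sync; master caches keep operating.
     */
    static boolean blockedDuringFullSync(Region region) {
        return region == Region.REPLICA;
    }
}
```

Note that a unidirectional replica keeps serving operations while its Orchestrator is down, but it serves data that is no longer being updated from the master region.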
Master Region Failure
If the master region fails entirely, some data writes will most likely not have been completed (that is, not propagated to the Replica region), even though the application may believe they were. As described in this section, the recommended action is to promote the Replica region's caches to Master.