Deployment Configuration: Fully Redundant
Description
This is the fully redundant network configuration. It relies on the failover capabilities of Terracotta, the switches, and the operating system. In this scenario it is even possible to sustain certain double failures and still maintain a fully functioning cluster.
In this diagram, the IP addressing scheme merely demonstrates that the L1s (L1a and L1b) can be on a different subnet than the L2s (TCserverA and TCserverB). The actual addressing scheme will be specific to your environment. If you implement the cluster on a single subnet, VRRP/HSRP is not needed, but you must still configure a single VLAN (which can be VLAN 1) for all TC cluster machines.
In this diagram, two switches are connected with trunked links for redundancy. The switches implement Virtual Router Redundancy Protocol (VRRP) or Hot Standby Router Protocol (HSRP) to provide redundant network paths to the cluster servers in the event of a switch failure. Additionally, all servers are configured with both a primary and a secondary network link, which are controlled by the operating system. In the event of a NIC or link failure on any single link, the operating system should fail over to the backup link without disturbing (for example, restarting) the Java processes (L1 or L2) on the systems.
Terracotta failover is identical to that in the simple case above. However, in this scenario both NICs on a single host must fail before the TC software initiates any failover of its own.
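The redundancy described above can be reasoned about as a small connectivity graph. The following Python sketch is only an illustration, not part of the product: it assumes every host has one link to each switch, treats the trunk as a single logical edge that is down only when all trunk links are down, and ignores VRRP/HSRP convergence behavior. The host and switch names come from this guide; the function names are hypothetical.

```python
from collections import deque

L1S = frozenset({"L1a", "L1b"})
L2S = frozenset({"TCserverA", "TCserverB"})
SWITCHES = frozenset({"switchA", "switchB"})
TRUNK = frozenset(SWITCHES)  # all trunk links modeled as one logical edge


def all_edges():
    """Assumed wiring: each host links to both switches; switches share the trunk."""
    edges = {frozenset((host, sw)) for host in L1S | L2S for sw in SWITCHES}
    edges.add(TRUNK)
    return edges


def cluster_functional(failed_nodes=(), failed_edges=()):
    """Return True if at least one live L1 can still reach at least one live L2."""
    down = set(failed_nodes)
    cut = {frozenset(edge) for edge in failed_edges}
    adjacency = {}
    for edge in all_edges() - cut:
        if edge & down:
            continue  # an edge is unusable when either endpoint is down
        a, b = edge
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    for start in L1S - down:
        seen, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            # Only switches forward traffic; other hosts are endpoints.
            if node != start and node not in SWITCHES:
                continue
            for neighbor in adjacency.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        if seen & L2S:
            return True
    return False


if __name__ == "__main__":
    print(cluster_functional(failed_nodes={"switchA"}))             # single switch loss
    print(cluster_functional(failed_nodes={"switchA", "L1a"}))      # a double failure
    print(cluster_functional(failed_nodes={"switchA", "switchB"}))  # both switches lost
```

Under these assumptions the model reports a functioning cluster for the loss of one switch, and for the loss of one switch together with one L1, but not for the loss of both switches. It says nothing about how long a failover takes or whether a split virtual router still passes traffic.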
Additional configuration
* Switch - Switches need to implement VRRP or HSRP to provide redundant gateways for each subnet. Switches also need a trunked connection of two or more links so that no single link failure can split the virtual router in two.
* Operating System - Hosts need to be configured with bonded network interfaces connected to the two different switches. For Linux, choose mode 1 (active-backup). More information about Linux channel bonding can be found in the Linux Bonding Driver description at http://www.linuxfoundation.org/collaborate/workgroups/networking/bonding. Pay special attention to the amount of time it takes for your VRRP or HSRP implementation to reconverge after a recovery, because you do not want your NICs to switch back to a switch that is not yet ready to pass traffic. The delay before a recovered link is used again should be tunable in your bonding configuration.
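On Linux, the bonding driver reports its state in a status file, typically /proc/net/bonding/bond0. The following Python sketch shows one way to check that a bond is in mode 1 (active-backup) and to read the configured up delay. It is a minimal illustration under stated assumptions: the interface names (bond0, eth0, eth1), the sample text, and the function names are hypothetical, and the exact fields in the status file can vary between kernel versions.

```python
# Illustrative sample modeled on typical bonding driver status output.
# Values and interface names are assumptions, not taken from a real host.
SAMPLE_STATUS = """\
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 200
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up

Slave Interface: eth1
MII Status: up
"""


def parse_bonding_status(text):
    """Extract the mode, active slave, up delay, and per-slave link state."""
    status = {"mode": None, "active_slave": None, "up_delay_ms": None, "slaves": {}}
    current_slave = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or line without a "key: value" pair
        key, value = key.strip(), value.strip()
        if key == "Bonding Mode":
            status["mode"] = value
        elif key == "Currently Active Slave":
            status["active_slave"] = value
        elif key == "Up Delay (ms)":
            status["up_delay_ms"] = int(value)
        elif key == "Slave Interface":
            current_slave = value
            status["slaves"][current_slave] = None
        elif key == "MII Status" and current_slave is not None:
            # MII Status lines before the first slave describe the bond itself.
            status["slaves"][current_slave] = value
    return status


def is_active_backup(status):
    """True when the reported mode is active-backup (bonding mode 1)."""
    return status["mode"] is not None and "active-backup" in status["mode"]


if __name__ == "__main__":
    # On a real host you might read the text with open("/proc/net/bonding/bond0").
    parsed = parse_bonding_status(SAMPLE_STATUS)
    print(is_active_backup(parsed), parsed["active_slave"], parsed["up_delay_ms"])
```

Comparing the reported up delay with the measured VRRP or HSRP reconvergence time of your switches is one way to judge whether a recovered link might be used before the switch is ready.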
Test Plan - Network Failures Redundant Network
The following tests continue the tests listed in Network Failures (Pt. 1). Use these tests to confirm that your network is configured properly.
TestID | Failure | Expected Outcome
FS8 | Loss of any primary network link | Failover to standby link
FS9 | Loss of all primary links | All nodes fail to their secondary link
FS10 | Loss of any switch | Remaining switch assumes VRRP address and switches fail over NICs if necessary
FS11 | Loss of any L1 (both links or system) | Cluster continues as normal using only other L1
FS12 | Loss of Active L2 | Mirror L2 becomes the new Active L2; all L1s fail over to the new Active L2
FS13 | Loss of mirror L2 | Cluster continues as normal without TC redundancy
FS14 | Loss of both switches | Non-functioning cluster
FS15 | Loss of single link in switch trunk | Cluster continues as normal without trunk redundancy
FS16 | Loss of both trunk links | Possible non-functioning cluster depending on VRRP or HSRP implementation
FS17 | Loss of both L1s | Non-functioning cluster
FS18 | Loss of both L2s | Non-functioning cluster
Test Plan - Network Testing Redundant Network
After the network has been configured, you can test your configuration with simple ping tests and various failure scenarios.
The test plan for Network Testing consists of the following tests:
TestID | Host | Action | Expected Outcome
NT4 | any | Ping every other host | Successful ping
NT5 | any | Pull primary link during continuous ping to any other host | Failover to secondary link, no noticeable network interruption
NT6 | any | Pull standby link during continuous ping to any other host | No effect
NT7 | Active L2 | Pull both network links | Mirror L2 becomes Active, L1s fail over to new Active L2
NT8 | Mirror L2 | Pull both network links | No effect
NT9 | switchA | Reload | Nodes detect link down and fail to standby link, brief network outage if VRRP transition occurs
NT10 | switchB | Reload | Brief network outage if VRRP transition occurs
NT11 | switch | Pull single trunk link | No effect
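For the continuous-ping tests (NT5 and NT6), "no noticeable network interruption" can be quantified by looking for missing sequence numbers in the captured ping output. The following Python sketch assumes Linux-style reply lines containing "bytes from" and "icmp_seq=", and a one-second ping interval. The sample output, the IP addresses, and the function names are hypothetical.

```python
import re

# Match only successful reply lines, not "Destination Host Unreachable" lines.
REPLY_RE = re.compile(r"bytes from [^\n]*icmp_seq=(\d+)")

# Hypothetical capture: replies 3 and 4 were lost while a link was pulled.
SAMPLE_PING_OUTPUT = """\
64 bytes from 10.0.1.11: icmp_seq=1 ttl=64 time=0.21 ms
64 bytes from 10.0.1.11: icmp_seq=2 ttl=64 time=0.19 ms
From 10.0.2.21 icmp_seq=3 Destination Host Unreachable
64 bytes from 10.0.1.11: icmp_seq=5 ttl=64 time=0.23 ms
"""


def find_ping_gaps(output):
    """Return (first_missing, last_missing) pairs between received replies."""
    seqs = [int(match.group(1)) for match in REPLY_RE.finditer(output)]
    gaps = []
    for previous, current in zip(seqs, seqs[1:]):
        if current - previous > 1:
            gaps.append((previous + 1, current - 1))
    return gaps


def longest_outage_seconds(output, interval=1.0):
    """Estimate the longest outage, assuming one ping every `interval` seconds.

    Replies lost after the last received reply are not counted, because
    the capture contains no later reply to bound the gap.
    """
    gaps = find_ping_gaps(output)
    return max(((last - first + 1) * interval for first, last in gaps), default=0.0)


if __name__ == "__main__":
    print(find_ping_gaps(SAMPLE_PING_OUTPUT))
    print(longest_outage_seconds(SAMPLE_PING_OUTPUT))
```

A capture with no gaps is consistent with the "no effect" outcome, while a short gap during NT5, NT9, or NT10 gives a rough measure of the failover or VRRP transition time.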