High Availability Is Key

Before discussing the endurance of a system, a few terms need to be defined.

Availability is the degree to which a system or service is reachable over a specific period, usually expressed as the percentage of that period during which it was up.

Resiliency describes how well a system recovers after a failure.

In systems theory, fault tolerance is the attribute that enables a system to continue operating when one or more of its components fail.

Reliability describes how dependably a system performs its intended function over time.

Redundancy is a system design that duplicates critical components so that an alternative is available if one component fails.

An application running on a single machine has a single point of failure, which leads to poor system reliability: if that machine fails, the application will almost certainly be down until the machine has recovered. Here, redundancy can play a crucial role in keeping your application running, and it thereby improves your system's reliability.
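To make the effect of redundancy concrete, here is a minimal sketch in Python. It assumes that availability is measured as the fraction of an observed period during which a machine is reachable, and that redundant machines fail independently of each other, which real systems rarely satisfy exactly. The function names and the 99% figure are illustrative assumptions.

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Fraction of the observed period during which the system was reachable."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("the observed period must be positive")
    return uptime_hours / total


def redundant_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant machines.

    Assumes independent failures: the application is only down
    when every copy is down at the same time.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("at least one machine is required")
    return 1.0 - (1.0 - single) ** copies


if __name__ == "__main__":
    # Hypothetical machine: 99 hours up, 1 hour down.
    one_machine = availability(uptime_hours=99, downtime_hours=1)
    two_machines = redundant_availability(one_machine, copies=2)
    print(f"single machine: {one_machine:.4%}")
    print(f"two machines:   {two_machines:.4%}")
```

Under these assumptions, a second machine turns one failure in a hundred into roughly one in ten thousand. Correlated failures, such as a shared power supply or network, reduce that gain in practice.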

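To achieve this, you can run your application on a second system. If the second system is already up and running the application alongside the first, we are talking about an active-active configuration. Typically both systems serve traffic, although the second system does not strictly need to be receiving traffic as long as it is ready to take over (such a running but idle system is often called a hot standby). It is also possible that the second system is not currently running (a cold system), but your application is already preconfigured on it. This type is called an active-passive configuration.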

When a problem is detected, the procedure differs for each type:

A failure on an active-active system leads to an immediate failover to the second machine, because that machine is already running.

A failure on an active-passive system leads to a failover to the second machine only once that machine has been started and is live.
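The two procedures can be sketched as follows. This is a simplified illustration, not the behavior of any particular failover product: `Mode`, `Machine`, and `fail_over` are hypothetical names, and starting a cold system is reduced to flipping a flag.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    ACTIVE_ACTIVE = "active-active"
    ACTIVE_PASSIVE = "active-passive"


@dataclass
class Machine:
    name: str
    running: bool

    def start(self) -> None:
        # Stand-in for booting a cold system and starting the application.
        self.running = True


def fail_over(mode: Mode, secondary: Machine) -> list[str]:
    """Return the steps taken to move traffic to the secondary machine."""
    steps = []
    if mode is Mode.ACTIVE_PASSIVE and not secondary.running:
        # Cold system: it has to come up before it can take traffic.
        secondary.start()
        steps.append(f"start {secondary.name}")
    if not secondary.running:
        # An active-active secondary is expected to be running already.
        raise RuntimeError(f"{secondary.name} is not running")
    steps.append(f"route traffic to {secondary.name}")
    return steps
```

For an active-active pair, `fail_over(Mode.ACTIVE_ACTIVE, Machine("node-b", running=True))` consists of a single routing step. For an active-passive pair with a cold second machine, a start step comes first, and that extra step is where the additional downtime comes from.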

In our case, a cluster consisting of several nodes is an active-active system, because all nodes are running the process at the same time. Such a system is considered fault-tolerant, meaning that the loss of a node can easily be compensated for. Clusters are not only more fault-tolerant but also quite resilient: recovered nodes rejoin the system, which restores the cluster's optimal size.
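As a rough illustration of this behavior, the following toy model tracks which nodes of a cluster are healthy. The `Cluster` class and its method names are assumptions made for this sketch; a real cluster manager also has to handle health checks, membership agreement, and state synchronization.

```python
class Cluster:
    """Toy model of a cluster that tracks which of its nodes are healthy."""

    def __init__(self, node_names):
        self.members = set(node_names)
        if not self.members:
            raise ValueError("a cluster needs at least one node")
        self.healthy = set(self.members)

    @property
    def optimal_size(self) -> int:
        # The size the cluster was configured with.
        return len(self.members)

    def node_failed(self, name: str) -> None:
        # Fault tolerance: the remaining healthy nodes keep serving.
        self.healthy.discard(name)

    def node_recovered(self, name: str) -> None:
        # Resiliency: a recovered member rejoins the cluster.
        if name not in self.members:
            raise KeyError(f"{name} is not a member of this cluster")
        self.healthy.add(name)

    def is_serving(self) -> bool:
        return bool(self.healthy)

    def at_optimal_size(self) -> bool:
        return len(self.healthy) == self.optimal_size


if __name__ == "__main__":
    cluster = Cluster(["node-1", "node-2", "node-3"])
    cluster.node_failed("node-2")
    print(cluster.is_serving(), cluster.at_optimal_size())
    cluster.node_recovered("node-2")
    print(cluster.is_serving(), cluster.at_optimal_size())
```

After the simulated failure, the cluster is still serving but is below its optimal size. Once the node has recovered and rejoined, the optimal size is restored.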

In the event of a failure, a system needs both resiliency and fault tolerance to avoid the costs of system outages. Because clusters provide both, they enable higher availability and are known as highly available or failover clusters.