High availability systems

Overview

High availability systems minimize the risk that the failure of an individual component (a single point of failure, SPOF) brings down the entire system. Such systems are usually based on redundancy, i.e., multiple identical single systems holding identical data are operated in parallel. One of these single systems is the main system that users normally access. Another is the backup system. If the main system fails, the backup system takes over automatically. The cause of the outage is irrelevant: it can be a sudden, unpredictable event such as a hardware fault or a planned maintenance activity.

To detect an error, specific system monitoring messages can be used (e.g., S.M.A.R.T. status reports for identifying hard drive errors). This is complemented by frequent system queries that require a correct response. Queries of this type are called heartbeats. Heartbeats can be active at various system levels, for example as application-specific queries (application pings) or as system resource monitoring (hardware monitoring). An error is assumed when a system error message is received or a heartbeat is not answered correctly. In that case, the backup system takes over automatically and a message is sent to the system administrator.
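The heartbeat procedure described above can be sketched as follows. This is a minimal illustration, not part of any product: the class name HeartbeatMonitor, the probe callable, and the max_missed threshold are assumptions made for this example. In particular, tolerating several missed heartbeats before switching is a common design choice that the text above does not prescribe.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class HeartbeatMonitor:
    """Hypothetical monitor that watches the main system via heartbeats."""

    # Assumed health check: returns True if the main system answered correctly.
    probe: Callable[[], bool]
    # Assumption: switch only after this many consecutive missed heartbeats.
    max_missed: int = 3
    missed: int = 0
    failed_over: bool = False
    notifications: List[str] = field(default_factory=list)

    def beat(self) -> bool:
        """Send one heartbeat; return True if the main system answered correctly."""
        if self.failed_over:
            # The backup system is already active; the main system is not probed.
            return False
        try:
            ok = self.probe()
        except Exception:
            # No response, or an error message, counts as a missed heartbeat.
            ok = False
        if ok:
            self.missed = 0
            return True
        self.missed += 1
        if self.missed >= self.max_missed:
            # Switch to the backup system and inform the administrator.
            self.failed_over = True
            self.notifications.append(
                "main system failed; backup system activated"
            )
        return False
```

In a real setup, beat() would be called at a fixed interval and the notification would be sent by e-mail or a monitoring tool rather than appended to a list.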

After the switch to the backup system, the error in the main system can be analyzed and fixed. Once the error is fixed and the data has been synchronized, the main system can take over again. High availability systems distinguish between two strategies: hot standby and cold standby.
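The order of these steps (switch to backup, repair, synchronize, switch back) can be expressed as a small state machine. The following is only a sketch under assumptions: the names Role and FailoverController and all method names are hypothetical and chosen for this illustration.

```python
from enum import Enum


class Role(Enum):
    MAIN = "main"
    BACKUP = "backup"


class FailoverController:
    """Hypothetical controller tracking which system currently serves users."""

    def __init__(self) -> None:
        self.active = Role.MAIN
        self.main_repaired = True
        self.main_synchronized = True

    def main_failed(self) -> None:
        # The backup system takes over; the main system needs repair and fresh data.
        self.active = Role.BACKUP
        self.main_repaired = False
        self.main_synchronized = False

    def repair_main(self) -> None:
        # The error in the main system has been analyzed and fixed.
        self.main_repaired = True

    def synchronize_main(self) -> None:
        # Data can only be synchronized to a repaired main system.
        if not self.main_repaired:
            raise RuntimeError("repair the main system before synchronizing")
        self.main_synchronized = True

    def fail_back(self) -> None:
        # The main system may take over again only after repair and synchronization.
        if not (self.main_repaired and self.main_synchronized):
            raise RuntimeError("main system is not repaired and synchronized")
        self.active = Role.MAIN
```

Calling fail_back() before synchronize_main() raises an error, which mirrors the requirement that the main system takes over only after data synchronization.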

Hot standby means that the system remains available in case of failure, even during an active session. Users do not notice that the main system has failed and a backup system has been activated. The switch to the backup system happens without delay and without interrupting active user sessions. This strategy is mainly used in mission-critical systems, for example, where a failure would threaten the safety and health of the general public.

With cold standby, in contrast, activating the backup system and switching over to it takes a certain amount of time when the main system fails. During the switchover the system is not available. Consequently, there is no guarantee that active user sessions are resumed once the system is available again on the backup system.
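The effect of the switchover time on availability can be estimated with simple arithmetic. The sketch below assumes that downtime is caused only by switchovers and that every switchover takes the same time. The function name and the example figures (4 failures per year, 30 minutes per switchover) are hypothetical values for illustration, not measurements.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def availability(failures_per_year: float, switchover_minutes: float) -> float:
    """Estimate the fraction of the year during which the system is available.

    Assumption: the only downtime is the time needed to switch to the backup.
    """
    downtime_minutes = failures_per_year * switchover_minutes
    return 1.0 - downtime_minutes / MINUTES_PER_YEAR


# Cold standby with hypothetical figures: 4 failures, 30 minutes each.
cold = availability(failures_per_year=4, switchover_minutes=30)

# Hot standby in this simplified model: switching takes no time at all.
hot = availability(failures_per_year=4, switchover_minutes=0)

print(f"cold standby: {cold:.4%}, hot standby: {hot:.4%}")
```

In this simplified model hot standby yields full availability, whereas cold standby loses the accumulated switchover time.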

PPM high availability system

PPM is an analytical application that imports data, performs calculations on it, and then saves it to a database schema. The database system used must ensure the integrity of the data. You can restore a specific system status at any time by importing the reference data again. PPM itself does not support high availability criteria or adaptive computing concepts, such as restoring an interrupted session between PPM server and client. Indirectly, though, you can implement scenarios in which you switch from the main system to an existing backup system. In that case, however, you need to accept a certain downtime during which the PPM system is unavailable (cold standby).

A 3-level system concept is recommended for setting up such a PPM high availability system. This means that the database server used by PPM is moved to an independent system that already fulfills high availability criteria. Database manufacturers usually offer high availability versions of their products. The other PPM system components are installed and operated on a separate system.

Archive the PPM system at regular intervals as described in the chapter Archiving. If the PPM system fails, restore it on a new system with comparable hardware properties as described in the chapter Restore.