Handling node failure and failover

A node may stop processing events from time to time. This may be because it is stopped for planned maintenance, or the node failed in some way. In these cases:

This does not apply if you are sending and receiving events via JMS where you have configured JMS for reliable messaging. See Avoiding message loss with JMS for more information.

If using JMS, then events continue to be delivered to and processed by other correlators in the cluster. The failed correlator will not hold up processing on other nodes. Other nodes continue processing events, including matching against events that the failed node had previously received (if they had been processed).

Any clients connected to the failed correlator will need to re-connect to another correlator. The same set of parameterized query instances is kept in synchronization across the cluster. See Managing parameterized query instances.

Similarly, nodes running a Terracotta Server Array may fail. For this reason, TCStore or BigMemory should be configured with sufficient backups to ensure no data is lost in this case. Consult the Terracotta documentation.

If all of your incoming and outgoing events are received/sent via correlator-integrated JMS (see also Correlator-Integrated Support for the Java Message Service (JMS)) and if this has been configured with APP_CONTROLLED receivers and BEST_EFFORT senders (see also Sending and receiving reliably without correlator persistence), then no events are lost in the event of a node failure. Any events that have been delivered from JMS to queries on that node are then handled by another node if they had not been fully processed before the failure. Any events sent to JMS by queries on that node are delivered by another node if they had not been successfully delivered before the failure.

This only works if the queries (or a chain of queries) are receiving events directly from JMS receivers and are sending their output directly to JMS senders. There are no guarantees if EPL monitors are processing query input or output, interposing themselves between the queries and JMS.

No EPL monitors in the same correlator should be performing acknowledgments to APP_CONTROLLED receivers themselves, as those receivers are entirely under the control of the queries runtime.

Incoming events may be delivered twice or be delivered out of order during the failover window. This is the time between the node failure and the cluster (including the JMS broker) detecting the failure/disconnection. It is your responsibility to make sure that your queries are not sensitive to duplicates or re-ordering within this failover window.

You should also ensure that the JMS broker does not lose messages in the case of a broker failure. Make sure that all JMS senders have their messageDeliveryMode property set to PERSISTENT, as well as doing any necessary broker-specific configuration on the broker itself.

Note: Reliable messaging will not take effect unless your queries are exclusively using correlator-integrated JMS as their message source and destination. It does not apply when using connectivity plug-ins as your event source or destination (even if they support reliable messaging).