Handling node failure and failover
A node may stop processing events from time to time, either because it has been stopped for planned maintenance or because it has failed in some way. In these cases:
* Events that have been delivered to the node but not yet processed will be lost. This will typically be a small window of events.
* Patterns that end with a wait operator may be dropped if they are being handled by a node that has stopped. For example, if a query contains the pattern MyEvent:e -> wait(5.0) and the node that receives a matching MyEvent stops before the 5 seconds have elapsed, then the pattern will not be matched in this instance.
* If using JMS, events continue to be delivered to and processed by other correlators in the cluster; the failed correlator will not hold up processing on other nodes. Other nodes continue processing events, including matching against events that the failed node had received, provided those events had already been processed.
* Any clients connected to the failed correlator will need to re-connect to another correlator. The same set of parameterized query instances is kept in synchronization across the cluster. See Managing parameterized query instances.
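To illustrate the wait-operator case in the list above, the following is a minimal sketch of a query whose pattern ends with wait. The query name, event type, and window duration are hypothetical, chosen only for illustration; they do not come from this document.

```
// Hypothetical query whose pattern ends with a wait operator.
// If the node holding the partially matched pattern stops before the
// 5-second wait elapses, this instance of the pattern is not matched.
query DetectFollowedBySilence {
    inputs {
        MyEvent() within 10.0;
    }
    find MyEvent: e -> wait(5.0) {
        // Reached only if the node stayed up for the full 5 seconds
        // after the matching MyEvent arrived.
        log "Pattern matched for " + e.toString() at INFO;
    }
}
```

On a healthy node the find block executes 5 seconds after a matching MyEvent; on a node that stops during that interval, the partial match is simply lost rather than being taken over by another node.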
Similarly, nodes running a BigMemory Terracotta Server Array may fail. For this reason, BigMemory should be configured with sufficient backups to ensure that no data is lost in this case. Consult the Terracotta documentation.