Failover
In a high-availability stripe, the failure of a single server causes only a small disruption to the cluster and to client operations, not an outright failure (for related information on high availability, see the section Cluster Architecture).
In the case of a failing passive server, there is no disruption at all experienced by the clients.
In the case of a failing active server, however, there is a small disruption of client progress until a new active server is elected and the client can reconnect to it. Failover is the name given to this scenario.
Client Reconnect Window
When a failover happens, the clients connected to the previous active server automatically switch to the new active server. However, these clients have a limited window of time, called the client reconnect window, in which to complete the failover (120 seconds by default). The new active server does not process any client requests until all previously known clients have reconnected or the window expires. As a result, all clients can stall even if only a single client fails or takes too long to fail over to the new active server.
If clients fail to connect back to the new active server within the reconnect window, the server considers them unreachable and continues processing requests from the connected clients. Clients that try to reconnect after the window has closed are rejected by the server and rejoin the cluster as new clients by establishing new connections.
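The rules above can be sketched as a small model. This is only an illustration of the described behavior, not Terracotta code: the class ReconnectWindow, its method names, and the use of an explicit elapsed-seconds argument are all hypothetical, and the sketch assumes that every reconnecting client is one of the previously known clients.

```python
from dataclasses import dataclass, field


@dataclass
class ReconnectWindow:
    """Toy model of a new active server waiting for its known clients.

    Hypothetical names throughout; this is not the Terracotta API.
    """
    known_clients: set
    window_seconds: float = 120.0  # default value stated in the text
    reconnected: set = field(default_factory=set)

    def reconnect(self, client_id: str, elapsed: float) -> bool:
        """Return True if the client is accepted as a reconnecting client.

        False means the window has expired: the client is rejected and
        would have to rejoin the cluster as a new client.
        """
        if elapsed >= self.window_seconds:
            return False
        self.reconnected.add(client_id)
        return True

    def processing_requests(self, elapsed: float) -> bool:
        """Request processing resumes once every known client is back,
        or once the window has expired, whichever happens first."""
        all_back = self.known_clients <= self.reconnected
        return all_back or elapsed >= self.window_seconds


window = ReconnectWindow(known_clients={"client-a", "client-b"}, window_seconds=60)
window.reconnect("client-a", elapsed=5)
# client-b has not reconnected yet, so every client is still stalled.
stalled = not window.processing_requests(elapsed=10)
```

In this model a single slow client (client-b) keeps processing_requests false for everyone until the 60 seconds have passed, which is the stall described above.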
This reconnect window can be configured in the Terracotta configuration file using the <client-reconnect-window> element. The following XML snippet shows how the client reconnect window can be changed to 60 seconds:
<tc-config>
  ...
  <servers>
    ...
    <client-reconnect-window>60</client-reconnect-window>
  </servers>
</tc-config>
Server-side implications
Once all clients have reconnected (or the reconnect window closes), the server processes every re-sent message that it had already seen but whose completion had not yet been reported to the client.
After this, message processing resumes as normal.
Client-side implications
Clients will experience a slight stall while they reconnect to the new active server. This reconnection process involves re-sending any messages the client considers to be in-flight.
After this, client operations resume as normal.
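As an illustration of what re-sending in-flight messages can look like, the following sketch tracks messages that have been sent but not yet acknowledged and replays them to a new connection. It is not Terracotta's client implementation: the class InFlightTracker, the integer message ids, and the transport callables are hypothetical, and how the real client identifies in-flight messages is not described here.

```python
class InFlightTracker:
    """Toy client-side bookkeeping for messages awaiting completion.

    Hypothetical names throughout; this is not the Terracotta API.
    """

    def __init__(self):
        self._next_id = 0
        self._in_flight = {}  # message id -> payload, in send order

    def send(self, payload, transport):
        """Record the message as in-flight, then hand it to the transport."""
        msg_id = self._next_id
        self._next_id += 1
        self._in_flight[msg_id] = payload
        transport(msg_id, payload)
        return msg_id

    def acknowledge(self, msg_id):
        """The server reported completion, so the message is no longer in-flight."""
        self._in_flight.pop(msg_id, None)

    def resend_in_flight(self, transport):
        """After reconnecting, replay every unacknowledged message in send order."""
        for msg_id, payload in list(self._in_flight.items()):
            transport(msg_id, payload)


old_server, new_server = [], []
tracker = InFlightTracker()
first = tracker.send("put k1", lambda i, p: old_server.append((i, p)))
second = tracker.send("put k2", lambda i, p: old_server.append((i, p)))
tracker.acknowledge(first)  # only the first message completed before the failure
tracker.resend_in_flight(lambda i, p: new_server.append((i, p)))
```

Only the unacknowledged message reaches new_server in this example; the acknowledged one is not replayed.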