Safe Cluster Shutdown and Restart Procedures

The Terracotta Server Array is designed to be crash tolerant. Even so, as with any distributed system that provides high availability (HA), it is important to consider the implications of shutting down and restarting servers: the sequence in which this is done, the effect on client applications, and the potential for data loss.

See the section Safe Cluster Shutdown and Restart Automation Guidelines for guidelines on automating the safe cluster shutdown and restart procedure.

The safest shutdown procedure

For the safest shutdown procedure, follow these steps:

1. Shut down the clients. The Terracotta client will shut down when you shut down your application.

2. Shut down the servers that are in 'standby' mode (passive servers) using the stop-tc-server script.

3. Shut down the active servers (there will be one per stripe) using the stop-tc-server script.

For information on using the stop-tc-server script, refer to the section Start and Stop Server Scripts (start-tc-server, stop-tc-server) in the Administration Guide.
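The ordering in the steps above can be expressed as a small planning function. The following is a minimal Python sketch and not part of the Terracotta tooling: the Server record, its state strings, and plan_shutdown are hypothetical names chosen for illustration. The sketch only computes the order in which servers would be stopped. It does not invoke the stop-tc-server script, and it assumes that the clients have already been shut down (step 1).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Server:
    name: str    # hypothetical server name
    stripe: str  # hypothetical stripe label
    state: str   # "active" or "passive" (illustrative values only)


def plan_shutdown(servers: List[Server]) -> List[str]:
    """Return server names in a safe stop order: all passives, then all actives."""
    passives = [s.name for s in servers if s.state == "passive"]
    actives = [s.name for s in servers if s.state == "active"]
    return passives + actives


if __name__ == "__main__":
    cluster = [
        Server("s1a", "stripe1", "active"),
        Server("s1b", "stripe1", "passive"),
        Server("s2a", "stripe2", "passive"),
        Server("s2b", "stripe2", "active"),
    ]
    # Each name would then be passed, in this order, to whatever stops a server.
    print(plan_shutdown(cluster))
```

Stopping every passive server before any active server means that no standby is promoted to active partway through the shutdown.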

The safest restart procedure

For the safest restart procedure, follow these steps:

1. For each stripe, first start the server that was last in 'active' state.

2. Wait for the server to reach 'active coordinator' state.

3. Start the other servers of each stripe.

If clustered data is not persisted (that is, the servers are not configured to be 'restartable'), any of the servers can be started first, because no data conflicts can take place.
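The restart rule can be sketched in the same way. The Python function below is illustrative only: plan_restart, the stripes and last_active mappings, and the restartable flag are assumed names, and the caller is assumed to have recorded which server in each stripe was last active. For each stripe it returns the server to start first and the servers to start once that server has reached 'active coordinator' state.

```python
from typing import Dict, List, Tuple


def plan_restart(
    stripes: Dict[str, List[str]],
    last_active: Dict[str, str],
    restartable: bool = True,
) -> Dict[str, Tuple[str, List[str]]]:
    """Map each stripe to (server to start first, servers to start afterwards)."""
    plan: Dict[str, Tuple[str, List[str]]] = {}
    for stripe, members in stripes.items():
        if restartable:
            # Persisted data: the previously active server must come up first.
            first = last_active[stripe]
            if first not in members:
                raise ValueError(f"{first!r} is not a member of {stripe!r}")
        else:
            # No persisted data: any member may be started first.
            first = members[0]
        others = [m for m in members if m != first]
        plan[stripe] = (first, others)
    return plan


if __name__ == "__main__":
    stripes = {"stripe1": ["s1a", "s1b"], "stripe2": ["s2a", "s2b"]}
    last_active = {"stripe1": "s1a", "stripe2": "s2b"}
    for stripe, (first, others) in plan_restart(stripes, last_active).items():
        print(f"{stripe}: start {first}, wait for active coordinator, then start {others}")
```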

Considerations and implications of not following the above procedure

Facts to understand:

Servers that are in "active" status have the "master" or "source of full truth" copy of data for the stripe they belong to. They also have state information about in-progress client transactions, and client-held locks.

Mirror servers (in "passive standby" state) have a "nearly" up to date copy of the data and state information. (Any information that they don't have is redundant between the active server and the client.)

If the active server fails (or is shut down purposely), not only does the standby server need to reach active state, but the clients also need to reconnect to it and complete their open transactions, or data may be lost.

A Terracotta Server Array (TSA), or "cluster" instance, has an identity, and each stripe within the TSA has a "stripe ID". To protect data integrity, running clients "fail over" only to servers whose IDs match those of the servers they were last connected to. If a cluster or stripe is completely "wiped" of data (by purposely clearing persisted data, or by having persistence disabled and stopping all stripe members at the same time), the stripe ID is reset.
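The identity check described in the last point can be illustrated with a toy model. This is a sketch under stated assumptions and not the actual client implementation: StripeIdentity, may_fail_over, and wipe are hypothetical names, and a random UUID stands in for whatever ID format the servers really use.

```python
import uuid
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class StripeIdentity:
    cluster_id: str  # illustrative stand-in for the cluster identity
    stripe_id: str   # illustrative stand-in for the stripe ID


def may_fail_over(last_connected: StripeIdentity, candidate: StripeIdentity) -> bool:
    """A client only fails over to a server whose IDs match the last connection."""
    return last_connected == candidate


def wipe(identity: StripeIdentity) -> StripeIdentity:
    """Model a wiped stripe: its data is gone, so it comes back with a new stripe ID."""
    return replace(identity, stripe_id=uuid.uuid4().hex)


if __name__ == "__main__":
    before = StripeIdentity(cluster_id="cluster-A", stripe_id="stripe-1")
    after = wipe(before)
    print(may_fail_over(before, before))  # same IDs: failover allowed
    print(may_fail_over(before, after))   # stripe ID was reset: failover refused
```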

What happens if clients are not shut down

If clients are not shut down:

Client applications will continue sending transactions (data writes) to the active server(s) as normal, right up until the active server is stopped. This may leave some successful transactions unacknowledged, or falsely reported as failed to the client, possibly resulting in some data loss.

If the Terracotta servers were not configured as 'restartable', or the servers' data was wiped before the servers were restarted, then clients will fail to reconnect and will need to be restarted anyway. The clients keep trying to reconnect to the previous cluster instance (to complete open transactions and resume normal activity), but the restarted servers have a new cluster ID that the clients do not accept, so a "stripe ID mismatch" error is reported.

What happens if the active server is shut down first

If the active server is shut down first:

Before shutting down any other servers, or restarting the server, ensure that you wait until any other servers in the stripe (that were in 'standby' status) have reached active coordinator state, and that any running clients have reconnected and re-sent their partially completed transactions. Otherwise there may be some data loss.
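When this wait is automated, it usually takes the form of a polling loop with a timeout. The helper below is a generic Python sketch: wait_until is a hypothetical name, and the condition passed to it is supplied by the caller. How the server state and the client reconnection status are actually obtained is outside the scope of this sketch and is not a Terracotta API shown here.

```python
import time
from typing import Callable


def wait_until(
    condition: Callable[[], bool],
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
) -> bool:
    """Poll condition() until it returns True or timeout_s elapses.

    Returns True if the condition was met, False if the wait timed out.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


if __name__ == "__main__":
    # Stand-in condition that becomes true on the third check. A real check
    # would report whether the former standby is now active coordinator and
    # whether the running clients have reconnected.
    checks = {"count": 0}

    def promoted_and_clients_back() -> bool:
        checks["count"] += 1
        return checks["count"] >= 3

    ok = wait_until(promoted_and_clients_back, timeout_s=5.0, interval_s=0.1)
    print("safe to continue" if ok else "timed out; do not stop further servers")
```

Returning False on timeout, instead of proceeding, leaves the decision to stop or restart further servers with the operator.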

What happens if a server other than the previously active server is restarted first

If a server other than the previously active server is restarted first:

The server will report that its database is 'dirty' and will fail to restart without intervention, because it knows that another server (the previously active one) has the "master" copy.