Common Causes of Failures in a Cluster

The most common causes of failures in a cluster are interruptions in the network and long Java GC cycles on particular nodes. Tuning the HealthChecker and reconnect features can reduce or eliminate the effects of these two problems. However, also consider the additional actions described in this section.

Sporadic disruptions in network connections between Terracotta servers (L2s), and between servers and clients (L1s), can be difficult to track down. Be sure to thoroughly test all network segments connecting the nodes in a cluster, and also test the network hardware. Check for speed, noise, reliability, and other applications that compete for bandwidth.

Other sources of failures in a cluster are disks that are nearly full or running slowly, and other applications on a node that compete for its resources.

Ensure that your application does not interrupt clustered threads. This is a common error that can cause the Terracotta client to shut down or go into an error state, after which it will have to be restarted.

The Terracotta client library runs with your application and is often involved in operations that your application is not necessarily aware of. If these operations are interrupted, the Terracotta client cannot anticipate it. In effect, interrupting clustered threads puts the client into a state that it cannot handle.
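One way to avoid interrupting a thread that performs clustered operations is to stop it cooperatively. The following is a minimal sketch, not a Terracotta API: the class CooperativeWorker and the method doClusteredWork() are hypothetical names, and the placeholder body stands in for whatever clustered calls your application makes. The worker checks a stop flag between units of work, so nothing needs to call Thread.interrupt() on it.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.LockSupport;

/** Hypothetical worker that is stopped through a flag rather than Thread.interrupt(). */
final class CooperativeWorker implements Runnable {
    private final AtomicBoolean stopRequested = new AtomicBoolean(false);
    private final AtomicLong completedUnits = new AtomicLong();

    /** Asks the worker to exit after it finishes its current unit of work. */
    void requestStop() {
        stopRequested.set(true);
    }

    long completedUnits() {
        return completedUnits.get();
    }

    @Override
    public void run() {
        // The flag is checked only between units of work, so a clustered
        // operation that is already in progress is never cut short.
        while (!stopRequested.get()) {
            doClusteredWork();
            completedUnits.incrementAndGet();
        }
    }

    /** Placeholder (assumption): replace with the application's clustered operations. */
    private void doClusteredWork() {
        LockSupport.parkNanos(1_000_000L); // simulate roughly 1 ms of work
    }

    /** Starts a worker, stops it through the flag, and reports whether it exited. */
    static boolean stopsWithoutInterrupt() {
        CooperativeWorker worker = new CooperativeWorker();
        Thread thread = new Thread(worker, "cooperative-worker");
        thread.start();
        try {
            Thread.sleep(50);      // let the worker run briefly
            worker.requestStop();  // no call to thread.interrupt()
            thread.join(2_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            worker.requestStop();
            return false;
        }
        return !thread.isAlive();
    }
}
```

In an application, a shutdown hook or management thread would call requestStop() and then join the worker thread. Note that ExecutorService.shutdownNow() and Future.cancel(true) typically interrupt running threads as well, so treat them with the same care.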

If clients disconnect on a regular basis, try the following to diagnose the cause:

Analyze the Terracotta client logs for potential issues, such as long GC cycles.

Analyze the Terracotta server logs for disconnection information and any rejections of reconnection attempts by the client.

Check the operator events panel in the Terracotta Management Console for disconnection events, and note the reason given for each.

If the disconnections are due to long GC cycles or inconsistent network connections in the client, consider the remedies suggested in this section. If disconnections continue to happen, consider configuring caches with nonstop behavior and enabling rejoin.
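To help confirm whether long GC cycles are a factor, you can also read the JVM's own collector statistics from inside the client application. The following is a rough sketch that uses only the standard java.lang.management API; the class name GcSummary is an assumption for illustration and is not part of Terracotta. The reported time is the accumulated collection time per collector, which for concurrent collectors is not the same as pause time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper that summarizes GC activity for the current JVM. */
final class GcSummary {

    /** Average milliseconds per collection, or 0 if no data is available. */
    static double averageMillis(long collectionCount, long totalMillis) {
        // The MXBean returns -1 when a value is undefined for a collector.
        if (collectionCount <= 0 || totalMillis < 0) {
            return 0.0;
        }
        return (double) totalMillis / collectionCount;
    }

    /** Builds one summary line per garbage collector known to the JVM. */
    static String describe() {
        StringBuilder out = new StringBuilder();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            long millis = gc.getCollectionTime();
            out.append(String.format("%s: collections=%d, totalMs=%d, avgMs=%.1f%n",
                    gc.getName(), count, millis, averageMillis(count, millis)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe());
    }
}
```

Logging this summary periodically and comparing it with the timestamps of disconnection events can show whether the two coincide. For per-pause detail, JVM GC logging is the more precise tool.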

Terracotta server and client logs contain messages that help you track memory usage. Locations of server and client logs are configured in the Terracotta configuration file, tc-config.xml.

You can view the state of memory usage in a node by finding messages similar to the following:

2011-12-04 14:47:43,341 [Statistics Logger] ... memory free : 39.992699 MB
2011-12-04 14:47:43,341 [Statistics Logger] ... memory used : 1560.007301 MB
2011-12-04 14:47:43,341 [Statistics Logger] ... memory max : 1600.000000 MB

In this example, less than 40 MB of the 1600 MB maximum is free (about 97.5% is in use), which can indicate that the node is running low on memory.
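If you want to check these values automatically, the messages can be parsed with a regular expression. The following is a minimal sketch that assumes the lines end in the "memory free|used|max : <number> MB" form shown above; the class name MemoryLogCheck and the threshold passed to isLow() are assumptions for illustration, not Terracotta settings.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that reads Statistics Logger memory lines. */
final class MemoryLogCheck {
    // Matches, for example, "memory used : 1560.007301 MB" anywhere in a line.
    private static final Pattern MEMORY_LINE =
            Pattern.compile("memory (free|used|max)\\s*:\\s*(\\d+(?:\\.\\d+)?) MB");

    /** Returns the latest value in MB for each of "free", "used", and "max" found. */
    static Map<String, Double> parse(List<String> lines) {
        Map<String, Double> values = new HashMap<>();
        for (String line : lines) {
            Matcher m = MEMORY_LINE.matcher(line);
            if (m.find()) {
                values.put(m.group(1), Double.parseDouble(m.group(2)));
            }
        }
        return values;
    }

    /** Fraction of the maximum memory that is in use, from 0.0 to 1.0. */
    static double usedFraction(List<String> lines) {
        Map<String, Double> values = parse(lines);
        Double used = values.get("used");
        Double max = values.get("max");
        if (used == null || max == null || max <= 0) {
            throw new IllegalArgumentException("no 'memory used' and 'memory max' lines found");
        }
        return used / max;
    }

    /** True if usage is at or above the given threshold (an assumed, tunable value). */
    static boolean isLow(List<String> lines, double threshold) {
        return usedFraction(lines) >= threshold;
    }
}
```

Applied to the three sample lines above, usedFraction() returns approximately 0.975, so a threshold of 0.9 would flag the node.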

The Terracotta Server Array (TSA) can be configured to be restartable and to include searchable caches, but both of these features require disk storage. When both are enabled, be sure that enough disk space is available. Depending on the number of searchable attributes, the amount of disk storage required can be up to 1.5 times the amount of in-memory data.
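The sizing guideline above can be turned into a simple check. The following is a sketch only: the class name DiskHeadroom and the choice of directory are assumptions, and the 1.5 factor is the upper bound stated above rather than an exact requirement. It uses the standard File.getUsableSpace() method to compare available disk space with the worst-case estimate.

```java
import java.io.File;

/** Hypothetical helper that applies the "up to 1.5 times in-memory data" guideline. */
final class DiskHeadroom {
    /** Upper bound on the disk-to-memory ratio, taken from the guideline above. */
    static final double MAX_DISK_TO_MEMORY_RATIO = 1.5;

    /** Worst-case disk space in bytes for the given amount of in-memory data. */
    static long worstCaseDiskBytes(long inMemoryBytes) {
        return (long) Math.ceil(inMemoryBytes * MAX_DISK_TO_MEMORY_RATIO);
    }

    /** True if the partition holding dataDir has room for the worst case. */
    static boolean hasRoom(File dataDir, long inMemoryBytes) {
        return dataDir.getUsableSpace() >= worstCaseDiskBytes(inMemoryBytes);
    }
}
```

For example, 40 GB of in-memory data would call for up to 60 GB of free disk space on the server's data partition.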