BigMemory 4.3.10 | FAQ | Troubleshooting Questions
 
Troubleshooting Questions
After my application interrupted a thread (or threw InterruptedException), why did the Terracotta client die?
The Terracotta client library runs with your application and is often involved in operations of which your application is not necessarily aware. These operations can be interrupted along with your application's threads, which the Terracotta client cannot anticipate. Ensure that your application does not interrupt clustered threads. This is a common error that can cause the Terracotta client to shut down or go into an error state, after which it must be restarted.
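For example, instead of calling Thread.interrupt() on a thread that performs clustered operations, signal it to stop and let it exit between operations. The sketch below is illustrative only; the CacheWorker class, its stop flag, and the generated keys and values are hypothetical and not part of the Terracotta or Ehcache API:

// Hypothetical worker that performs clustered cache operations.
// It is stopped cooperatively rather than with Thread.interrupt(),
// so it is never interrupted in the middle of a clustered call.
public class CacheWorker implements Runnable {
    private final net.sf.ehcache.Cache cache;
    private volatile boolean stopRequested = false;

    public CacheWorker(net.sf.ehcache.Cache cache) {
        this.cache = cache;
    }

    public void requestStop() {
        stopRequested = true;  // ask the thread to finish; do not interrupt it
    }

    public void run() {
        long i = 0;
        while (!stopRequested) {
            // The clustered operation completes normally before the loop re-checks the flag.
            cache.put(new net.sf.ehcache.Element("key-" + i, "value-" + i));
            i++;
        }
    }
}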
Why does the cluster seem to be running more slowly?
There can be many reasons for a cluster that was performing well to slow down over time. The most common reason for slowdowns is Java Garbage Collection (GC) cycles. Another reason may be that near-memory-full conditions have been reached, and the TSA needs to clear space for continued operations by additional evictions. For more information, see Managing Near-Memory-Full Conditions.
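To confirm whether GC pauses are contributing to a slowdown, enable verbose GC logging on the affected JVMs, or sample the standard JMX garbage-collection beans. The following is a minimal, illustrative sketch that uses only the standard java.lang.management API; it is a diagnostic aid, not part of the Terracotta tooling:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcReport {
    public static void main(String[] args) {
        // Print the cumulative collection count and time for each collector in this JVM.
        // Rapidly growing totals on a client or server node point to GC as a likely cause of slowdowns.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", totalTimeMs=" + gc.getCollectionTime());
        }
    }
}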
Another possible cause is an active server syncing with a mirror server. If the active server is under substantial load, it may be slowed by the syncing process. In addition, the syncing process itself may appear to slow down. This can happen when the mirror is waiting for specific sequenced data before it can proceed, indicated by log messages similar to the following:
WARN com.tc.l2.ha.L2HACoordinator - 10 messages in pending queue.
Message with ID 2273677 is missing still
If the message ID in these log entries changes over time, the warnings do not indicate a problem.
Another indication that slowdowns are occurring on the server and that clients are throttling their transaction commits is the appearance of the following entry in client logs:
INFO com.tc.object.tx.RemoteTransactionManagerImpl - ClientID[2]:
TransactionID=[65037] : Took more than 1000ms to add to sequencer : 1497 ms
Why do all of my objects disappear when I restart the server?
If you are not running the server in restartable mode, the server will remove the object data when it restarts. If you want object data to persist across server restarts, run the server in restartable mode. For information, see Configuring Fast Restart (FRS).
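For reference, restartable mode is enabled with the <restartable> element in the <servers> section of the Terracotta configuration file (tc-config.xml). The snippet below is a minimal sketch showing only the elements relevant to restartability; the host name, server name, port, and data path are placeholders, and a real configuration will contain additional elements:

<servers>
  <server host="server.example.com" name="Server1">
    <data>/path/to/server-data</data>
    <tsa-port>9510</tsa-port>
  </server>
  <restartable enabled="true"/>
</servers>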
Why are old objects still there when I restart the server?
If you are running the server in restartable mode, the server keeps the object data across restarts. If you want objects to disappear when you restart the server, you can either disable restartable mode or remove the data files from disk before you restart the server. See the FAQ question "How do I enable restartable mode?"
Why can't certain nodes on my Terracotta cluster see each other on the network?
A firewall may be preventing different nodes on a cluster from seeing each other. If Terracotta clients attempt to connect to a Terracotta server, for example, but the server seems to not have any knowledge of these attempts, the clients may be blocked by a firewall. Another example is a backup Terracotta server that comes up as the active server because it is separated from the active server by a firewall.
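One quick check from a client machine is to attempt a plain TCP connection to the server's listening port (the tsa-port, 9510 by default unless changed in your configuration). The sketch below is an illustrative diagnostic written in plain Java, not part of the Terracotta tooling:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortProbe {
    public static void main(String[] args) throws IOException {
        String host = args.length > 0 ? args[0] : "terracotta-server";  // placeholder host name
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 9510;  // default tsa-port
        Socket socket = new Socket();
        try {
            // A firewall that silently drops packets typically causes a timeout here;
            // an immediate "connection refused" means the host was reached but nothing is listening on the port.
            socket.connect(new InetSocketAddress(host, port), 5000);
            System.out.println("Connected to " + host + ":" + port);
        } finally {
            socket.close();
        }
    }
}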
Client and/or server nodes are exiting regularly without reason.
Client or server processes that quit ("L1 Exiting" or "L2 Exiting" in the logs) for no apparent reason may have been running in a terminal session that was terminated. The parent process must be maintained for the life of the node process, or the node process must be detached from the terminal session, for example by starting it with nohup.
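For example, on Linux or UNIX a server can be started so that it is not tied to the terminal session. The command below is a sketch; the installation path, configuration file, and server name are placeholders, and the options shown (-f for the configuration file, -n for the server name) are assumed to match your installation's start script:

nohup /path/to/bigmemory/server/bin/start-tc-server.sh -f /path/to/tc-config.xml -n Server1 &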
I have a setup with one active Terracotta server instance and a number of standbys, so why am I getting errors because more than one active server comes up?
Due to network latency or load, the Terracotta server instances may not have adequate time to hold an election. Increase the <election-time> property in the Terracotta configuration file to the lowest value that resolves the issue.
If you are running on Ubuntu, see the note at the end of the UnknownHostException topic in the section Specific Errors and Warnings.
I have a cluster with more than one stripe (more than one active Terracotta server), but data is distributed very unevenly across the stripes.
The Terracotta Server Array distributes data based on the hashcode of keys. To enhance performance, each server stripe should contain approximately the same amount of data. A grossly uneven distribution of data on Terracotta servers in a cluster with more than one active server can be an indication that keys are not being hashed well. If your application is creating keys of a type that does not hash well, this may be the cause of the uneven distribution.
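Distribution can usually be improved by giving key classes a hashCode() (and matching equals()) that spreads distinct keys across the hash space instead of clustering them. The sketch below shows a hypothetical composite key class; the class and field names are illustrative only:

public final class OrderKey {
    private final String region;
    private final long orderId;

    public OrderKey(String region, long orderId) {
        this.region = region;
        this.orderId = orderId;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof OrderKey)) return false;
        OrderKey other = (OrderKey) o;
        return orderId == other.orderId && region.equals(other.region);
    }

    @Override
    public int hashCode() {
        // Combine all fields so that distinct keys spread across hash buckets,
        // rather than hashing on a single low-variance field.
        int result = region.hashCode();
        result = 31 * result + (int) (orderId ^ (orderId >>> 32));
        return result;
    }
}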
Why is a crashed Terracotta server instance failing to come up when I restart it?
If running in restartable mode, the ACTIVE Terracotta server instance should come up with all shared data intact. However, if the server's database has somehow become corrupt, you must clear the crashed server's data directory before restarting.
I lost some data after my entire cluster lost power and went down. How can I ensure that all data persists through a failure?
If only some data was lost, then the Terracotta servers were configured to persist data, and the most likely cause of the partial loss is disk "write" caching on the machines running the Terracotta server instances. If every Terracotta server instance lost power when the cluster went down, any data still held in each machine's disk cache was lost.
Turning off disk caching is not an optimal solution because the machines running Terracotta server instances will suffer a substantial performance degradation. A better solution is to ensure that power is never interrupted to every Terracotta server instance in the cluster at the same time. This can be achieved through techniques such as using uninterruptible power supplies and geographically subdividing cluster members.
Do I have to restart Terracotta clients after redeploying in a container?
Errors could occur if a client runs with a web application that has been redeployed, preventing the client from starting properly or from starting at all. If the web application is redeployed, be sure to restart the client.
Why does the JVM on my SPARC machines crash regularly?
You may be encountering a known issue with the Hotspot JVM for SPARC. The problem is expected to occur with Hotspot 1.6.0_08 and higher, but may have been fixed in a later version. For more information, see this bug report.