Command |
attach a node to a stripe |
Symptom | The following message is returned: Source node: <node_name> cannot be attached since it is part of an existing cluster with name: <cluster_name> |
Diagnosis | The source node is active and already belongs to an existing cluster that is different from the one to which it is being attached. |
Action | 1. Detach the source node from its existing source stripe. 2. Re-run the original attach command which generated this error. |
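A minimal sketch of this two-step sequence, reusing the detach and attach flag forms shown elsewhere in this guide (the <source_stripe:port> placeholder is assumed; substitute the address of the node's current stripe):

# Step 1: detach the node from the stripe it currently belongs to
config-tool.sh detach -from-stripe <source_stripe:port> -node <source_node:port>
# Step 2: re-run the attach against the intended destination stripe
config-tool.sh attach -to-stripe <destination_stripe:port> -node <source_node:port>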
Command |
attach a node to a stripe |
Symptom | The following message is returned: Source node: <node_name> is part of a stripe containing more than 1 nodes. It must be detached first before being attached to a new stripe. Please refer to the Troubleshooting Guide for more help. |
Diagnosis | The source node already belongs to a multi-node cluster that is different from the one to which it is being attached. |
Action | Option A: 1. Detach the node that is to be attached to the destination stripe from its existing source stripe. 2. Re-run the original attach command which generated this error. Option B: Re-run the original attach command which generated this error, but include the -force option. For example: config-tool.sh attach -to-stripe <destination_stripe:port> -node <source_node:port> -force |
Command |
attach a stripe to a cluster |
Symptom | The following message is returned: Source stripe from node: <node_name> is part of a cluster containing more than 1 stripes. It must be detached first before being attached to a new cluster. Please refer to the Troubleshooting Guide for more help. |
Diagnosis | The source stripe already belongs to a multi-stripe cluster that is different from the one to which it is being attached. |
Action | Option A: 1. Detach the stripe that is to be attached to the destination cluster from its existing source cluster. 2. Re-run the original attach command which generated this error. Option B: Re-run the original attach command which generated this error, but include the -force option. For example: config-tool.sh attach -to-cluster <destination_cluster:port> -stripe <source_stripe:port> -force |
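A minimal sketch of Option A, reusing the detach -from-cluster and attach -to-cluster flag forms shown in this guide (the <source_cluster:port> placeholder is assumed):

# Step 1: detach the stripe from the cluster it currently belongs to
config-tool.sh detach -from-cluster <source_cluster:port> -stripe <source_stripe:port>
# Step 2: re-run the attach against the intended destination cluster
config-tool.sh attach -to-cluster <destination_cluster:port> -stripe <source_stripe:port>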
Command |
attach a node to a stripe or a stripe to a cluster |
Symptom | The following message is returned: Impossible to do any topology change. Node: <node_endpoint> is waiting to be restarted to apply some pending changes. Please refer to the Troubleshooting Guide for more help. |
Diagnosis | One or more nodes belonging to the destination cluster have pending changes that require a restart. Ideally, topology changes should only be performed on clusters where the nodes have no pending updates. |
Action | Option A: 1. Restart the node identified by <node_endpoint>. 2. Re-run the original attach command which generated this error. Option B: Re-run the original attach command which generated this error, but include the -force option. For example: config-tool.sh attach -to-cluster <destination_cluster:port> -stripe <source_stripe:port> -force |
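A minimal sketch of Option A for the stripe case; the restart step depends on how your servers are managed and is shown here only as a comment, not a Config Tool command:

# Step 1: restart the node identified by <node_endpoint> using your
# installation's usual server start procedure (not part of the Config Tool)
# Step 2: re-run the original attach command, e.g.:
config-tool.sh attach -to-cluster <destination_cluster:port> -stripe <source_stripe:port>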
Command |
attach a node to a stripe or a stripe to a cluster |
Symptom | The following message is returned: An error occurred during the attach transaction. The node/stripe information may still be added to the destination cluster: you will need to run the diagnostic / export command to check the state of the transaction. The node/stripe to attach won't be activated and restarted, and their topology will be rolled back to their initial value. |
Diagnosis | The transaction applying the new topology has failed (the reason is detailed in the logs). It can be caused by an environmental problem (such as a network issue or a node shutdown) or by a concurrent transaction. If the failure occurred during the commit phase (partial commit), some nodes may need to be repaired. |
Action | An 'auto-rollback' will be attempted by the system. Examine the output to determine whether the auto-rollback was successful. If it was not, run the diagnostic command. |
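A minimal sketch of checking the transaction state, assuming the diagnostic command accepts the same -connect-to flag as the repair command shown later in this guide:

# Inspect the configuration state of the cluster after the failed attach
config-tool.sh diagnostic -connect-to <host:port>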
Command |
detach a node from a stripe or a stripe from a cluster |
Symptom | The following message is returned: Impossible to do any topology change. Node: <node_name> is waiting to be restarted to apply some pending changes. Please refer to the Troubleshooting Guide for more help. |
Diagnosis | One or more nodes belonging to the destination cluster have pending changes that require a restart. Ideally, topology changes should only be performed on clusters where the nodes have no pending updates. |
Action | Option A: 1. Restart the node identified by <node_name>. 2. Re-run the original detach command which generated this error. Option B: Re-run the original detach command which generated this error, but include the -force option. For example: config-tool.sh detach -from-cluster <destination_cluster:port> -stripe <source_stripe:port> -force |
Command |
detach a node from a stripe |
Symptom | The following message is returned: Nodes to be detached: <node_names> are online. Nodes must be safely shutdown first. Please refer to the Troubleshooting Guide for more help. |
Diagnosis | Ideally, nodes should only be detached when they are not running. Note that when detaching a stripe, the system automatically stops all detached nodes, but for node detachments this must be done manually. |
Action | Option A: 1. Manually stop the node(s) identified by <node_names>. 2. Re-run the original detach command which generated this error. Option B: Re-run the original detach command which generated this error, but include the -force option. For example: config-tool.sh detach -from-stripe <destination_stripe:port> -node <source_node:port> -force |
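A minimal sketch of Option A; the shutdown step depends on how your servers are managed and is shown here only as a comment, not a Config Tool command:

# Step 1: safely shut down each node listed in <node_names> using your
# installation's usual server stop procedure (not part of the Config Tool)
# Step 2: re-run the original detach command, e.g.:
config-tool.sh detach -from-stripe <destination_stripe:port> -node <source_node:port>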
Command | |
Symptom | The following message is returned: IMPORTANT: The sum (<x>) of voter count (<y>) and number of nodes (<z>) in stripe <stripe_name> is an even number. An even-numbered configuration is more likely to experience split-brain situations. |
Diagnosis | Even-numbered counts of voters plus nodes for a given stripe can increase the chances of experiencing split-brain situations. |
Action | Consider making the total count for the stripe an odd number by adding a voter. |
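For example, a stripe with 2 nodes and 2 voters gives a total of 4, an even number at risk of a tie; adding one voter brings the total to 5, an odd number from which a majority can always be formed.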
Command | |
Symptom | The following message is returned: Some nodes may have failed to restart within <wait_time> seconds. This should be confirmed by examining the state of the nodes listed below. Note: if the cluster did not have security configured before activation but has security configured post-activation, or vice-versa, then the nodes may have in fact successfully restarted. This should be confirmed. Nodes: <node_name_list> |
Diagnosis | Some mutative commands restart the nodes and then wait for them to come back online. This error message is displayed when the Config Tool did not see a node come back online within the delay given by the Config Tool parameter -restart-wait-time. Make sure the value is not too low. |
Action | Execute the following steps: 1. Execute the diagnostic command for all the nodes that have failed to restart. 2. Examine the Node state value (refer to node states for more information about the different node states): a. If one of ACTIVE, ACTIVE_RECONNECTING, PASSIVE: the node has restarted correctly; the -restart-wait-time value used with the Config Tool was not high enough. b. If one of ACTIVE_SUSPENDED, PASSIVE_SUSPENDED: the node startup is blocked because the vote count is not sufficient to reach the desired level of consistency. c. If one of STARTING, SYNCHRONIZING: the node is still starting; just wait. d. If one of DIAGNOSTIC, UNREACHABLE: the node was unable to start, or has been started in diagnostic mode. Look at the logs for any errors and seek support if necessary. |
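A minimal sketch of this follow-up, assuming the diagnostic command accepts the -connect-to flag used by the repair command, and that -restart-wait-time takes a value in seconds (both are assumptions; this guide does not show their exact syntax):

# Step 1: check the state of each node that failed to restart
config-tool.sh diagnostic -connect-to <failed_node_host:port>
# If the node state shows it actually restarted (case a), re-run the
# original mutative command with a larger timeout, e.g.:
config-tool.sh <original_command> -restart-wait-time 180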
Command | |
Symptom | The following message is returned: Please run the 'diagnostic' command to diagnose the configuration state and try to run the 'repair' command. Please refer to the Troubleshooting Guide for more help. |
Diagnosis | An inconsistency has been found in the cluster configuration and the operation cannot continue without a manual intervention or repair. |
Action | Execute the following steps: 1. Execute the diagnostic command on the cluster. 2. Read the 'Configuration state' message block near the top of the output. 3. Find the message in Diagnosing Unexpected Errors to understand the underlying problem and how to address it. |
Symptom | The Configuration state of the diagnostic command output contains: Failed to analyze cluster configuration. |
Diagnosis | The discovery process has failed, possibly because another client is currently performing a mutative operation. This situation requires the command to be retried. |
Action | Run the command again. |
Symptom | The Configuration state message block of the diagnostic command output contains this message: Cluster configuration is inconsistent: Change <change_uuid> is committed on <committed_nodes_list> and rolled back on <rolled_back_nodes_list>. |
Diagnosis | Certain changes were found that were committed on some servers and rolled back on other servers. This situation requires a manual intervention, possibly by resetting the node and then re-syncing it after a restricted activation. |
Action | The repair of such a broken configuration state requires rewriting the configuration of certain nodes, which will make them temporarily unavailable. To repair such issues, the nodes requiring a reset (nodes that have rolled back) and the nodes requiring a reconfiguration (nodes that have committed the change) must be identified. There is no right or wrong answer, as it depends on the specific case at hand and the user's knowledge of which command(s) were issued. If the nodes that committed the change have started satisfying requests that depend on it (e.g. an offheap addition), then the change needs to be forced on the rolled-back nodes, and it must be ensured that those nodes can accept the change (e.g. enough offheap exists). Conversely, if it is known that a committed change has not been used, it can be safely removed; in this case, consider keeping the rolled-back nodes and resetting the committed ones. |
Symptom | The Configuration state of the diagnostic command output contains: Cluster configuration is partitioned and cannot be automatically repaired. Some nodes have a different configuration that others. |
Diagnosis | Some nodes were found whose configuration ends with a different change UUID, leading to different configuration results. Some nodes are running with one configuration, while other nodes are running with a different one. This situation requires a manual intervention, possibly by resetting the nodes and re-syncing them after a restricted activation. |
Action | This requires a manual intervention analogous to the previously discussed 'Action', i.e. resetting the configuration of certain nodes. See Repairing a Broken Configuration. |
Symptom | The Configuration state message block of the diagnostic command output contains this message: A new cluster configuration has been prepared on all nodes but not yet committed. No further configuration change can be done until the 'repair' command is run to finalize the configuration change. |
Diagnosis | All nodes are online and all online nodes have prepared a new change. This situation requires a commit to be replayed, or a rollback to be forced. |
Action | Execute this command: config-tool.sh repair -connect-to <host:port> |
Symptom | The Configuration state of the diagnostic command output contains: A new cluster configuration has been prepared but not yet committed or rolled back on online nodes. Some nodes are unreachable, so we do not know if the last configuration change has been committed or rolled back on them. No further configuration change can be done until the offline nodes are restarted and the 'repair' command is run again to finalize the configuration change. Please refer to the Troubleshooting Guide if needed. |
Diagnosis | Some nodes (but not all) are online, and all online nodes have prepared a new change. Because some nodes are down, we do not know whether the offline nodes have more changes in their append.log. This situation requires a commit or a rollback to be forced (only the user knows which). |
Action | Because some of the nodes are down, the Config Tool cannot determine whether the change process should be continued and committed, or rolled back. Only the user knows which action is required, and must therefore provide the necessary hint to the Config Tool to either force a commit or force a rollback: 1) config-tool.sh repair -connect-to <host:port> -force commit 2) config-tool.sh repair -connect-to <host:port> -force rollback |
Symptom | The Configuration state of the diagnostic command output contains one of these messages: A new cluster configuration has been partially prepared (some nodes didn't get the new change). No further configuration change can be done until the 'repair' command is run to rollback the prepared nodes. Or: A new cluster configuration has been partially rolled back (some nodes didn't rollback). No further configuration change can be done until the 'repair' command is run to rollback all nodes. |
Diagnosis | A specific change has been prepared on some nodes, while other nodes, which didn't receive that change, end with a different change. This can happen if a transaction was interrupted during its prepare phase, when the client asks the nodes to prepare themselves. This situation requires the rollback to be replayed. |
Action | Execute this command: config-tool.sh repair -connect-to <host:port> |
Symptom | The Configuration state of the diagnostic command output contains: A new cluster configuration has been partially committed (some nodes didn't commit). No further configuration change can be done until the 'repair' command is run to commit all nodes. |
Diagnosis | A change has been prepared, then committed, but the commit process didn't complete on all online nodes. This situation requires a commit to be replayed. |
Action | Execute this command: config-tool.sh repair -connect-to <host:port> |
Symptom | The Configuration state of the diagnostic command output contains: Unable to determine the global configuration state. There might be some configuration inconsistencies. Please look at each node details. A manual intervention might be needed to reset some nodes. |
Diagnosis | Unable to determine the configuration state of the cluster. |
Action | The user might need to reset the configuration of some nodes. See Repairing a Broken Configuration. To be able to determine which nodes to reset and how, some additional support is required: the user has to send all the server logs and configuration directories to the support team. |
Symptom | One of the following messages is observed when executing the repair command: Failed to analyze cluster configuration. Cluster configuration is inconsistent: Change <change_uuid> is committed on <committed_nodes_list> and rolled back on <rolled_back_nodes_list>. Cluster configuration is partitioned and cannot be automatically repaired. Some nodes have a different configuration that others. Unable to determine the global configuration state. There might be some configuration inconsistencies. Please look at each node details. A manual intervention might be needed to reset some nodes |
Diagnosis | Refer to the same message in the Diagnostic Command Troubleshooting section. |
Action | Refer to the same message in the Diagnostic Command Troubleshooting section. |
Symptom | One of the following messages is observed when executing the repair command: The configuration is partially prepared. A rollback is needed. The configuration is partially rolled back. A rollback is needed. |
Diagnosis | The repair tool has detected that a rollback is necessary, but the user specified the wrong action. |
Action | Execute one of these commands: config-tool.sh repair -connect-to <host:port> config-tool.sh repair -connect-to <host:port> -force rollback |
Symptom | The following message is observed when executing the repair command: The configuration is partially committed. A commit is needed. |
Diagnosis | The repair tool has detected that a commit is necessary, but the user specified the wrong action. |
Action | Execute one of these commands: config-tool.sh repair -connect-to <host:port> config-tool.sh repair -connect-to <host:port> -force commit |
Symptom | The following message is observed when executing the repair command: Some nodes are offline. Unable to determine what kind of repair to run. Please refer to the Troubleshooting Guide. |
Diagnosis | The repair command is unable to determine whether it needs to complete an incomplete change by committing, or needs to roll back, because some nodes are down. It is up to the user to hint the repair command about what to do. |
Action | Execute one of these commands: config-tool.sh repair -connect-to <host:port> -force commit config-tool.sh repair -connect-to <host:port> -force rollback |