Command |
attach a node to a stripe |
Symptom | The following message is returned: Source node: <node_name> cannot be attached since it is part of an existing cluster with name: <cluster_name> |
Diagnosis | The source node is active and already belongs to an existing cluster that is different from the one to which it is being attached. |
Action | 1. Detach the source node from its existing source stripe. 2. Re-run the original attach command which generated this error. |
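A minimal sketch of this two-step sequence, reusing the detach and attach flag forms shown elsewhere in this guide (the <source_stripe:port> placeholder is assumed; substitute the address of the node's current stripe):

# Step 1: detach the node from the stripe it currently belongs to
config-tool.sh detach -from-stripe <source_stripe:port> -node <source_node:port>
# Step 2: re-run the attach against the intended destination stripe
config-tool.sh attach -to-stripe <destination_stripe:port> -node <source_node:port>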
Command |
attach a node to a stripe |
Symptom | The following message is returned: Source node: <node_name> is part of a stripe containing more than 1 nodes. It must be detached first before being attached to a new stripe. Please refer to the Troubleshooting Guide for more help. |
Diagnosis | The source node already belongs to a multi-node cluster that is different from the one to which it is being attached. |
Action | Option A: 1. Detach the node that is to be attached to the destination stripe from its existing source stripe. 2. Re-run the original attach command which generated this error. Option B: Re-run the original attach command which generated this error, but include the -force option. For example: config-tool.sh attach -to-stripe <destination_stripe:port> -node <source_node:port> -force |
Command |
attach a stripe to a cluster |
Symptom | The following message is returned: Source stripe from node: <node_name> is part of a cluster containing more than 1 stripes. It must be detached first before being attached to a new cluster. Please refer to the Troubleshooting Guide for more help. |
Diagnosis | The source stripe already belongs to a multi-stripe cluster that is different from the one to which it is being attached. |
Action | Option A: 1. Detach the stripe that is to be attached to the destination cluster from its existing source cluster. 2. Re-run the original attach command which generated this error. Option B: Re-run the original attach command which generated this error, but include the -force option. For example: config-tool.sh attach -to-cluster <destination_cluster:port> -stripe <source_stripe:port> -force |
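A minimal sketch of Option A, reusing the detach -from-cluster and attach -to-cluster flag forms shown in this guide (the <source_cluster:port> placeholder is assumed):

# Step 1: detach the stripe from the cluster it currently belongs to
config-tool.sh detach -from-cluster <source_cluster:port> -stripe <source_stripe:port>
# Step 2: re-run the attach against the intended destination cluster
config-tool.sh attach -to-cluster <destination_cluster:port> -stripe <source_stripe:port>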
Command |
attach a node to a stripe or a stripe to a cluster |
Symptom | The following message is returned: Impossible to do any topology change. Node: <node_endpoint> is waiting to be restarted to apply some pending changes. Please refer to the Troubleshooting Guide for more help. |
Diagnosis | One or more nodes belonging to the destination cluster have pending changes that require a restart. Ideally, topology changes should only be performed on clusters where the nodes have no pending updates. |
Action | Option A: 1. Restart the node identified by <node_endpoint>. 2. Re-run the original attach command which generated this error. Option B: Re-run the original attach command which generated this error, but include the -force option. For example: config-tool.sh attach -to-cluster <destination_cluster:port> -stripe <source_stripe:port> -force |
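A minimal sketch of Option A for the stripe case; the restart step depends on how your servers are managed and is shown here only as a comment, not a Config Tool command:

# Step 1: restart the node identified by <node_endpoint> using your
# installation's usual server start procedure (not part of the Config Tool)
# Step 2: re-run the original attach command, e.g.:
config-tool.sh attach -to-cluster <destination_cluster:port> -stripe <source_stripe:port>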
Command |
attach a node to a stripe or a stripe to a cluster |
Symptom | The following message is returned: An error occurred during the attach transaction. The node/stripe information may still be added to the destination cluster: you will need to run the diagnostic / export command to check the state of the transaction. The node/stripe to attach won't be activated and restarted, and their topology will be rolled back to their initial value. |
Diagnosis | The transaction applying the new topology has failed (the reason is detailed in the logs). It can be caused by an environmental problem (such as a network issue or a node shutdown) or by a concurrent transaction. If the failure occurred during the commit phase (partial commit), some nodes may need to be repaired. |
Action | An 'auto-rollback' will be attempted by the system. Examine the output to determine whether the auto-rollback was successful. If it was not, run the diagnostic command. |
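A minimal sketch of checking the transaction state, assuming the diagnostic command accepts the same -connect-to flag as the repair command shown later in this guide:

# Inspect the configuration state of the cluster after the failed attach
config-tool.sh diagnostic -connect-to <host:port>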
Command |
detach a node from a stripe or a stripe from a cluster |
Symptom | The following message is returned: Impossible to do any topology change. Node: <node_name> is waiting to be restarted to apply some pending changes. Please refer to the Troubleshooting Guide for more help. |
Diagnosis | One or more nodes belonging to the destination cluster have pending changes that require a restart. Ideally, topology changes should only be performed on clusters where the nodes have no pending updates. |
Action | Option A: 1. Restart the node identified by <node_name>. 2. Re-run the original detach command which generated this error. Option B: Re-run the original detach command which generated this error, but include the -force option. For example: config-tool.sh detach -from-cluster <destination_cluster:port> -stripe <source_stripe:port> -force |
Command |
detach a node from a stripe |
Symptom | The following message is returned: Nodes to be detached: <node_names> are online. Nodes must be safely shutdown first. Please refer to the Troubleshooting Guide for more help. |
Diagnosis | Ideally, nodes should only be detached when they are not running. Note that when detaching a stripe, the system automatically stops all detached nodes, but for node detachments this must be done manually. |
Action | Option A: 1. Manually stop the node(s) identified by <node_names>. 2. Re-run the original detach command which generated this error. Option B: Re-run the original detach command which generated this error, but include the -force option. For example: config-tool.sh detach -from-stripe <destination_stripe:port> -node <source_node:port> -force |
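A minimal sketch of Option A; the shutdown step depends on how your servers are managed and is shown here only as a comment, not a Config Tool command:

# Step 1: safely shut down each node listed in <node_names> using your
# installation's usual server stop procedure (not part of the Config Tool)
# Step 2: re-run the original detach command, e.g.:
config-tool.sh detach -from-stripe <destination_stripe:port> -node <source_node:port>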
Command | |
Symptom | The following message is returned: IMPORTANT: The sum (<x>) of voter count (<y>) and number of nodes (<z>) in stripe <stripe_name> is an even number. An even-numbered configuration is more likely to experience split-brain situations. |
Diagnosis | Even-numbered counts of voters plus nodes for a given stripe can increase the chances of experiencing split-brain situations. |
Action | Consider making the total count for the stripe an odd number by adding a voter. |
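For example, a stripe with 2 nodes and 2 voters gives a total of 4, an even number at risk of a tie; adding one voter brings the total to 5, an odd number from which a majority can always be formed.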
Command | |
Symptom | The following message is returned: Some nodes may have failed to restart within <wait_time> seconds. This should be confirmed by examining the state of the nodes listed below. Note: if the cluster did not have security configured before activation but has security configured post-activation, or vice-versa, then the nodes may have in fact successfully restarted. This should be confirmed. Nodes: <node_name_list> |
Diagnosis | Some mutative commands restart the nodes and then wait for them to come back online. This error message is displayed when the Config Tool did not see a node come back online within the delay given by the Config Tool parameter -restart-wait-time. Make sure the value is not too low. |
Action | Execute the following steps: 1. Execute the diagnostic command for all the nodes that have failed to restart. 2. Examine the Node state value (refer to node states for more information about the different node states): a. If one of ACTIVE, ACTIVE_RECONNECTING, PASSIVE: the node has restarted correctly; the -restart-wait-time value used with the Config Tool was not high enough. b. If one of ACTIVE_SUSPENDED, PASSIVE_SUSPENDED: the node startup is blocked because the vote count is not sufficient to reach the desired level of consistency. c. If one of STARTING, SYNCHRONIZING: the node is still starting; just wait. d. If one of DIAGNOSTIC, UNREACHABLE: the node was unable to start, or has been started in diagnostic mode. Look at the logs for any errors and seek support if necessary. |
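A minimal sketch of this follow-up, assuming the diagnostic command accepts the -connect-to flag used by the repair command, and that -restart-wait-time takes a value in seconds (both are assumptions; this guide does not show their exact syntax):

# Step 1: check the state of each node that failed to restart
config-tool.sh diagnostic -connect-to <failed_node_host:port>
# If the node state shows it actually restarted (case a), re-run the
# original mutative command with a larger timeout, e.g.:
config-tool.sh <original_command> -restart-wait-time 180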
Command | |
Symptom | The following message is returned: Please run the 'diagnostic' command to diagnose the configuration state and try to run the 'repair' command. Please refer to the Troubleshooting Guide for more help. |
Diagnosis | An inconsistency has been found in the cluster configuration and the operation cannot continue without a manual intervention or repair. |
Action | Execute the following steps: 1. Execute the diagnostic command on the cluster. 2. Read the 'Configuration state' message block near the top of the output. 3. Find the message in Diagnosing Unexpected Errors to understand the underlying problem and how to address it. |
Symptom | The Configuration state of the diagnostic command output contains: Failed to analyze cluster configuration. |
Diagnosis | The discovery process has failed, possibly because another client is currently performing a mutative operation. This situation requires the command to be retried. |
Action | Run the command again. |
Symptom | The Configuration state message block of the diagnostic command output contains this message: Cluster configuration is inconsistent: Change <change_uuid> is committed on <committed_nodes_list> and rolled back on <rolled_back_nodes_list>. |
Diagnosis | Certain changes were found that were committed on some servers and rolled back on other servers. This situation requires a manual intervention, possibly by resetting the node and then re-syncing it after a restricted activation. |
Action | The repair of such a broken configuration state requires rewriting the configuration of certain nodes, which will make them temporarily unavailable. To repair such issues, the nodes requiring a reset (nodes that have rolled back) and the nodes requiring a reconfiguration (nodes that have committed the change) must be identified. There is no right or wrong answer, as it depends on the specific case at hand and the user's knowledge of which command(s) were issued. If the nodes that committed the change have started satisfying requests that depend on it (e.g. an offheap addition), then the change needs to be forced on the rolled-back nodes, and it must be ensured that those nodes can accept the change (e.g. enough offheap exists). Conversely, if it is known that a committed change has not been used, it can be safely removed; in this case, consider keeping the rolled-back nodes and resetting the committed ones. |
Symptom | The Configuration state of the diagnostic command output contains: Cluster configuration is partitioned and cannot be automatically repaired. Some nodes have a different configuration that others. |
Diagnosis | Some nodes were found whose configuration ends with a different change UUID, leading to different configuration results. Some nodes are running with one configuration, while other nodes are running with a different one. This situation requires a manual intervention, possibly by resetting the nodes and re-syncing them after a restricted activation. |
Action | This requires a manual intervention analogous to the previously discussed 'Action', i.e. resetting the configuration of certain nodes. See Repairing a Broken Configuration. |
Symptom | The Configuration state message block of the diagnostic command output contains this message: A new cluster configuration has been prepared on all nodes but not yet committed. No further configuration change can be done until the 'repair' command is run to finalize the configuration change. |
Diagnosis | All nodes are online and all online nodes have prepared a new change. This situation requires a commit to be replayed, or a rollback to be forced. |
Action | Execute this command: config-tool.sh repair -connect-to <host:port> |
Symptom | The Configuration state of the diagnostic command output contains: A new cluster configuration has been prepared but not yet committed or rolled back on online nodes. Some nodes are unreachable, so we do not know if the last configuration change has been committed or rolled back on them. No further configuration change can be done until the offline nodes are restarted and the 'repair' command is run again to finalize the configuration change. Please refer to the Troubleshooting Guide if needed. |
Diagnosis | Some nodes (but not all) are online, and all online nodes have prepared a new change. Because some nodes are down, we do not know whether the offline nodes have more changes in their append.log. This situation requires a commit or a rollback to be forced (only the user knows which). |
Action | Because some of the nodes are down, the Config Tool cannot determine whether the change process should be continued and committed, or rolled back. Only the user knows which action is required, and must therefore provide the necessary hint to the Config Tool to either force a commit or force a rollback: 1) config-tool.sh repair -connect-to <host:port> -force commit 2) config-tool.sh repair -connect-to <host:port> -force rollback |
Symptom | The Configuration state of the diagnostic command output contains one of these messages: A new cluster configuration has been partially prepared (some nodes didn't get the new change). No further configuration change can be done until the 'repair' command is run to rollback the prepared nodes. Or: A new cluster configuration has been partially rolled back (some nodes didn't rollback). No further configuration change can be done until the 'repair' command is run to rollback all nodes. |
Diagnosis | A specific change has been prepared on some nodes, while other nodes, which didn't receive that change, end with a different change. This can happen if a transaction was interrupted during its prepare phase, when the client asks the nodes to prepare themselves. This situation requires the rollback to be replayed. |
Action | Execute this command: config-tool.sh repair -connect-to <host:port> |
Symptom | The Configuration state of the diagnostic command output contains: A new cluster configuration has been partially committed (some nodes didn't commit). No further configuration change can be done until the 'repair' command is run to commit all nodes. |
Diagnosis | A change has been prepared, then committed, but the commit process didn't complete on all online nodes. This situation requires a commit to be replayed. |
Action | Execute this command: config-tool.sh repair -connect-to <host:port> |
Symptom | The Configuration state of the diagnostic command output contains: Unable to determine the global configuration state. There might be some configuration inconsistencies. Please look at each node details. A manual intervention might be needed to reset some nodes. |
Diagnosis | Unable to determine the configuration state of the cluster. |
Action | The user might need to reset the configuration of some nodes. See Repairing a Broken Configuration. To be able to determine which nodes to reset and how, some additional support is required: the user has to send all the server logs and configuration directories to the support team. |
Symptom | One of the following messages is observed when executing the repair command: Failed to analyze cluster configuration. Cluster configuration is inconsistent: Change <change_uuid> is committed on <committed_nodes_list> and rolled back on <rolled_back_nodes_list>. Cluster configuration is partitioned and cannot be automatically repaired. Some nodes have a different configuration that others. Unable to determine the global configuration state. There might be some configuration inconsistencies. Please look at each node details. A manual intervention might be needed to reset some nodes |
Diagnosis | Refer to the same message in the Diagnostic Command Troubleshooting section. |
Action | Refer to the same message in the Diagnostic Command Troubleshooting section. |
Symptom | One of the following messages is observed when executing the repair command: The configuration is partially prepared. A rollback is needed. The configuration is partially rolled back. A rollback is needed. |
Diagnosis | The repair tool has detected that a rollback is necessary, but the user specified the wrong action. |
Action | Execute one of these commands: config-tool.sh repair -connect-to <host:port> config-tool.sh repair -connect-to <host:port> -force rollback |
Symptom | The following message is observed when executing the repair command: The configuration is partially committed. A commit is needed. |
Diagnosis | The repair tool has detected that a commit is necessary, but the user specified the wrong action. |
Action | Execute one of these commands: config-tool.sh repair -connect-to <host:port> config-tool.sh repair -connect-to <host:port> -force commit |
Symptom | The following message is observed when executing the repair command: Some nodes are offline. Unable to determine what kind of repair to run. Please refer to the Troubleshooting Guide. |
Diagnosis | The repair command is unable to determine whether it needs to complete an incomplete change by committing, or needs to roll back, because some nodes are down. It is up to the user to hint the repair command about what to do. |
Action | Execute one of these commands: config-tool.sh repair -connect-to <host:port> -force commit config-tool.sh repair -connect-to <host:port> -force rollback |