Cluster Tool

The cluster tool is a command-line utility that allows administrators of the Terracotta Server Array to perform a variety of cluster management tasks. For example, the cluster tool can be used to:

Obtain the running status of servers

Dump the state of running servers

Take backups of running servers

Promote a suspended server on startup or failover

Shut down an entire cluster

Perform a conditional partial shutdown of a cluster having one or more passive servers configured for high availability (for upgrades etc.)

The cluster tool script is located in tools/bin under the product installation directory as cluster-tool.bat for Windows platforms, and as cluster-tool.sh for Unix/Linux.

Cluster Tool commands

The cluster tool provides several commands. To list them and their respective options, run cluster-tool.sh (or cluster-tool.bat on Windows) without any arguments, or use the option -help.

The following section lists the options that are common to all commands. These options must be specified before the command name:

Precursor options

1. -verbose

This option gives you a verbose output, and is useful to debug error conditions. Default: false.

2. -security-root-directory

This option can be used to communicate with a server which has TLS/SSL-based security configured. For more details on setting up security in a Terracotta cluster, see the section Security Core Concepts.

3. -connection-timeout

This option lets you specify a custom timeout value (in milliseconds) for connections to be established in cluster tool commands. Default: 10s.

4. -request-timeout

This option lets you specify a request timeout value for operations. Default: 10s.
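
As a minimal sketch of the ordering rule above, the wrapper below places a precursor option before the command name. The CLUSTER_TOOL variable, the run_secure function name, and the example directory are illustrative assumptions and are not part of the product.

```shell
# Location of the script; adjust to your installation (assumed default).
CLUSTER_TOOL="${CLUSTER_TOOL:-./cluster-tool.sh}"

# run_secure <security-root-directory> <command> [command arguments...]
# Puts the precursor option in front of the command name.
run_secure() {
    if [ $# -lt 2 ]; then
        echo "usage: run_secure <security-root-directory> <command> [args...]" >&2
        return 2
    fi
    srd="$1"
    shift
    "$CLUSTER_TOOL" -security-root-directory "$srd" "$@"
}

# Example call (hypothetical directory, not executed here):
#   run_secure /path/to/security-root status -connect-to localhost:9410
```

The other precursor options would be placed in the same position, ahead of the command name.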

The "status" Command

The status command displays the status of a cluster, or particular server(s) in the same or different clusters.

Syntax:

status [-cluster-name <cluster-name>] [-format json]
-connect-to <hostname[:port]>,<hostname[:port]>...

Parameters:

-cluster-name <cluster-name>

The name of the configured cluster.

-format json

Output in JSON format. Default is tabular.

-connect-to <hostname[:port]>,<hostname[:port]>...

The hostname:port(s) or hostname(s) (default port being 9410) of running servers, each specified using the -connect-to option. When provided with option -cluster-name, a reachable server in the provided list will be used. Otherwise, the command will be individually executed on each server in the list.
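
The two modes described above (cluster-level when a cluster name is given, server-level otherwise) can be sketched with a small helper that builds the argument list. The helper name cluster_status and the CLUSTER_TOOL variable are assumptions made for illustration; each server is passed with its own -connect-to option, as in the examples in this section.

```shell
# Location of the script; adjust to your installation (assumed default).
CLUSTER_TOOL="${CLUSTER_TOOL:-./cluster-tool.sh}"

# cluster_status <cluster-name-or-empty-string> <host[:port]>...
# An empty first argument requests a server-level status of every listed host.
cluster_status() {
    if [ $# -lt 2 ]; then
        echo "usage: cluster_status <cluster-name|''> <host[:port]>..." >&2
        return 2
    fi
    name="$1"
    shift
    # Rewrite the positional parameters so that each host is preceded
    # by -connect-to (POSIX sh has no arrays, so "$@" is rotated in place).
    n=$#
    while [ "$n" -gt 0 ]; do
        set -- "$@" -connect-to "$1"
        shift
        n=$((n - 1))
    done
    if [ -n "$name" ]; then
        set -- -cluster-name "$name" "$@"
    fi
    "$CLUSTER_TOOL" status "$@"
}

# Example calls (not executed here):
#   cluster_status tc-cluster localhost:9410
#   cluster_status "" localhost:9410 localhost:9510
```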

Examples

The example below shows the execution of a cluster-level status command.

The example below shows the execution of a server-level status command. No server is running at localhost:9910, hence the UNREACHABLE status.

./cluster-tool.sh status -connect-to localhost:9410 -connect-to localhost:9510 -connect-to localhost:9910
+------------------------+--------------------+--------------------------+--------------------------------------------+
| Host-Port              | Status             | Member of Cluster        | Additional Information                     |
+------------------------+--------------------+--------------------------+--------------------------------------------+
| localhost:9410         | ACTIVE             | tc-cluster               | -                                          |
-----------------------------------------------------------------------------------------------------------------------
| localhost:9510         | PASSIVE            | tc-cluster               | -                                          |
-----------------------------------------------------------------------------------------------------------------------
| localhost:9910         | UNREACHABLE        | -                        | localhost:9910=Connection refused;         |
+------------------------+--------------------+--------------------------+--------------------------------------------+

Error (PARTIAL_FAILURE): Command completed with errors.

To learn more about server states, visit the section Logical Server States.
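
When the tabular output is consumed by a script, the rows can be filtered by their Status column. The following is a minimal sketch that assumes the table layout shown in the example above; the function name unreachable_hosts is hypothetical.

```shell
# unreachable_hosts: reads tabular "status" output on standard input and
# prints the Host-Port value of every row whose Status is UNREACHABLE.
unreachable_hosts() {
    awk -F'|' '
        NF >= 4 {
            host = $2
            status = $3
            gsub(/^[ \t]+|[ \t]+$/, "", host)     # trim cell padding
            gsub(/^[ \t]+|[ \t]+$/, "", status)
            if (status == "UNREACHABLE") print host
        }'
}

# Example call (not executed here):
#   ./cluster-tool.sh status -connect-to localhost:9410 -connect-to localhost:9910 | unreachable_hosts
```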

The "promote" Command

The promote command can be used to promote a server stuck in a suspended state. For more information about suspended states, refer to the topics Server startup and Manual promotion with override voter in the section Failover Tuning.

Syntax:

promote -connect-to <hostname[:port]>,<hostname[:port]>...

Parameters:

-connect-to <hostname[:port]>,<hostname[:port]>...

The hostname:port(s) or hostname(s) (default port being 9410) of running servers, each specified using the -connect-to option. The command will be individually executed on each server in the list.

Note:
There is no cluster-wide flavor for this command.

Examples

The example below shows the execution of the promote command on a server stuck in a suspended state at localhost:9410.

./cluster-tool.sh promote -connect-to localhost
Following sub-operations were successful:
localhost:9410: Server promotion successful
Command completed successfully.

The example below shows a failed server-level promote command. The server running at localhost:9510 is not in a suspended state, so it cannot be promoted.

./cluster-tool.sh promote -connect-to localhost:9510
Following sub-operations were unsuccessful:
localhost:9510: com.terracottatech.tools.clustertool.exceptions.ClusterToolException:
Promote command failed as the server is not in a suspended state
Error (FAILURE): Command failed.

The "dump" Command

The dump command dumps the state of a cluster, or particular server(s) in the same or different clusters. The dump of each server can be found in its logs.

Syntax:

dump [-cluster-name <cluster-name>] -connect-to <hostname[:port]>,<hostname[:port]>...

Parameters:

-cluster-name <cluster-name>

The name of the configured cluster.

-connect-to <hostname[:port]>,<hostname[:port]>...

The hostname:port(s) or hostname(s) (default port being 9410) of running servers, each specified using the -connect-to option. When provided with option -cluster-name, a reachable server in the provided list will be used. Otherwise, the command will be individually executed on each server in the list.

Examples

The example below shows the execution of a cluster-level dump command.

./cluster-tool.sh dump -cluster-name tc-cluster -connect-to localhost:9410
Contacting servers: [localhost:9410]
Using reachable server: localhost:9410 to carry out the operation
Following sub-operations were successful:
localhost:9410: Dump successful
localhost:9510: Dump successful
localhost:9610: Dump successful
localhost:9710: Dump successful

Command completed successfully.

The example below shows the execution of a server-level dump command. No server is running at localhost:9910, hence the dump failure for that server.

./cluster-tool.sh dump -connect-to localhost:9410 -connect-to localhost:9510 -connect-to localhost:9910

Following sub-operations were successful:
localhost:9410: Dump successful
localhost:9510: Dump successful

Following sub-operations were unsuccessful:
localhost:9910:
org.terracotta.diagnostic.client.connection.DiagnosticServiceProviderException:
com.terracotta.connection.api.DetailedConnectionException:
java.util.concurrent.TimeoutException: localhost:9910=Connection refused;

Error (PARTIAL_FAILURE): Command completed with errors.
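
For scripted use, the servers listed under the "unsuccessful" heading can be extracted from the output. The sketch below assumes the output layout shown in the examples of this section; the function name failed_hosts is hypothetical.

```shell
# failed_hosts: reads command output on standard input and prints the
# host:port of every entry listed under
# "Following sub-operations were unsuccessful:".
failed_hosts() {
    awk '
        /^[ \t]*Following sub-operations were unsuccessful:/ { in_failed = 1; next }
        /^[ \t]*Following sub-operations were successful:/   { in_failed = 0; next }
        in_failed && match($0, /^[ \t]*[A-Za-z0-9._-]+:[0-9]+:/) {
            host = substr($0, RSTART, RLENGTH - 1)   # drop the trailing colon
            sub(/^[ \t]+/, "", host)                 # drop any indentation
            print host
        }'
}

# Example call (not executed here):
#   ./cluster-tool.sh dump -connect-to localhost:9410 -connect-to localhost:9910 | failed_hosts
```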

The "ipwhitelist-reload" Command

The ipwhitelist-reload command reloads the IP whitelist on a cluster, or particular server(s) in the same or different clusters. See the section IP Whitelisting for details.

Syntax:

ipwhitelist-reload [-cluster-name <cluster-name>] -connect-to <hostname[:port]>,<hostname[:port]>...

Parameters:

-cluster-name <cluster-name>

The name of the configured cluster.

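-connect-to <hostname[:port]>,<hostname[:port]>...

The hostname:port(s) or hostname(s) (default port being 9410) of running servers, each specified using the -connect-to option. When provided with option -cluster-name, a reachable server in the provided list will be used. Otherwise, the command will be individually executed on each server in the list.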

Examples

The example below shows the execution of a cluster-level ipwhitelist-reload command.

./cluster-tool.sh ipwhitelist-reload -cluster-name tc-cluster -connect-to localhost
Contacting servers: [localhost:9410]
Using reachable server: localhost:9410 to carry out the operation
Following sub-operations were successful:
localhost:9410: IP whitelist reload successful
localhost:9510: IP whitelist reload successful
localhost:9610: IP whitelist reload successful
localhost:9710: IP whitelist reload successful

Command completed successfully.

The example below shows the execution of a server-level ipwhitelist-reload command. No server is running at localhost:9910, hence the IP whitelist reload failure.

./cluster-tool.sh ipwhitelist-reload -connect-to localhost:9410 -connect-to localhost:9510 -connect-to localhost:9910
Following sub-operations were successful:
localhost:9410: IP whitelist reload successful
localhost:9510: IP whitelist reload successful

Following sub-operations were unsuccessful:
localhost:9910:
org.terracotta.diagnostic.client.connection.DiagnosticServiceProviderException:
com.terracotta.connection.api.DetailedConnectionException:
java.util.concurrent.TimeoutException: localhost:9910=Connection refused;

Error (PARTIAL_FAILURE): Command completed with errors.

The "backup" Command

The backup command takes a backup of the running Terracotta cluster. The backup is taken on active servers only. Before taking a backup of a cluster, backup-dir needs to be set on each server. For more details about this feature, see Backup, Restore and Data Migration.

Syntax:

backup -cluster-name <cluster-name> -connect-to <hostname[:port]>,<hostname[:port]>...

Parameters:

-cluster-name <cluster-name>

The name of the configured cluster.

-connect-to <hostname[:port]>,<hostname[:port]>...

The hostname:port(s) or hostname(s) (default port being 9410) of running servers, each specified using the -connect-to option. A reachable server in the provided server list will be used for connection.

Note:
There's no server-level flavor of this command, as backup works at the cluster level only.

Examples

The example below shows the execution of a successful backup command. Note that the server at localhost:9710 was unreachable.

./cluster-tool.sh backup -cluster-name tc-cluster -connect-to localhost:9710 -connect-to localhost:9410
Contacting servers: [localhost:9710, localhost:9410]
Following sub-operations were unsuccessful:
localhost:9710: org.terracotta.diagnostic.client.connection.DiagnosticServiceProviderException:
com.terracotta.connection.api.DetailedConnectionException: java.util.concurrent.TimeoutException: localhost:9710=Connection refused;

Using reachable server: localhost:9410 to carry out the operation

PHASE 0: SETTING BACKUP NAME TO : 996e7e7a-5c67-49d0-905e-645365c5fe28
localhost:9710: TIMEOUT
localhost:9410: SUCCESS
localhost:9510: SUCCESS
localhost:9610: SUCCESS

PHASE (1/4): PREPARE_FOR_BACKUP
localhost:9710: TIMEOUT
localhost:9410: SUCCESS
localhost:9510: NOOP
localhost:9610: SUCCESS

PHASE (2/4): ENTER_ONLINE_BACKUP_MODE
localhost:9410: SUCCESS
localhost:9610: SUCCESS

PHASE (3/4): START_BACKUP
localhost:9410: SUCCESS
localhost:9610: SUCCESS

PHASE (4/4): EXIT_ONLINE_BACKUP_MODE
localhost:9410: SUCCESS
localhost:9610: SUCCESS
Command completed successfully.

The example below shows the execution of a failed backup command.

./cluster-tool.sh backup -cluster-name tc-cluster -connect-to localhost:9610
Contacting servers: [localhost:9610]
Using reachable server: localhost:9610 to carry out the operation

PHASE 0: SETTING BACKUP NAME TO : 93cdb93d-ad7c-42aa-9479-6efbdd452302
localhost:9410: SUCCESS
localhost:9510: SUCCESS
localhost:9610: SUCCESS
localhost:9710: SUCCESS

PHASE (1/4): PREPARE_FOR_BACKUP
localhost:9410: SUCCESS
localhost:9510: NOOP
localhost:9610: SUCCESS
localhost:9710: NOOP

PHASE (2/4): ENTER_ONLINE_BACKUP_MODE
localhost:9410: BACKUP_FAILURE
localhost:9610: SUCCESS

PHASE (CLEANUP): ABORT_BACKUP
localhost:9410: SUCCESS
localhost:9610: SUCCESS

Error (FAILURE): Unable to complete backup.
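
To find out which phase and server caused a backup to fail, the per-phase result lines can be scanned for any status other than SUCCESS or NOOP. This is a minimal sketch based only on the output layout shown in the examples above; the function name backup_problems is hypothetical.

```shell
# backup_problems: reads "backup" output on standard input and prints
# "<phase> <host:port> <status>" for every per-server result that is
# neither SUCCESS nor NOOP (for example TIMEOUT or BACKUP_FAILURE).
backup_problems() {
    awk '
        /^[ \t]*PHASE / {
            phase = $2               # e.g. "0:", "(2/4):" or "(CLEANUP):"
            sub(/:$/, "", phase)
            next
        }
        phase != "" && NF == 2 && $1 ~ /:[0-9]+:$/ && $2 != "SUCCESS" && $2 != "NOOP" {
            host = $1
            sub(/:$/, "", host)
            print phase, host, $2
        }'
}

# Example call (not executed here):
#   ./cluster-tool.sh backup -cluster-name tc-cluster -connect-to localhost:9610 | backup_problems
```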

The "shutdown" Command

The shutdown command shuts down a running Terracotta cluster. During the course of the shutdown process, it ensures that:

Shutdown safety checks are performed on all the servers. Exactly what safety checks are performed will depend on the specified options and is explained in detail later in this section.

All data is persisted to eliminate data loss.

All passive servers are shut down first before shutting down the active servers.

The shutdown command follows a multi-phase process as follows:

1. Check with all servers whether they are OK to shut down. Whether or not a server is OK to shut down will depend on the specified shutdown options and the state of server in question.

2. If all servers agree to the shutdown request, all of them will be asked to prepare for the shutdown. Preparing for shutdown may include the following:

a. Persist all data.

b. Block new incoming requests. This ensures that the persisted data will be cluster-wide consistent after shutdown.

3. If all servers successfully prepare for the shutdown, a shutdown call will be issued to all the servers.

The first two steps above make the shutdown atomic to the extent possible, because the system can be rolled back to its original state if there are any errors. In such cases, client-request processing will resume as usual after unblocking any blocked servers.

In the unlikely event of a failure in the third step above, the error message will clearly specify the servers that failed to shut down. In this case, use the -force option to forcefully terminate the remaining servers. If there is a network connectivity issue, the forceful shutdown may fail, and the remaining servers will have to be terminated using operating system commands.

Note:
The shutdown sequence also ensures that the data is stripe-wide consistent. However, it is recommended that clients be shut down before attempting to shut down the Terracotta cluster.

Syntax:

shutdown [ -cluster-name <cluster-name> [-force | -now] ] -connect-to <hostname[:port]>,<hostname[:port]>...

Parameters:

-cluster-name <cluster-name>

The name of the configured cluster.

-force

Forcefully shut down the cluster, even if the cluster is only partially reachable.

-now

Do an immediate shutdown of the cluster, even if clients are connected.

-connect-to <hostname[:port]>,<hostname[:port]>...

The hostname:port(s) or hostname(s) (default port being 9410) of running servers, each specified using the -connect-to option.

If the -cluster-name option is not specified, this command forcefully shuts down only the servers specified in the list. For clusters having stripes configured for high availability (with at least one passive server per stripe), it is recommended that you use the partial cluster shutdown commands explained in the section below, as they allow conditional shutdown, instead of using the shutdown variant without the -cluster-name option.

If the -cluster-name option is specified (i.e. a full cluster shutdown), this command shuts down the entire cluster. Servers in the provided list will be contacted for connectivity, and the command will then verify the cluster configuration with the given cluster name by obtaining the cluster configuration from the first reachable server. If all servers are reachable, this command checks if all servers in all the stripes are safe to shut down before proceeding with the command.

A cluster is considered to be safe to shut down provided the following are true:

No critical operations such as backup and restore are going on.

No Ehcache or TCStore clients are connected.

All servers in all the stripes are reachable.

If either the -force or -now option is specified, the safety checks described above are relaxed as follows:

If the -now option is specified, this command proceeds with the shutdown even if clients are connected.

If the -force option is specified, this command proceeds with the shutdown even if none of the conditions specified for safe shutdown above are met.
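
The rules above can be summarized as a small decision function. This is only an illustrative model of the documented conditions, not the tool's implementation. It assumes that -now waives only the connected-clients check, which is how the description above reads. The function name and argument conventions are hypothetical.

```shell
# shutdown_allowed <mode> <critical_ops> <clients_connected> <all_reachable>
#   mode: "safe" (no option), "now" (-now) or "force" (-force)
#   the remaining arguments are "yes" or "no"
# Returns 0 if a full cluster shutdown would proceed, 1 if not,
# and 2 for an unknown mode.
shutdown_allowed() {
    mode="$1"
    critical_ops="$2"
    clients="$3"
    reachable="$4"
    case "$mode" in
        force)
            # -force proceeds even if no safety condition is met.
            return 0 ;;
        now)
            # -now ignores connected clients (assumed to keep the other checks).
            [ "$critical_ops" = no ] && [ "$reachable" = yes ] ;;
        safe)
            [ "$critical_ops" = no ] && [ "$clients" = no ] && [ "$reachable" = yes ] ;;
        *)
            return 2 ;;
    esac
}

# Example: clients are still connected and -now is given.
#   shutdown_allowed now no yes yes && echo "shutdown proceeds"
```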

For all cases, the shutdown sequence is performed as follows:

1. Flush all data to persistent store for datasets or caches that have persistence configured.

2. Shut down all the passive servers, if any, in the cluster for all stripes.

3. Once the passive servers are shut down, issue a shutdown request to all the active servers in the cluster.

The above shutdown sequence is the cleanest way to shut down a cluster.

Examples

The example below shows the execution of a successful cluster-level shutdown command.

./cluster-tool.sh shutdown -cluster-name tc-cluster -connect-to localhost:9410
Contacting servers: [localhost:9410]
Using reachable server: localhost:9410 to carry out the operation

Shutting down cluster: tc-cluster
STEP (1/3): Preparing to shut down
STEP (2/3): Stopping all passive servers first
STEP (3/3): Stopping all active servers
Command completed successfully.

The example below shows the execution of a cluster-level shutdown command that fails because one of the servers in the cluster was not reachable.

./cluster-tool.sh shutdown -cluster-name tc-cluster -connect-to localhost:9410
Contacting servers: [localhost:9410]
Using reachable server: localhost:9410 to carry out the operation
Error (FAILURE): Timed out trying to reach the server
Detailed Error Status for Cluster `tc-cluster` :
ServerError{host='localhost:9510', Error='Timed out trying to reach the server'}.
Unable to process safe shutdown request.
Command failed.

The example below shows the execution of a successful cluster-level shutdown command with the -force option. Note that one of the servers in the cluster was already down.

./cluster-tool.sh shutdown -force -cluster-name tc-cluster -connect-to localhost:9410
Contacting servers: [localhost:9410]
Using reachable server: localhost:9410 to carry out the operation
Timed out trying to reach the server
Detailed Error Status for Cluster `tc-cluster` :
ServerError{host='localhost:9510', Error='Timed out trying to reach the server'}.
Continuing forced shutdown.

Shutting down cluster: tc-cluster
STEP (1/3): Preparing to shut down
Timed out trying to reach the server
Detailed Error Status :
ServerError{host='localhost:9510', Error='Timed out trying to reach the server'}.
Continuing forced shutdown.
STEP (2/3): Stopping all passive servers first
STEP (3/3): Stopping all active servers
Command completed successfully.

Partial Cluster Shutdown Commands

Partial cluster shutdown commands can be used to partially shut down nodes in the cluster without sacrificing the availability of the cluster. These commands can be used only on a cluster that is configured for redundancy with one or more passive servers per stripe. The purpose of these commands is to allow administrators to perform routine and planned administrative tasks, such as rolling upgrades, with high availability.

The following flavors of partial cluster shutdown commands are available:

shutdown-if-passive

shutdown-if-active

shutdown-all-passives

shutdown-all-actives

As a general rule, if these commands are successful, the specified servers will be shut down. If there are any errors due to which these commands abort, the state of the servers will be left intact.

From the table of server states described in Logical Server States, the following are the different active states that a server may find itself in:

ACTIVE

ACTIVE_RECONNECTING

ACTIVE_SUSPENDED

Note:
In the following sections, the term 'active servers' means servers in any of the active states mentioned above, unless explicitly stated otherwise.

Similarly, the following are the passive states for a server:

PASSIVE_SUSPENDED

SYNCHRONIZING

PASSIVE

Note:
In the following sections, the term 'passive servers' means servers in any of the passive states mentioned above, unless explicitly stated otherwise.
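
The grouping of states above can be expressed as a simple lookup, which may be useful when a script decides which partial shutdown command applies to a server. The function name server_role is hypothetical. Any state not listed above is reported as "other".

```shell
# server_role <state>
# Prints "active", "passive" or "other" for a logical server state name.
server_role() {
    case "$1" in
        ACTIVE|ACTIVE_RECONNECTING|ACTIVE_SUSPENDED) echo active ;;
        PASSIVE|PASSIVE_SUSPENDED|SYNCHRONIZING)     echo passive ;;
        *)                                           echo other ;;
    esac
}

# Example:
#   server_role SYNCHRONIZING    # prints "passive"
```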

The "shutdown-if-passive" Command

The shutdown-if-passive command shuts down the specified servers in the cluster, provided the following conditions are met:

All the stripes in the cluster are functional and there is one healthy active server with no suspended active servers per stripe.

All the servers specified in the list are passive servers.

Syntax:

shutdown-if-passive -connect-to <hostname[:port]>,<hostname[:port]>...

Parameters:

-connect-to <hostname[:port]>,<hostname[:port]>...

The hostname:port(s) or hostname(s) (default port being 9410) of running servers, each specified using the -connect-to option.

Note:
There's no cluster-level flavor of this command.

Examples

The example below shows the execution of a successful shutdown-if-passive command.

./cluster-tool.sh shutdown-if-passive -connect-to localhost:9510
Contacting servers: [localhost:9510]

Stopping passive node(s): [localhost:9510] of cluster: tc-cluster
STEP (1/2): Preparing to shutdown
STEP (2/2): Stopping if Passive
Command completed successfully.

The example below shows the execution of a failed shutdown-if-passive command, as it tried to shut down a server which is not a passive server.

./cluster-tool.sh shutdown-if-passive -connect-to localhost:9410
Contacting servers: [localhost:9410]
Error (FAILURE): Unable to process the partial shutdown request.
One or more of the specified server(s) are not in passive state
or may not be in the same cluster
Discovered state of all servers are as follows:
Reachable Servers : 2
Stripe #: 1
Node: {localhost:9410} State: ACTIVE
Node: {localhost:9510} State: PASSIVE

Please check server logs for more details.
Command failed.

The "shutdown-if-active" Command

The shutdown-if-active command shuts down the specified servers in the cluster, provided the following conditions are met:

All the servers specified in the list are active servers.

All the stripes corresponding to the given servers have at least one server in 'PASSIVE' state.

Syntax:

shutdown-if-active -connect-to <hostname[:port]>,<hostname[:port]>...

Parameters:

-connect-to <hostname[:port]>,<hostname[:port]>...

The hostname:port(s) or hostname(s) (default port being 9410) of running servers, each specified using the -connect-to option.

Note:
There's no cluster-level flavor of this command.

Examples

The example below shows the execution of a successful shutdown-if-active command:

./cluster-tool.sh shutdown-if-active -connect-to localhost:9410
Contacting servers: [localhost:9410]

Stopping active node(s): [localhost:9410] of cluster: tc-cluster
STEP (1/2): Preparing to shut down
STEP (2/2): Shut down if active server
Command completed successfully.

The example below shows the execution of a failed shutdown-if-active command as the specified server was not an active server.

./cluster-tool.sh shutdown-if-active -connect-to localhost:9510
Contacting servers: [localhost:9510]
Error (FAILURE): Unable to process the partial shutdown request.
One or more of the specified server(s) are not in active state
or may not be in the same cluster.
Reachable Servers : 2
Stripe #: 1
Node : {localhost:9410} State : ACTIVE
Node : {localhost:9510} State : PASSIVE

Please check server logs for more details
Command failed.
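
When a partial shutdown request is rejected, the discovered server states printed in the error output show why. The sketch below extracts them as "host:port state" pairs. It assumes the "Node ... State ..." line layout shown in the failed examples of shutdown-if-passive and shutdown-if-active, and the function name node_states is hypothetical.

```shell
# node_states: reads command output on standard input and prints
# "<host:port> <state>" for every "Node: {host:port} State: STATE" line.
# Both "Node:" and "Node :" spellings are accepted.
node_states() {
    awk '
        /Node[ \t]*:[ \t]*[{]/ {
            host = $0
            sub(/^[^{]*[{]/, "", host)   # remove everything up to "{"
            sub(/[}].*$/, "", host)      # remove "}" and the rest of the line
            print host, $NF              # the state is the last field
        }'
}

# Example call (not executed here):
#   ./cluster-tool.sh shutdown-if-active -connect-to localhost:9510 | node_states
```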

The "shutdown-all-passives" Command

The shutdown-all-passives command shuts down all the passive servers in the specified cluster, provided the following is true:

All the stripes in the cluster are functional and there is one active server in 'ACTIVE' state with no suspended active servers per stripe.

All passive servers in all the stripes of the cluster will be shut down when this command is run.

Syntax:

shutdown-all-passives -cluster-name <cluster-name> -connect-to <hostname[:port]>,<hostname[:port]>...

Parameters:

-cluster-name <cluster-name>

The name of the configured cluster.

-connect-to <hostname[:port]>,<hostname[:port]>...

The hostname:port(s) or hostname(s) (default port being 9410) of running servers, each specified using the -connect-to option. These host(s) need not be passive servers.

Note:
There's no server-level flavor of this command, as it can be used only to shut down all the passive servers in the entire cluster.

The command shuts down all the passive servers in a multi-phase manner as follows:

1. Check with all servers whether it is safe to shut down as a passive server.

2. Flush any data that needs to be made persistent across all servers that are going down and block any further changes.

3. Issue a shutdown request to all passive servers if all passive servers succeed in step 2.

4. If any servers fail in step 2 or above, the shutdown request will fail and the state of the servers will remain intact.

Examples

The example below shows the execution of a successful shutdown-all-passives command.

./cluster-tool.sh shutdown-all-passives -cluster-name tc-cluster -connect-to localhost:9410
Contacting servers: [localhost:9410]
Using reachable server: localhost:9410 to carry out the operation

Stopping passive node(s): [localhost:9510] of cluster: tc-cluster
STEP (1/2): Preparing to shutdown
STEP (2/2): Stopping if Passive
Command completed successfully.

The "shutdown-all-actives" Command

The shutdown-all-actives command shuts down the active server of all stripes in the cluster, provided the following are true:

There are no suspended active servers in the cluster.

There is at least one passive server in 'PASSIVE' state in every stripe in the cluster.

The active server of all stripes of the cluster will be shut down when this command returns success. If the command reports an error, the state of the servers will be left intact.

Syntax:

shutdown-all-actives -cluster-name <cluster-name> -connect-to <hostname[:port]>,<hostname[:port]>...

Parameters:

-cluster-name <cluster-name>

The name of the configured cluster.

-connect-to <hostname[:port]>,<hostname[:port]>...

The hostname:port(s) or hostname(s) (default port being 9410) of running servers, each specified using the -connect-to option. These host(s) need not be active servers.

Note:
There's no server-level flavor of this command as it can be used only to shut down all the active servers in the entire cluster.

The command shuts down all the active servers in a multi-phase manner as explained below:

1. Check with all servers whether they are safe to be shut down as active servers.

2. Flush any data that needs to be made persistent across all servers that are going down and block any further changes.

3. Issue a shutdown request to all active servers if they succeed in step 2.

4. If any servers fail in step 2 or above, the shutdown request will fail and the state of the servers will remain as before.

Examples

The example below shows the execution of a successful shutdown-all-actives command. Note that the specified host was a passive server in this example. As the specified host is used only to connect to the cluster and obtain the correct state of all the servers in the cluster, the command successfully shuts down all the active servers in the cluster, leaving the passive servers intact.

./cluster-tool.bat shutdown-all-actives -cluster-name tc-cluster -connect-to localhost:9510
Contacting servers: [localhost:9510]
Using reachable server: localhost:9510 to carry out the operation

Stopping active node(s): [localhost:9410] of cluster: tc-cluster
STEP (1/2): Preparing to shut down
STEP (2/2): Shut down if active server
Command completed successfully.
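
The four partial shutdown commands differ in whether they take a cluster name: shutdown-if-passive and shutdown-if-active are server-level only, while shutdown-all-passives and shutdown-all-actives require -cluster-name. The wrapper below is a minimal sketch of that difference. The function name partial_shutdown and the CLUSTER_TOOL variable are assumptions made for illustration.

```shell
# Location of the script; adjust to your installation (assumed default).
CLUSTER_TOOL="${CLUSTER_TOOL:-./cluster-tool.sh}"

# partial_shutdown shutdown-if-passive|shutdown-if-active <host[:port]>...
# partial_shutdown shutdown-all-passives|shutdown-all-actives <cluster-name> <host[:port]>...
partial_shutdown() {
    if [ $# -lt 2 ]; then
        echo "usage: partial_shutdown <flavor> [<cluster-name>] <host[:port]>..." >&2
        return 2
    fi
    flavor="$1"
    shift
    name=""
    case "$flavor" in
        shutdown-if-passive|shutdown-if-active)
            ;;                       # server-level: no cluster name
        shutdown-all-passives|shutdown-all-actives)
            name="$1"                # cluster-level: first argument is the name
            shift ;;
        *)
            echo "unknown flavor: $flavor" >&2
            return 2 ;;
    esac
    if [ $# -lt 1 ]; then
        echo "at least one server is required" >&2
        return 2
    fi
    # Prefix every host with -connect-to by rotating "$@" in place.
    n=$#
    while [ "$n" -gt 0 ]; do
        set -- "$@" -connect-to "$1"
        shift
        n=$((n - 1))
    done
    if [ -n "$name" ]; then
        set -- -cluster-name "$name" "$@"
    fi
    "$CLUSTER_TOOL" "$flavor" "$@"
}

# Example calls (not executed here):
#   partial_shutdown shutdown-if-passive localhost:9510
#   partial_shutdown shutdown-all-actives tc-cluster localhost:9510
```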