Running a Configuration Health Check
Overview
The HealthChecker is a tool for checking the correctness of a realm or cluster configuration.
The tool is primarily intended for use by Software AG support staff for analyzing possible problems in customer configurations, but you might also find it useful for checking your configuration.
The tool can be used in the following ways:
To check the configuration of a live realm (which can be a single entity or a node of a cluster) or a cluster. If the realm is a node of the cluster, the checks will also be automatically executed against all the other cluster members.
To do an offline check of the configuration of a realm or cluster, based on configuration information that has been exported to XML files. Each such XML file contains the configuration data of one realm: its channels, queues, durables, datagroups, and so on. In this mode, the tool runs the checks against all cluster members only if the XML paths of all members are given explicitly when calling the tool.
Typical configuration aspects that can be checked in a clustered realm are:
Datagroups: Datagroups belonging to a cluster must be present on all nodes of the cluster and their attributes must be the same.
Durables: Durables belonging to clusterwide channels should also be clusterwide. They must be present on all nodes of the cluster and their attributes must be the same.
Joins: Joins between clusterwide channels must be present on all nodes of the cluster and their attributes have to be the same.
Stores: Stores belonging to a cluster must be present on all nodes of the cluster and their attributes and properties must be the same.
Typical configuration aspects that can be checked in a non-clustered realm are:
Durables: Durables belonging to a non-clustered realm must be non-clusterwide and must be attached to a non-clusterwide channel.
Stores: Some store configurations can have a negative impact on system performance; such configurations are highlighted.
Checks against a live realm
The checks that can be run against a live realm are the following:
Name of Check | Description | Default check? |
DataGroupMismatchCheck | Check if datagroups are coherent across all nodes of the cluster. | |
DurableMismatchCheck | Check if durables are coherent across all nodes of the cluster. | Y |
DurableSubscriberLargeStoreCheck | Check the number of remaining events to be consumed in a shared durable. If the number is greater than the threshold, a warning is displayed. The default value for the threshold is 1000. This check takes an additional parameter -threshold that allows you to specify a custom value for the threshold. | Y |
DurableWarningCheck | Check for durables that have no JMSEngine and no active consumer. | Y |
EnvironmentStateCheck | Check and display the status of a running environment. The HealthChecker first checks whether the server configuration property Enable Flow Control in the configuration group Server Protection (see the note after this table) is set to true. If it is, the HealthChecker then checks what percentage of the heap memory is occupied by events. If the percentage exceeds certain thresholds, an appropriate log message is displayed (see the thresholds below). The general idea is that the Server Protection mechanism gradually slows the intake of events from clients when certain thresholds are reached. The degree of slowing down is controlled by three server configuration properties: FlowControlWaitTimeOne, FlowControlWaitTimeTwo and FlowControlWaitTimeThree. Each represents a period of time, measured in milliseconds, by which client publishing requests are delayed when the corresponding threshold has been reached. Threshold 70%-80%: A log message indicates that client publishing requests will be delayed by FlowControlWaitTimeOne milliseconds. Threshold 80%-90%: A log message indicates that client publishing requests will be delayed by FlowControlWaitTimeTwo milliseconds. Threshold 90%-94%: A log message indicates that client publishing requests will be delayed by FlowControlWaitTimeThree milliseconds. Threshold higher than 94%: A log message indicates that client publishing requests will be delayed by a long duration (24 days). An illustration of these thresholds is given after the notes below this table. | Y |
FixLevelCheck | Check if all nodes of a cluster are on the same fix level. It is strongly recommended that they are. | Y |
JNDIStatusCheck | Check JNDI status and mismatches for stores. | Y |
JoinLastEIDMismatchCheck | Check for joins that are ahead of the channel's last EID. You can use this check both on standalone and cluster UM servers. | Y |
JoinMismatchCheck | Check if joins are coherent across all nodes of the cluster. | Y |
JoinSyncWarningCheck | Check if join is out of sync on the cluster. | Y |
RealmACLCheck | Check the realm access control list (ACL). | |
RealmConfigurationChecks | Check the realm configuration against the default configuration list. | Y |
ResourcesSafetyLimitsCheck | Check that channel/queue resources have either TTL or Capacity configured to a non-zero value. If both of these values are zero, this means that the channel/queue is not configured with any safety limits. | |
ServerProtectionConsistencyCheck | Check if the server configuration properties in the Server Protection configuration group (see the note after this table) are coherent across the nodes of a cluster in a running environment. | |
StoreACLCheck | Check store ACL warnings. | |
StoreMemoryCheck | Check the memory usage of stores. | Y |
StoreMismatchCheck | Check if stores are coherent across all nodes of the cluster. | Y |
StoreWarningsCheck | Check store warnings on the specified realm. | Y |
A "Y" in the column "Default check?" indicates that the check is included in the -mode=default setting (see the topic The -mode parameter below).
Important:
When the tool checks for outstanding durable events or event ID mismatches in a live environment, there is a chance of getting warning messages, even though the cluster is working correctly. This is because the check is not atomic for the live cluster, so a small synchronization discrepancy can be expected.
Note:
For further information about the server configuration parameters and the configuration group Server Protection mentioned in the table above, see the section Realm Configuration.
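For illustration, assume that Enable Flow Control is set to true and that the wait-time properties are set to the example values FlowControlWaitTimeOne=2000, FlowControlWaitTimeTwo=4000 and FlowControlWaitTimeThree=8000 milliseconds (these values are only examples, not recommendations). If events occupy 75% of the heap, the EnvironmentStateCheck reports that client publishing requests will be delayed by 2000 milliseconds; at 85% the reported delay is 4000 milliseconds; at 92% it is 8000 milliseconds; and above 94% the check reports that publishing requests will be delayed by a long duration (24 days). The exact wording of the log messages may differ from this summary.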
Checks against a realm's stored XML configuration
The checks that can be run against a stored XML configuration are the following:
Name of Check | Description | Default check? |
DataGroupMismatchCheck | Check if datagroups are coherent across all the nodes of the cluster. | |
DurableMismatchCheck | Check if durables are coherent across all nodes of the cluster. | Y |
FixLevelCheck | Check if all nodes of a cluster are on the same fix level. It is strongly recommended that they are. | Y |
JNDIStatusCheck | Check JNDI status and mismatches for stores on all the nodes of the cluster. | Y |
JoinMismatchCheck | Check if joins are coherent across all nodes of the cluster. | Y |
RealmACLCheck | Check realm ACL warnings. | |
RealmConfigurationCheck | Check realm configuration property warnings. | Y |
ResourcesSafetyLimitsCheck | Check that channel/queue resources have either TTL or Capacity (maximum number of events) configured to a non-zero value. If both of these values are zero, this means that the channel/queue is not configured with any safety limits. | |
ServerProtectionConsistencyCheck | Check if the server configuration properties in the configuration group Server Protection are coherent across the nodes of a cluster, based on the configurations exported from the nodes. | |
StoreACLCheck | Run security checks on store ACLs. | |
StoreMismatchCheck | Check if stores are coherent across all nodes of the cluster. | Y |
StoreWarningsCheck | Check store warnings on the specified realm. | Y |
Command Usage
Displaying help text
To display a help text showing a summary of the command usage, call the HealthChecker without parameters:
runUMTool HealthChecker
Command Syntax
The HealthChecker requires either the -rname parameter, which offers checks against a live realm, or the -xml parameter, which offers checks against a realm's stored XML configuration. Note that you cannot use these two parameters in the same invocation of the HealthChecker.
The syntax is as follows:
runUMTool HealthChecker {-rname=<rname> | -xml=/path/to/xml1,...}
[-check=<checktype>[,<checktype> ...] ]
[-mode=<modetype>]
[-include=<checktype>[,<checktype> ...] ]
[-exclude=<checktype>[,<checktype> ...] ]
[-<additionalParameter1>=<value>] [-<additionalParameter2>=<value>] ...
Running a health check of a running realm
runUMTool HealthChecker -rname=nsp://localhost:11000
This will run the HealthChecker tool against the given running realm.
Running a health check of a stored realm configuration
runUMTool HealthChecker -xml=/path/to/xml1.xml,/path/to/xml2.xml
This will run the HealthChecker tool against the realm configurations stored in the given XML files.
The -check parameter
This parameter allows you to specify explicitly the check or checks that you want to execute; no other checks are included. This parameter can only be used together with the -rname or -xml parameter; the other optional parameters have no meaning in the context of -check and therefore cannot be used with it.
Example - Execute only the Store Warnings Check check against the running realm:
runUMTool HealthChecker -rname=nsp://localhost:11000
-check=StoreWarningsCheck
Example - Execute only the Store Warnings Check and Fix Level Check checks against the running realm:
runUMTool HealthChecker -rname=nsp://localhost:11000
-check=StoreWarningsCheck,FixLevelCheck
The -mode parameter
This parameter allows you to select a predefined set of checks without having to name the checks explicitly. The -mode and -check parameters are mutually exclusive.
The -mode parameter can take one of the following values:
default - this value selects the recommended minimal subset of checks. This is the default option.
all - this mode selects all checks.
You can use the -include and -exclude parameters to modify the set of checks selected by the -mode parameter.
If neither -mode nor -check is specified, the default set of checks will be executed.
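For example, the following two calls are equivalent; both run the default set of checks against the running realm:
runUMTool HealthChecker -rname=nsp://localhost:11000 -mode=default
runUMTool HealthChecker -rname=nsp://localhost:11000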
The -include and -exclude parameters
You can use the -include and -exclude parameters to further refine the set of checks that has been selected by the -mode parameter. You can use -include and -exclude in the same call of the HealthChecker, as long as they do not specify the same check.
-include - Run all checks from the set defined by the -mode parameter, and additionally include the check or checks specified by this parameter. The parameter may contain a single check or a comma-separated list of checks.
-exclude - Run all checks from the set defined by the -mode parameter, except the specified check or checks. The parameter may contain a single check or a comma-separated list of checks.
The -<additionalParameter> parameters
Some of the health checks allow you to specify one or more additional parameters when calling the HealthChecker. The name and purpose of each additional parameter is specific to the individual health check being run.
For example, the DurableSubscriberLargeStoreCheck check allows you to specify the additional parameter -threshold=<value>, which defines a threshold for the number of remaining events to be consumed in a shared durable.
The following general rules apply:
Each additional parameter has a default value, so if you do not specify an additional parameter explicitly, its default value is used.
If multiple additional parameters and multiple checks are specified, each individual check uses only its own additional parameters.
The additional parameters can be given in any order.
Checks that do not take additional parameters ignore them.
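Example - Run the default set of checks against a live realm, overriding the threshold of the DurableSubscriberLargeStoreCheck with an illustrative value of 5000 remaining events; the other checks in the set simply ignore the -threshold parameter:
runUMTool HealthChecker -rname=nsp://localhost:11000
-mode=default -threshold=5000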
Syntax Examples
Example - Execute all available checks against a live realm:
runUMTool HealthChecker -rname=nsp://localhost:11000 -mode=all
Example - Execute all available checks against a live realm, except the specified ones:
runUMTool HealthChecker -rname=nsp://localhost:11000
-mode=all -exclude=JNDIStatusCheck,FixLevelCheck,JoinMismatchCheck
Example - Execute the default set of checks against a live realm, adding StoreWarningsCheck, which is not part of the default set:
runUMTool HealthChecker -rname=nsp://localhost:11000
-mode=default -include=StoreWarningsCheck
Example - Execute the default set of checks against a live realm, excluding JNDIStatusCheck and FixLevelCheck, and adding StoreWarningsCheck:
runUMTool HealthChecker -rname=nsp://localhost:11000
-mode=default -include=StoreWarningsCheck
-exclude=JNDIStatusCheck,FixLevelCheck
Note:
The previous examples are based on live checks using the -rname parameter. The same logic applies if you use the -xml parameter instead.
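For example, the last example above, expressed against stored XML configurations, would be:
runUMTool HealthChecker -xml=/path/to/xml1.xml,/path/to/xml2.xml
-mode=default -include=StoreWarningsCheck
-exclude=JNDIStatusCheck,FixLevelCheck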
Full Example
The following example compares the XML configuration files of two realms in a cluster. The realms are named realm0 and realm1, and their configuration files are named clustered_realm0.xml and clustered_realm1.xml.
XML configuration file clustered_realm0.xml for realm0:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<NirvanaRealm name="realm0" exportDate="2016-11-08Z"
comment="Realm configuration from realm0" version="BuildIdentifier"
buildInfo="BuildIdentifier">
<ClusterSet>
<ClusterEntry name="cluster1">
<ClusterMember name="realm1" rname="nsp://localhost:11010/"
canBeMaster="true"/>
<ClusterMember name="realm0" rname="nsp://localhost:11000/"
canBeMaster="true"/>
</ClusterEntry>
</ClusterSet>
<ChannelSet>
<ChannelEntry>
<ChannelAttributesEntry name="channel1" TTL="0" capacity="5" EID="0"
clusterWide="true" jmsEngine="false" mergeEngine="false"
type="PERSISTENT_TYPE"/>
<StorePropertiesEntry HonorCapacityWhenFull="false"
SyncOnEachWrite="false" SyncMaxBatchSize="0" SyncBatchTime="0"
PerformAutomaticMaintenance="false" EnableCaching="true"
CacheOnReload="true" EnableReadBuffering="true"
ReadBufferSize="10240" Priority="4" EnableMulticast="false"
StampDictionary="0" MultiFileEventsPerSpindle="50000"/>
<ChannelJoinSet>
<ChannelJoinEntry filter="" hopcount="50" to="channel2"
from="channel1" allowPurge="false" archival="false"/>
</ChannelJoinSet>
</ChannelEntry>
<ChannelEntry>
<ChannelAttributesEntry name="channel2" TTL="0" capacity="0" EID="0"
clusterWide="true" jmsEngine="false" mergeEngine="false"
type="RELIABLE_TYPE"/>
<StorePropertiesEntry HonorCapacityWhenFull="false"
SyncOnEachWrite="false"
SyncMaxBatchSize="0" SyncBatchTime="0"
PerformAutomaticMaintenance="false"
EnableCaching="true" CacheOnReload="true"
EnableReadBuffering="true"
ReadBufferSize="10240" Priority="4" EnableMulticast="false"
StampDictionary="0" MultiFileEventsPerSpindle="50000"/>
<DurableSet>
<durableEntry name="durable1" EID="-1" outstandingEvents="0"
clusterWide="true" persistent="true"
priorityEnabled="false" shared="true"/>
</DurableSet>
</ChannelEntry>
</ChannelSet>
<QueueSet>
<QueueEntry>
<ChannelAttributesEntry name="queue1" TTL="0" capacity="0" EID="0"
clusterWide="true" jmsEngine="false" mergeEngine="false"
type="RELIABLE_TYPE"/>
<StorePropertiesEntry HonorCapacityWhenFull="false"
SyncOnEachWrite="false" SyncMaxBatchSize="0" SyncBatchTime="0"
PerformAutomaticMaintenance="true" EnableCaching="true"
CacheOnReload="true" EnableReadBuffering="true"
ReadBufferSize="10240" Priority="4" EnableMulticast="false"
StampDictionary="0" MultiFileEventsPerSpindle="50000"/>
</QueueEntry>
</QueueSet>
</NirvanaRealm>
XML configuration file clustered_realm1.xml for realm1:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<NirvanaRealm name="realm1" exportDate="2016-11-16Z"
comment="Realm configuration from realm1"
version="BuildIdentifier" buildInfo="BuildIdentifier">
<ClusterSet>
<ClusterEntry name="cluster1">
<ClusterMember name="realm1" rname="nsp://localhost:11010/"
canBeMaster="true"/>
<ClusterMember name="realm0" rname="nsp://localhost:11000/"
canBeMaster="true"/>
</ClusterEntry>
</ClusterSet>
<ChannelSet>
<ChannelEntry>
<ChannelAttributesEntry name="channel1" TTL="0" capacity="5" EID="0"
clusterWide="true" jmsEngine="false" mergeEngine="false"
type="RELIABLE_TYPE"/>
<StorePropertiesEntry HonorCapacityWhenFull="false"
SyncOnEachWrite="false" SyncMaxBatchSize="0" SyncBatchTime="0"
PerformAutomaticMaintenance="false" EnableCaching="true"
CacheOnReload="true" EnableReadBuffering="true"
ReadBufferSize="10240" Priority="4" EnableMulticast="false"
StampDictionary="0" MultiFileEventsPerSpindle="50000"/>
<ChannelJoinSet>
<ChannelJoinEntry filter="" hopcount="10" to="channel2"
from="channel1" allowPurge="false" archival="false"/>
</ChannelJoinSet>
</ChannelEntry>
<ChannelEntry>
<ChannelAttributesEntry name="channel2" TTL="0" capacity="0" EID="0"
clusterWide="true" jmsEngine="false" mergeEngine="false"
type="RELIABLE_TYPE"/>
<StorePropertiesEntry HonorCapacityWhenFull="false"
SyncOnEachWrite="false" SyncMaxBatchSize="0" SyncBatchTime="0"
PerformAutomaticMaintenance="false" EnableCaching="true"
CacheOnReload="true" EnableReadBuffering="true"
ReadBufferSize="10240" Priority="4" EnableMulticast="false"
StampDictionary="0" MultiFileEventsPerSpindle="50000"/>
</ChannelEntry>
</ChannelSet>
<DataGroupSet>
<DataGroupEntry>
<DataGroupAttributesEntry name="dg1" id="3422373812" priority="1"
multicastenabled="false"/>
</DataGroupEntry>
</DataGroupSet>
</NirvanaRealm>
From a first analysis we can say that these two realms belong to the same cluster (cluster1) and that they both contain various stores, joins and data groups. But let's see what happens when we run the HealthChecker tool, specifying both XML files and running all the checks. Note that we need to exclude the ServerProtectionConsistencyCheck, since the specified XML files do not contain the RealmConfiguration section.
Here is the call of the tool (using Windows syntax) and the result:
runUMTool.bat HealthChecker -xml=clustered_realm0.xml,clustered_realm1.xml
-mode=all -exclude=ServerProtectionConsistencyCheck
HealthChecker Tool - Version: 1.0
XML JOIN MISMATCHES CHECK
ERROR: Join from (channel1) to (channel2) HopCount mismatch [realm1] does not
equal [realm0]
XML JNDI PROPERTIES CHECK
WARN: Realm realm0: No JNDI entry for store channel1
WARN: Realm realm0: No JNDI entry for store channel2
WARN: Realm realm0: No JNDI entry for store queue1
WARN: Realm realm1: No JNDI entry for store channel1
WARN: Realm realm1: No JNDI entry for store channel2
XML DURABLE STATUS CHECK
ERROR: Could not find durable (durable1) on realm [realm1] but it is present
on [realm0]
XML STORE MISMATCHES CHECK
WARN: Store (channel1) Type mismatch [realm1] does not equal [realm0]
ERROR: Could not find store (queue1) on realm [realm1] but it is present
on realm [realm0]
XML DATAGROUP MISMATCHES CHECK
ERROR: Could not find Data Group (dg1) on realm [realm0] but it is present
on realm [realm1]
These errors and warnings tell us:
Joins: a join between two clusterwide channels has to be the same on all the nodes; in our case there is a mismatch in the HopCount.
JNDIs: these warnings simply ask: are you sure that you don't need a JNDI entry for these stores?
Durables: if a durable is clusterwide, it has to be present on all the other nodes, and it has to be the same.
Stores: if a store is clusterwide, it has to be the same on all the other nodes.
Datagroups: the same rule applies for datagroups, which are always clusterwide and have to be present on all the other nodes.
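After correcting the reported mismatches, you could re-run just the affected checks to confirm that the two configurations are now coherent, for example (the selection of checks shown here is only one possible choice):
runUMTool.bat HealthChecker -xml=clustered_realm0.xml,clustered_realm1.xml
-check=JoinMismatchCheck,DurableMismatchCheck,StoreMismatchCheck,DataGroupMismatchCheck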