Health Monitor Plugin

The Health Monitor plugin adds an HTTP REST endpoint to the URL of the session to which the realm server is connected. Clients can query this endpoint for the current state of the realm server. The endpoint reports the "liveness" of the server: the plugin returns the results of the health checks that the server runs at periodic intervals.

You can add the plugin using the Add Plugin feature in the Comms > Interfaces > Plugins dialog of the Enterprise Manager. In this case, you define the name of the new URL endpoint in the URL Path field of the Add Plugin dialog.

See the section Plugins for details.

You can also add the plugin using the command line tool AddHealthMonitorPlugin. In this case, you define the name of the new URL endpoint using the -mountpath argument of the tool.

For details of running this command line tool, see the section Syntax: Miscellaneous Tools.

The realm server runs four different tasks at regular intervals to monitor its health status:

The memory monitor task monitors the memory status of the server and produces an alert/error as soon as any memory-related issue is found. The task checks the heap and direct memory usage; if the usage exceeds a threshold value of 95%, this is considered an error, and the error is reported and logged.

The server is not considered to be unhealthy when the first such error occurs; instead a server is considered unhealthy only if the memory monitor task returns an error 3 times consecutively.
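The rule described above (a 95% usage threshold, with the server marked unhealthy only after 3 consecutive errors) can be illustrated with a minimal sketch. The class and parameter names are hypothetical and chosen for illustration; this models the documented behavior and is not the realm server's actual implementation.

```python
from dataclasses import dataclass

MEMORY_THRESHOLD = 0.95   # 95% usage threshold, as stated in the text
UNHEALTHY_AFTER = 3       # consecutive errors before "unhealthy", as stated


@dataclass
class MemoryMonitor:
    """Hypothetical model of the memory monitor task (illustration only)."""
    consecutive_errors: int = 0

    def check(self, heap_used: int, heap_max: int,
              direct_used: int, direct_max: int) -> bool:
        """Record one run of the task; return True if this run is an error."""
        is_error = (heap_used / heap_max > MEMORY_THRESHOLD
                    or direct_used / direct_max > MEMORY_THRESHOLD)
        # A clean run resets the streak; only back-to-back errors accumulate.
        self.consecutive_errors = self.consecutive_errors + 1 if is_error else 0
        return is_error

    @property
    def unhealthy(self) -> bool:
        """True once the error streak reaches the documented limit."""
        return self.consecutive_errors >= UNHEALTHY_AFTER
```

In this sketch a single run below the threshold resets the streak, which is one plausible reading of "3 times consecutively".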

Another task monitors the thread pools of the server. If any thread pool has more than 5 stalled tasks, this is considered an error, but the status is reported as unhealthy only if the error occurs 5 times consecutively.

Another task checks the cluster membership of the server. If a server is configured to be part of a cluster, and the last time that the server successfully joined the cluster is more than 600000 milliseconds (10 minutes) ago, then the server is considered to be unhealthy.
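As a rough illustration of the cluster check, the following sketch applies the 600000 millisecond limit literally as worded above. The function name and its parameters are hypothetical; how the server actually tracks its last successful join is not described here.

```python
MAX_JOIN_AGE_MS = 600_000  # 10 minutes, as stated in the text


def cluster_check_unhealthy(clustered: bool, last_join_ms: int, now_ms: int) -> bool:
    """Hypothetical check: True if the cluster rule marks the server unhealthy.

    clustered    -- whether the server is configured to be part of a cluster
    last_join_ms -- timestamp (ms) of the last successful cluster join
    now_ms       -- current timestamp (ms)
    """
    if not clustered:
        # The rule only applies to servers configured as cluster members.
        return False
    return now_ms - last_join_ms > MAX_JOIN_AGE_MS
```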

The server round trip task checks the processing time in a cluster. nClusterRoundTripEvent events are synchronous events that measure this processing time: the realm server sends these events into the cluster and records the time the cluster takes to complete the processing. If an event takes more than 30 seconds to be processed and acknowledged, this is considered to be an error. Receiving no acknowledgement for the event is also considered to be an error. If 5 such errors occur consecutively, the server is considered to be unhealthy.
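The round trip rule combines two error conditions (a slow acknowledgement and a missing one) with a streak of 5. A minimal sketch of that logic follows; the function names are hypothetical, and a missing acknowledgement is modeled as None purely for illustration.

```python
from typing import Iterable, Optional

ROUND_TRIP_LIMIT_S = 30.0        # seconds, as stated in the text
ROUND_TRIP_UNHEALTHY_AFTER = 5   # consecutive errors, as stated in the text


def is_round_trip_error(elapsed_s: Optional[float]) -> bool:
    """An error is a missing acknowledgement (None) or a time over the limit."""
    return elapsed_s is None or elapsed_s > ROUND_TRIP_LIMIT_S


def round_trip_unhealthy(history: Iterable[Optional[float]]) -> bool:
    """True if the most recent results end in a long enough error streak."""
    streak = 0
    for elapsed in history:
        # Any successful round trip resets the streak.
        streak = streak + 1 if is_round_trip_error(elapsed) else 0
    return streak >= ROUND_TRIP_UNHEALTHY_AFTER
```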

If the server is fully operational, and is an active member of the cluster (if a cluster is configured), the query returns a response "OK" of the following form:

Even if the return code is "OK", the response can contain additional information (in JSON format). This information can be in the form of useful statistics, or as an indication that the server is approaching certain limits.

If the server is not fully operational, the query returns a status "ERROR" with an appropriate description of the problem, for example:

{
  "ServerStatus": "ERROR",
  "ServerStatusDetails": {
    "MemoryHealthMonitor": "Max threshold of used Heap memory is exceeded, Heap memory used - 338 MB"
  }
}
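A client that polls the endpoint would typically parse the JSON body and branch on the status. The sketch below handles only the parsing step, using the error response shown above as input. It assumes that a healthy response carries "OK" in the same ServerStatus field, which the example above does not confirm; the function name is hypothetical.

```python
import json


def summarize_health(body: str):
    """Parse a health response body; return (healthy, details).

    Assumption: a healthy server reports "OK" in the "ServerStatus" field,
    mirroring the "ERROR" value seen in the documented example.
    """
    payload = json.loads(body)
    healthy = payload.get("ServerStatus") == "OK"
    # Details may be present even for a healthy server, so always return them.
    details = payload.get("ServerStatusDetails") or {}
    return healthy, details


sample = (
    '{"ServerStatus":"ERROR","ServerStatusDetails":'
    '{"MemoryHealthMonitor":'
    '"Max threshold of used Heap memory is exceeded, Heap memory used - 338 MB"}}'
)
healthy, details = summarize_health(sample)
for monitor, message in details.items():
    print(f"{monitor}: {message}")
```

Fetching the body over HTTP is left out on purpose, because the endpoint URL depends on the URL path or -mountpath value chosen when the plugin was added.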