Cluster-level Monitoring

Troubleshooting: Cluster-level Monitoring

Cluster-level monitoring helps ensure the service availability of API Gateway, that is, that API Gateway is accessible and functional (able to serve API requests). Through cluster-level monitoring, you can check:

Whether the runtime is available and ready to serve traffic.

Whether the API Gateway administrator console is accessible.

How do I monitor the cluster health of API Gateway?

You can set up the Readiness Probe, Runtime Service Health Probe, and Administration Service Health Probe to monitor the cluster health.

Requirement	Type of Impact	Solution
For API Gateway, is there an endpoint that returns yes or no about its service availability, that is, readiness for serving the incoming API requests?	Business Impact. To know if there is an outage in API Gateway.	Use Readiness Probe.
For API Gateway, is there an endpoint that indicates the availability of the administrator user consoles?	Operational Impact. To know if the administrator user console is available.	Use Administration Service Health Probe.
For API Gateway, is there an endpoint that indicates the cluster health and its details?	Technical Impact. To know the details about where the fault lies when there is a cluster failure.	Use Runtime Service Health Probe.

How do the probes help in cluster-level monitoring?

	Readiness Probe	Runtime Service Health Probe	Administration Service Health Probe
What is it?	Indicates if the traffic-serving port of API Gateway is ready to accept requests.	Reports on the overall cluster health and indicates if the components of API Gateway are in an operational state.	Indicates if the API Gateway administrator console is available and accessible.
When is it used?	To continuously check and report on the service availability of API Gateway.	To continuously report on the cluster health with the details of the components involved in clustering.	To continuously report on the availability of the administrator console and API analytics.

Note:
The points in the table are also applicable to scenarios where the cluster health is NOT OK, for example, API Data Store or Terracotta failure. Such scenarios do not always mean an outage. API Gateway may still be able to process the requests.

How do I set up probes?

Prerequisites:

You must have valid API Gateway user credentials to use the Readiness Probe, Runtime Service Health Probe, and Administration Service Health Probe.

All the cluster-level probes must be set up to target the API Gateway load balancer endpoint.

Software AG recommends that you set up a dedicated port for monitoring with an appropriate private thread pool.

Readiness Probe at Cluster-Level

To monitor the readiness of API Gateway, that is, to check whether API Gateway is ready to accept requests, use the following REST endpoint:

GET /rest/apigateway/health

The following table shows the response code and the description.

Response	Description
200 OK	Readiness check is successful. Readiness probe continues to reply OK if API Gateway remains in an operational state to serve the requests.
500 Internal server error	Readiness check failed and denotes a problem.
timeout or no response as the request did not reach the probe	Several factors can delay the Readiness Probe when it initiates, which may result in timeout errors. For the reasons, see Causes for timeout errors.

Note:
Because this is a Readiness Probe, only the response status code is essential. By design, no JSON payload is returned in the response in either the success or the failure scenario.
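
As an illustration of how a monitoring script might call the Readiness Probe and interpret the three outcomes in the table, the following Python sketch uses only the standard library. The host name, user name, and password are placeholders, HTTP Basic authentication is an assumption about how the credentials are supplied, and the labels ready, not_ready, and unreachable are invented for this example; they are not part of API Gateway.

```python
import base64
import urllib.error
import urllib.request

READINESS_PATH = "/rest/apigateway/health"  # endpoint documented above


def build_probe_request(base_url, user, password, path=READINESS_PATH):
    """Build a GET request carrying an HTTP Basic header (assumed auth scheme)."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(base_url.rstrip("/") + path, method="GET")
    request.add_header("Authorization", "Basic " + token)
    return request


def classify_readiness(status_code):
    """Map a probe outcome to a label; None stands for timeout or no response."""
    if status_code is None:
        return "unreachable"
    if status_code == 200:
        return "ready"
    return "not_ready"  # 500 (or any other code) denotes a problem


def check_readiness(base_url, user, password, timeout_seconds=5.0):
    """Call the probe once and return the label for the outcome."""
    request = build_probe_request(base_url, user, password)
    try:
        with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
            return classify_readiness(response.status)
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status such as 500.
        return classify_readiness(error.code)
    except OSError:
        # URLError and socket timeouts are OSError subclasses: no usable response.
        return classify_readiness(None)
```

A caller would pass the load balancer address, for example check_readiness("https://lb.example.com:5555", "monitor", "secret"), where every value shown is hypothetical. Because the probe returns no JSON payload, the sketch never reads the response body.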

Runtime Service Health Probe at Cluster-Level

To monitor the runtime service health of API Gateway, that is, to check the cluster health of API Gateway, use the following REST endpoint:

GET /rest/apigateway/health/engine

The following table shows the response code and the description.

Response	Description
200 OK	Runtime service health check is successful.
500 Internal server error	Runtime service health check failed and denotes a problem. The response JSON indicates the problem.
timeout or no response as the request did not reach the probe	Several factors can delay the Runtime Service Health Probe when it initiates, which may result in timeout errors. For the reasons, see Causes for timeout errors.

The response JSON of each health check request contains a status field.

The overall status of the API Gateway cluster can be green, yellow, or red.

Status	Description
green	Indicates that the cluster is in a healthy state.
yellow	Indicates that API Gateway does not have adequate resources to run.
red	Indicates a cluster failure and an outage.

The overall status of the API Gateway cluster is assessed based on the API Data Store status, the API Gateway resource status, and the cluster status within nodes.

API Data Store status

Status	Description
green	Indicates that API Data Store is in a healthy state. When the status of API Data Store signals green or yellow, the overall status of API Gateway is green.
red	Indicates cluster failure and an outage. When the status of API Data Store signals red, the overall status of API Gateway is red.
yellow	Indicates a node failure in the cluster. However, the cluster is still functioning and operational.

API Gateway resource status

Status	Description
green	Indicates that API Gateway resource types like memory, disk space, and service threads are available to run.
yellow	Indicates that API Gateway does not have adequate resources to run. When the API Gateway resource status is yellow, the overall status of API Gateway is yellow.

Cluster status within nodes

Status	Description
green	Indicates that the cluster is in a healthy state. The cluster status is green only when the Terracotta Server Array is up and running. When the status of the cluster signals green, the overall status of API Gateway is green.
red	Indicates cluster failure and an outage. When the status of the cluster signals red, the overall status of API Gateway is red.
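
The rules in the three tables above can be read as a small decision function. The Python sketch below is one interpretation of them, not the product's implementation: the function name and argument names are hypothetical, and the precedence of red over yellow when both occur is an assumption, because the tables do not state how conflicting component statuses are combined.

```python
def overall_status(data_store, resources, cluster):
    """Derive an overall status from the three component statuses.

    Rules read from the tables above:
      - API Data Store red or cluster red -> overall red
      - API Gateway resources yellow      -> overall yellow
      - API Data Store yellow on its own  -> overall stays green
    Assumption: red takes precedence over yellow.
    """
    if "red" in (data_store, cluster):
        return "red"
    if resources == "yellow":
        return "yellow"
    return "green"


# A single failed API Data Store node (yellow) does not degrade the overall status.
print(overall_status("yellow", "green", "green"))
```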

A sample HTTP response is as follows:

{
  "status": "green",
  "elasticsearch": {
    "cluster_name": "api_gateway_cluster",
    "status": "green",
    "number_of_nodes": "3",
    "number_of_data_nodes": "3",
    "timed_out": "false",
    "active_shards": "200",
    "initializing_shards": "0",
    "unassigned_shards": "0",
    "task_max_waiting_in_queue_millis": "0",
    "node": "localhost:9240",
    "response_time_ms": "4"
  },
  "is": {
    "status": "green",
    "diskspace": {
      "status": "up",
      "free": "14206386176",
      "inuse": "17994313728",
      "threshold": "3220069990",
      "total": "32200699904"
    },
    "memory": {
      "status": "up",
      "freemem": "420766624",
      "maxmem": "2147483648",
      "threshold": "161061273",
      "totalmem": "1610612736"
    },
    "servicethread": {
      "status": "up",
      "avail": "397",
      "inuse": "3",
      "max": "400",
      "threshold": "40"
    },
    "response_time_ms": "309"
  },
  "cluster": {
    "status": "green",
    "isClusterAware": "true",
    "nodes": "3",
    "response_time_ms": "518"
  }
}

The overall cluster status of API Gateway is green since all components work as expected.
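
To show how a monitoring tool might read such a response, the following Python sketch extracts the overall status, the status of each component, and the slowest component. It relies only on the field names visible in the sample (elasticsearch, is, cluster, status, response_time_ms); the function name and the shape of the returned summary are hypothetical, and the layout of a failure response is not shown in this document, so missing fields are tolerated rather than assumed. Note that the sample encodes numbers as strings, so they are converted explicitly.

```python
import json

COMPONENT_KEYS = ("elasticsearch", "is", "cluster")  # keys seen in the sample response


def summarize_engine_health(payload):
    """Summarize a runtime service health response given as JSON text."""
    body = json.loads(payload)
    present = {key: body[key] for key in COMPONENT_KEYS if isinstance(body.get(key), dict)}
    statuses = {key: part.get("status", "unknown") for key, part in present.items()}
    # Numeric values arrive as strings in the sample, so convert before comparing.
    timings = {
        key: int(part["response_time_ms"])
        for key, part in present.items()
        if "response_time_ms" in part
    }
    return {
        "overall": body.get("status", "unknown"),
        "components": statuses,
        "not_green": sorted(key for key, status in statuses.items() if status != "green"),
        "slowest": max(timings, key=timings.get) if timings else None,
    }


# Abridged copy of the sample response above.
sample = """
{
  "status": "green",
  "elasticsearch": {"status": "green", "response_time_ms": "4"},
  "is": {"status": "green", "response_time_ms": "309"},
  "cluster": {"status": "green", "response_time_ms": "518"}
}
"""
print(summarize_engine_health(sample))
```

For the abridged sample, no component is listed under not_green and the slowest component is cluster, which matches the response times of 4, 309, and 518 milliseconds.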

Administration Service Health Probe at Cluster-Level

To check the availability and health status of the API Gateway administration service (UI, Dashboards) at the cluster level, use the following REST endpoint:

GET /rest/apigateway/health/admin

The following table shows the response code and the description.

Response	Description
200 OK	Administration service health check is successful.
500 Internal server error	Denotes a problem. The response JSON indicates the problem.
timeout or no response as the request did not reach the probe	Several factors can delay the Administration Service Health Probe when you initiate it, which may result in timeout errors. For the reasons, see Causes for timeout errors.

The overall Administration Service Health Probe status can be green or red based on the API Gateway administration service's status and Kibana status.

Kibana status

Status	Description
green	Indicates that Kibana's port is accessible. When the status signals green, the overall status of Administration Service Health Probe is green.
red	Indicates that either Kibana's port is inaccessible or Kibana's communication with API Data Store is not established. When the status signals red, the overall status of Administration Service Health Probe is red.

API Gateway administration service status

Status	Description
green	Indicates that API Gateway administration service is available. When the status signals green, the overall status of Administration Service Health Probe is green.
red	Indicates that API Gateway administration service is not available. When the status signals red, the overall status of Administration Service Health Probe is red.

A sample HTTP response is as follows:

{
  "status": "green",
  "ui": {
    "status": "green",
    "response_time_ms": "40"
  },
  "kibana": {
    "status": {
      "overall": {
        "state": "green",
        "nickname": "Looking good",
        "icon": "success",
        "uiColor": "secondary"
      }
    },
    "response_time_ms": "36"
  }
}

The overall status is green since the API Gateway administration service and Kibana are both in a healthy state.
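
In the sample, the two components report their state at different depths: ui.status is a plain string, while the Kibana state sits under kibana.status.overall.state. The Python sketch below shows one way to read both defensively. The function names are hypothetical, and because this document does not show a failure response, any field that is missing or has an unexpected type is reported as unknown rather than guessed.

```python
import json


def _dig(value, *keys):
    """Follow keys through nested dicts; return 'unknown' if the path breaks."""
    for key in keys:
        if not isinstance(value, dict) or key not in value:
            return "unknown"
        value = value[key]
    return value if isinstance(value, str) else "unknown"


def admin_component_states(payload):
    """Extract the overall, UI, and Kibana states from an admin health response."""
    body = json.loads(payload)
    return {
        "overall": _dig(body, "status"),
        "ui": _dig(body, "ui", "status"),
        "kibana": _dig(body, "kibana", "status", "overall", "state"),
    }


def admin_is_healthy(states):
    """True only when the overall status and both components are green."""
    return all(states[name] == "green" for name in ("overall", "ui", "kibana"))


# Abridged copy of the sample response above.
sample = """
{
  "status": "green",
  "ui": {"status": "green", "response_time_ms": "40"},
  "kibana": {
    "status": {"overall": {"state": "green", "nickname": "Looking good"}},
    "response_time_ms": "36"
  }
}
"""
states = admin_component_states(sample)
print(states, admin_is_healthy(states))
```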