Infrastructure Metrics
Infrastructure metrics include system metrics and container metrics. For information about container metrics, see Container Metrics.
System Metrics
Monitor the following system metrics to analyze API Data Store health.
*CPU usage
*Disk usage
*Memory usage
Monitor the CPU usage
To ensure that the CPU is not overutilized, you must monitor CPU health regularly. You can monitor CPU usage at two levels: process level and OS level. If the process-level CPU usage exceeds the threshold limits, you can distribute the load across the cluster. However, if the OS-level CPU has reached its limits, you must contact your IT team.
Command / Metric
Description
curl -X GET http://localhost:9240/_nodes/stats/process?pretty
This command retrieves the CPU utilization by the API Data Store pods.
$.nodes.nodeid.process.cpu.percent
This JSON path expression retrieves the percentage of CPU usage by an API Data Store pod.
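For example, assuming the jq command-line JSON processor is available, you can combine the command and JSON path expression above to list the process-level CPU usage per node:
# Print the CPU percentage reported by each API Data Store node.
# Assumes jq is installed; host and port follow the command shown above.
curl -s "http://localhost:9240/_nodes/stats/process" | jq '.nodes[] | {name, cpu_percent: .process.cpu.percent}'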
If a pod uses 80% of the CPU for more than 15 minutes, consider the severity as WARNING and perform the following steps to identify the cause of the high CPU usage (sample commands follow these steps).
1. Identify the process that consumes the highest CPU.
2. Generate the thread dump.
3. Analyze the thread dump to identify the thread locks.
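The following is a minimal sketch of these steps on a Linux node, assuming shell access to the pod and a JDK that provides jstack; <elasticsearch_pid> is a placeholder for the process ID of the API Data Store Java process.
# Step 1: show the busiest threads of the API Data Store process.
top -H -p <elasticsearch_pid>
# Step 2: write a thread dump for offline analysis of thread locks.
jstack <elasticsearch_pid> > /tmp/api-data-store-threads.txt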
If a pod uses 90% of the CPU for more than 15 minutes, check the following Prometheus metrics:
*elasticsearch_os_cpu_percent
*elasticsearch_process_cpu_percent
elasticsearch_os_cpu_percent
If elasticsearch_os_cpu_percent is more than 90%, consider the severity as CRITICAL and perform the following steps (a sample Kubernetes sequence follows these steps).
1. Restart the pod.
2. Check the readiness and liveness of the pod.
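If your API Data Store runs on Kubernetes, a sketch of these two steps could look as follows; the pod and namespace names are placeholders.
# Restart the pod by deleting it; the controlling StatefulSet or Deployment recreates it.
kubectl delete pod <api-data-store-pod> -n <namespace>
# Check that the recreated pod reports Ready and has no recent restarts.
kubectl get pod <api-data-store-pod> -n <namespace>
# Inspect pod conditions and events for readiness or liveness probe failures.
kubectl describe pod <api-data-store-pod> -n <namespace>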
elasticsearch_process_cpu_percent
If elasticsearch_process_cpu_percent is more than 90%, consider the severity as CRITICAL and add a new node to the cluster. To learn more about how to add a new API Data Store node, see Adding New Nodes to an Elasticsearch Cluster.
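To check these Prometheus metrics outside a dashboard, you can query the Prometheus HTTP API directly. The following sketch assumes a Prometheus server that already scrapes these metrics and is reachable at the placeholder PROMETHEUS_HOST.
# Highest process-level CPU usage observed over the last 15 minutes.
curl -s "http://PROMETHEUS_HOST:9090/api/v1/query" --data-urlencode 'query=max_over_time(elasticsearch_process_cpu_percent[15m])'
# The same check for the OS-level CPU metric.
curl -s "http://PROMETHEUS_HOST:9090/api/v1/query" --data-urlencode 'query=max_over_time(elasticsearch_os_cpu_percent[15m])'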
Monitor the Disk usage
To ensure that all nodes have enough disk space, Software AG recommends that you monitor the disk space regularly.
Command
Description
curl -X GET http://localhost:9240/_nodes/stats/fs
This command retrieves the disk space of the API Data Store nodes. It lists the disk space available in all nodes.
For more information about Elasticsearch node statistics, see Elasticsearch documentation.
$.nodes..fs.total.total_in_bytes
This JSON path expression retrieves the total disk space.
$.nodes..fs.total.free_in_bytes
This JSON path expression retrieves the free disk space.
$.nodes..fs.total.available_in_bytes
This JSON path expression retrieves the available disk space.
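For example, assuming jq is available, the following command combines these JSON paths to print the total and available disk space and the used percentage per node:
# Per-node disk usage derived from the file system statistics.
curl -s "http://localhost:9240/_nodes/stats/fs" | jq '.nodes[] | {name, total_bytes: .fs.total.total_in_bytes, available_bytes: .fs.total.available_in_bytes, used_percent: ((1 - .fs.total.available_in_bytes / .fs.total.total_in_bytes) * 100 | floor)}'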
Disk-based shard allocations
Note:
500 GB (HA) / 150 GB (single node) is used here as an example of the maximum data retention.
Command
Description
curl -X GET http://localhost:9240/_cluster/settings?pretty
This command retrieves the configured disk-based shard allocations in API Data Store. To learn more about disk-based shard allocations, see Elasticsearch documentation.
Shard allocation is based on three thresholds, known as the Low, High, and Flood watermarks.
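For example, to see the effective watermark values, including defaults, you can add the include_defaults and flat_settings parameters to the cluster settings call shown above and filter the output:
# Show the effective low, high, and flood-stage watermark settings.
curl -s "http://localhost:9240/_cluster/settings?include_defaults=true&flat_settings=true&pretty" | grep watermark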
Shard allocation: Low
$.persistent.cluster.routing.allocation.disk.watermark.low
The default threshold for this level is 80%. Once the threshold is reached, API Data Store does not allocate new shards to nodes that have used more than 80% of their disk space. To check whether the disk usage has reached the Low stage, compute the average disk usage of the API Data Store cluster (or of the standalone node). If the result exceeds the defined threshold (80%), the disk has reached the Low stage. If your disk usage has reached the Low stage, perform the following steps:
1. Query the transaction event index size and verify whether it is above 525 GB (HA) / 175 GB (single node). If the limit is already breached, monitor whether the purge scripts are running and the index size is decreasing. A sample size query follows these steps.
2. Verify whether the size of the transaction event indexes accounts for the used disk space (within a range of about 25 GB). If it does not, other items, such as grown log files or heap dumps, are occupying significant space. Clear the logs and heap dumps.
3. Repeat the above steps until the transaction event index size is less than 525 GB and the average disk usage of the cluster is less than 80%.
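The following is a sample query for the index size check in step 1; <transaction_event_index_pattern> is a placeholder for the index name pattern that your API Data Store uses for transaction events.
# List the transaction event indexes with their sizes, largest first.
curl -s "http://localhost:9240/_cat/indices/<transaction_event_index_pattern>?v&h=index,pri,rep,store.size&s=store.size:desc"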
Shard allocation: High
$.persistent.cluster.routing.allocation.disk.watermark.high
The default threshold for this level is 85%. Once the threshold is reached, API Data Store attempts to relocate shards away from any node whose disk usage is above 85%. To check whether the disk usage has reached the High stage, compute the average disk usage of the API Data Store cluster (or of the standalone node). If the result exceeds the defined threshold (85%), the disk has reached the High stage. If your disk usage has reached the High stage, perform the following steps:
1. Query the transaction event index size and verify whether it is above 525 GB (HA) / 175 GB (single node). If the limit is already breached, monitor whether the purge scripts are running and the index size is decreasing.
2. Verify whether the size of the transaction event indexes accounts for the used disk space (within a range of about 25 GB). If it does not, other items, such as grown log files or heap dumps, are occupying significant space. Clear the logs and heap dumps.
3. Repeat the above steps until the transaction event index size is less than 525 GB and the average disk usage of the cluster is less than 85%.
Shard allocation: Flood
$.persistent.cluster.routing.allocation.disk.watermark.flood_stage
The default threshold for this level is 90%. Once the threshold is reached, API Data Store enforces a read-only index block (index.blocks.read_only_allow_delete) on every index that has one or more shards allocated on the node that has at least one disk exceeding the flood stage. This is the last resort to prevent nodes from running out of disk space.
To check whether the disk usage has reached the Flood stage, compute the average disk usage of the API Data Store cluster (or of the standalone node). If the result exceeds the defined threshold (90%), the disk is in the Flood stage. If your disk usage has reached the Flood stage, perform the following steps:
*Monitor the purging of data and ensure that purging runs and the disk space occupancy is reduced. A sample command for clearing the read-only block after space is freed follows this list.
*If this situation is due to a spike in the request count or request size, follow up with the customer to understand the reason for the sudden spike, and ask the customer to compress the payload for transaction logging or to avoid storing the request or response.
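Once purging has brought disk usage back below the thresholds, the read-only block described above may need to be cleared explicitly; recent Elasticsearch versions release it automatically when usage falls below the high watermark, so treat the following call as an illustrative sketch rather than a required step.
# Remove the read_only_allow_delete block from all indexes after disk space is freed.
curl -X PUT "http://localhost:9240/_all/_settings" -H 'Content-Type: application/json' -d '{"index.blocks.read_only_allow_delete": null}'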
curl -X GET http://localhost:9240/_nodes/stats/metric
This command retrieves information about specific metrics like fs, http, os, process, and so on.
For more information about the corresponding metrics, see Elasticsearch documentation.
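For example, several sections of the node statistics can be requested in a single call by listing them as comma-separated values of the metric path parameter:
# Retrieve the fs, os, process, and http sections of the node statistics in one call.
curl -X GET "http://localhost:9240/_nodes/stats/fs,os,process,http?pretty"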
Monitor the Memory usage
Command
Description
http://HOST:9240/_nodes/nodeid/stats/os
This URL retrieves the memory status utilized by the API Data Store pods.
http://HOST:9240/_cat/nodes?v&full_id=true&h=id,name,ip
This URL retrieves the node IDs of the API Data Store nodes. It returns the node ID, node name, and node IP address.
$.nodes.nodeid.os.mem.free_percent
This JSON expression retrieves the percentage of memory that is free.
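For example, assuming jq is available, the following commands first list the node IDs and then print the free-memory percentage of one node; HOST and <nodeid> are placeholders.
# List node IDs, names, and IP addresses.
curl -s "http://HOST:9240/_cat/nodes?v&full_id=true&h=id,name,ip"
# Print the free-memory percentage reported by a specific node.
curl -s "http://HOST:9240/_nodes/<nodeid>/stats/os" | jq '.nodes[] | {name, mem_free_percent: .os.mem.free_percent}'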
If a pod uses 85% of the available memory, consider the severity as WARNING. Identify the process that consumes the most memory and generate a heap dump.
If a pod uses 90% of the available memory, consider the severity as CRITICAL and perform the following steps to identify the reason (sample commands follow these steps).
1. Identify the process that consumes the most memory.
2. Generate the heap dump.
3. Restart the pod.
4. Check the readiness and liveness of the pod.
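The following is a minimal sketch of steps 1 and 2 on a Linux node, assuming shell access to the pod and a JDK that provides jmap; <elasticsearch_pid> is a placeholder for the process ID of the API Data Store Java process.
# Step 1: list processes sorted by memory usage to find the biggest consumer.
top -o %MEM
# Step 2: capture a heap dump for offline analysis before restarting the pod.
jmap -dump:live,format=b,file=/tmp/api-data-store-heap.hprof <elasticsearch_pid>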