Infrastructure Metrics

Command / Metric	Description
curl -X GET http://localhost:9240/_nodes/stats/process?pretty	This command retrieves the CPU utilization of the API Data Store pods.
$.nodes.nodeid.process.cpu.percent	This JSON path expression retrieves the percentage of CPU used by an API Data Store pod. If a pod uses 80% of the CPU for more than 15 minutes, consider the severity as WARNING and perform the following steps to identify the cause of the high CPU usage: 1. Identify the process that consumes the most CPU. 2. Generate a thread dump. 3. Analyze the thread dump to identify thread locks. If a pod uses 90% of the CPU for more than 15 minutes, check the following Prometheus metrics: elasticsearch_os_cpu_percent and elasticsearch_process_cpu_percent.
elasticsearch_os_cpu_percent	If elasticsearch_os_cpu_percent is more than 90%, consider the severity as CRITICAL and perform the following steps: 1. Restart the pod. 2. Check the readiness and liveness of the pod.
elasticsearch_process_cpu_percent	If elasticsearch_process_cpu_percent is more than 90%, consider the severity as CRITICAL and add a new node to the cluster. To learn more about how to add a new API Data Store node, see Adding New Nodes to an Elasticsearch Cluster.
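
The CPU checks in the table above can be sketched in Python. This is a minimal illustration, not a supported tool: the function names and the severity labels (OK, WARNING, CHECK_PROMETHEUS) are hypothetical, the thresholds are applied as "at or above" by assumption, and the caller is assumed to supply readings of process.cpu.percent that together cover the 15-minute window.

```python
import json

# Thresholds taken from the table above.
WARNING_PCT = 80
PROMETHEUS_CHECK_PCT = 90
CRITICAL_PCT = 90


def cpu_percent_by_node(stats):
    """Map node ID -> process.cpu.percent from a _nodes/stats/process response."""
    return {
        node_id: node["process"]["cpu"]["percent"]
        for node_id, node in stats.get("nodes", {}).items()
    }


def classify_cpu(samples):
    """Classify readings that span the 15-minute window.

    A level applies only if every reading stays at or above its threshold
    (assumption: the table does not say how the window is sampled).
    """
    if not samples:
        return "OK"
    sustained = min(samples)
    if sustained >= PROMETHEUS_CHECK_PCT:
        return "CHECK_PROMETHEUS"
    if sustained >= WARNING_PCT:
        return "WARNING"
    return "OK"


def prometheus_actions(os_cpu_percent, process_cpu_percent):
    """Return the CRITICAL actions that the table lists for each metric."""
    actions = []
    if os_cpu_percent > CRITICAL_PCT:
        actions.append("CRITICAL: restart the pod, then check readiness and liveness")
    if process_cpu_percent > CRITICAL_PCT:
        actions.append("CRITICAL: add a new node to the cluster")
    return actions


if __name__ == "__main__":
    # Hypothetical, heavily trimmed response body.
    sample = json.loads('{"nodes": {"node-a": {"process": {"cpu": {"percent": 83}}}}}')
    print(cpu_percent_by_node(sample))
    print(classify_cpu([83, 85, 81]))
    print(prometheus_actions(95, 40))
```

Taking the minimum of the readings means a single dip below the threshold resets the classification, which is one reasonable reading of "for more than 15 minutes".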

Command	Description
curl -X GET http://localhost:9240/_nodes/stats/fs	This command retrieves the disk space statistics of all API Data Store nodes, including the disk space available on each node. For more information about Elasticsearch node statistics, see Elasticsearch documentation.
$.nodes..fs.total.total_in_bytes	This JSON path expression retrieves the total disk space.
$.nodes..fs.total.free_in_bytes	This JSON path expression retrieves the free disk space.
$.nodes..fs.total.available_in_bytes	This JSON path expression retrieves the available disk space.
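
The three values above can be combined into a disk usage percentage per node and an average for the cluster. The following Python sketch assumes that used space is computed as total minus available bytes; the function names are hypothetical, and the input is the parsed JSON body returned by the _nodes/stats/fs command.

```python
def disk_usage_percent(fs_total):
    """Percentage of disk used, from one node's fs.total object."""
    total = fs_total["total_in_bytes"]
    available = fs_total["available_in_bytes"]
    if total <= 0:
        return 0.0
    # Assumption: used space = total - available.
    return 100.0 * (total - available) / total


def usage_by_node(stats):
    """Map node ID -> disk usage percentage."""
    return {
        node_id: disk_usage_percent(node["fs"]["total"])
        for node_id, node in stats.get("nodes", {}).items()
    }


def average_usage(stats):
    """Average disk usage percentage across all nodes in the response."""
    values = list(usage_by_node(stats).values())
    return sum(values) / len(values) if values else 0.0


if __name__ == "__main__":
    # Hypothetical, heavily trimmed response body with two nodes.
    sample = {
        "nodes": {
            "node-a": {"fs": {"total": {"total_in_bytes": 1000, "available_in_bytes": 250}}},
            "node-b": {"fs": {"total": {"total_in_bytes": 1000, "available_in_bytes": 750}}},
        }
    }
    print(usage_by_node(sample))
    print(average_usage(sample))
```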

Command	Description
curl -X GET http://localhost:9240/_cluster/settings?pretty	This command retrieves the disk-based shard allocation settings configured in API Data Store. To learn more about disk-based shard allocation, see Elasticsearch documentation. Shard allocation is governed by three thresholds, known as the Low, High, and Flood watermarks.
Shard allocation: Low $.persistent.cluster.routing.allocation.disk.watermark.low	The default threshold for this level is 80%. Once the threshold is reached, API Data Store does not allocate new shards to nodes that have used more than 80% of their disk space. To determine whether the disk usage has reached the Low stage, use the average disk usage of the API Data Store cluster (or the disk usage of the node in a standalone deployment). If this value exceeds the defined threshold (80%), the disk has reached the Low stage. If your disk usage has reached the Low stage, perform the following steps: 1. Query the transaction event index size and verify whether the index is above 525 GB (HA) or 175 GB (single node). If the limit is already breached, check that the purge scripts are running and that the index size is decreasing. 2. Verify whether the size of the transaction event indexes accounts for the used disk space (within a range of 25 GB). If it does not, other items such as large log files or heap dumps are occupying the space. Clear the logs and heap dumps. 3. Repeat the above steps until the transaction event index is less than 525 GB and the average disk usage of the cluster is less than 80%.
Shard allocation: High $.persistent.cluster.routing.allocation.disk.watermark.high	The default threshold for this level is 85%. Once the threshold is reached, API Data Store attempts to relocate shards away from any node whose disk usage is above 85%. To determine whether the disk usage has reached the High stage, use the average disk usage of the API Data Store cluster (or the disk usage of the node in a standalone deployment). If this value exceeds the defined threshold (85%), the disk has reached the High stage. If your disk usage has reached the High stage, perform the following steps: 1. Query the transaction event index size and verify whether the index is above 525 GB (HA) or 175 GB (single node). If the limit is already breached, check that the purge scripts are running and that the index size is decreasing. 2. Verify whether the size of the transaction event indexes accounts for the used disk space (within a range of 25 GB). If it does not, other items such as large log files or heap dumps are occupying the space. Clear the logs and heap dumps. 3. Repeat the above steps until the transaction event index is less than 525 GB and the average disk usage of the cluster is less than 85%.
Shard allocation: Flood $.persistent.cluster.routing.allocation.disk.watermark.flood_stage	The default threshold for this level is 90%. Once the threshold is reached, API Data Store enforces a read-only index block (index.blocks.read_only_allow_delete) on every index that has one or more shards allocated on a node with at least one disk exceeding the flood stage. This is the last resort to prevent nodes from running out of disk space. To determine whether the disk usage has reached the Flood stage, use the average disk usage of the API Data Store cluster (or the disk usage of the node in a standalone deployment). If this value exceeds the defined threshold (90%), the disk is in the Flood stage. If your disk usage has reached the Flood stage, perform the following steps: 1. Monitor the purging of data and ensure that the purge runs and that the disk space occupancy decreases. 2. If the situation is due to a spike in the number or size of requests, follow up with the customer to understand the reason for the sudden spike, and advise the customer either to compress the payload for transaction logging or not to store the request or response.
curl -X GET http://localhost:9240/_nodes/stats/metric	This command retrieves the statistics for a specific metric. Replace metric with the name of the metric, such as fs, http, os, or process. For more information about the corresponding metrics, see Elasticsearch documentation.
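
The Low, High, and Flood checks above can be sketched as a small classifier. This Python example is illustrative only: the function names and stage labels are hypothetical, the fallback values are the defaults stated in the table, and the sketch handles only percentage watermarks (byte values such as "50gb" are ignored and the default is kept). Elasticsearch names the flood setting flood_stage, so that is the key the sketch reads.

```python
# Defaults as stated in the table above (percent of disk used).
DEFAULT_WATERMARKS = {"low": 80.0, "high": 85.0, "flood_stage": 90.0}


def watermarks(settings):
    """Read percentage watermarks from a parsed _cluster/settings response."""
    configured = (
        settings.get("persistent", {})
        .get("cluster", {})
        .get("routing", {})
        .get("allocation", {})
        .get("disk", {})
        .get("watermark", {})
    )
    result = dict(DEFAULT_WATERMARKS)
    for key in DEFAULT_WATERMARKS:
        value = configured.get(key)
        # Only values such as "80%" are parsed; anything else keeps the default.
        if isinstance(value, str) and value.strip().endswith("%"):
            result[key] = float(value.strip().rstrip("%"))
    return result


def disk_stage(usage_percent, marks):
    """Return the highest stage whose threshold the usage exceeds."""
    if usage_percent > marks["flood_stage"]:
        return "FLOOD"
    if usage_percent > marks["high"]:
        return "HIGH"
    if usage_percent > marks["low"]:
        return "LOW"
    return "OK"


if __name__ == "__main__":
    # Hypothetical response in which only the low watermark is set explicitly.
    response = {
        "persistent": {
            "cluster": {"routing": {"allocation": {"disk": {"watermark": {"low": "75%"}}}}}
        }
    }
    marks = watermarks(response)
    print(marks)
    print(disk_stage(82.0, marks))
```

Here usage_percent is the average disk usage of the cluster (or the usage of the single node in a standalone deployment), computed separately from the _nodes/stats/fs output.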

Command	Description
http://HOST:9240/_nodes/nodeid/stats/os	This URL retrieves the memory utilization of the API Data Store pods.
http:URL/nodes?v&full_id=true&h=id,name,ip	This URL retrieves the node ID of each API Data Store node. It returns the node ID, the node name, and the node IP address.
$.nodes.nodeid.os.mem.free_percent	This JSON path expression retrieves the percentage of memory that is free; the percentage of memory in use is 100 minus this value. If a pod is using 85% of the available memory, consider the severity as WARNING, identify the process that consumes the most memory, and generate a heap dump. If a pod is using 90% of the available memory, consider the severity as CRITICAL and perform the following steps: 1. Identify the process that consumes the most memory. 2. Generate a heap dump. 3. Restart the pod. 4. Check the readiness and liveness of the pod.
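
The memory thresholds above are stated in terms of used memory, while the JSON path returns free memory, so a conversion is needed. The following Python sketch shows one way to do it. The function names are hypothetical, the thresholds are applied as "at or above" by assumption, and the input is the parsed JSON body of the os statistics for a node.

```python
# Thresholds taken from the table above (percent of memory used).
WARNING_USED_PCT = 85
CRITICAL_USED_PCT = 90


def memory_used_percent(stats, node_id):
    """Convert os.mem.free_percent for one node into a used percentage."""
    free_percent = stats["nodes"][node_id]["os"]["mem"]["free_percent"]
    return 100 - free_percent


def memory_severity(used_percent):
    """Map a used-memory percentage to the severity named in the table."""
    if used_percent >= CRITICAL_USED_PCT:
        return "CRITICAL"
    if used_percent >= WARNING_USED_PCT:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # Hypothetical, heavily trimmed response body.
    sample = {"nodes": {"node-a": {"os": {"mem": {"free_percent": 12}}}}}
    used = memory_used_percent(sample, "node-a")
    print(used, memory_severity(used))
```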