Application Metrics
Monitor the following metrics to analyze API Data Store health.
Index size
Cluster health
Number of shards
GC monitoring
Note:
The threshold values, configurations, and severities mentioned throughout this section are guidelines that Software AG suggests for optimal performance of API Data Store. You can modify these threshold values or define actions based on your operational requirements.
For details about how to generate a thread dump and a heap dump, see
Troubleshooting: Monitoring API Data Store.
If a metric exceeds its threshold value, treat it with the severity mentioned, perform the actions that Software AG recommends to identify and debug the problem, and then contact Software AG for further support.
Index Size
Storing all data in a single index slows down Elasticsearch. Hence, the data must be split across multiple smaller indexes. Advantages of small indexes include:
Faster start-up of Elasticsearch. Multiple smaller indexes instead of one huge index allow Elasticsearch to start up faster.
Faster response. When you store all data in a single index, Elasticsearch slows down because it spends a lot of time on shard allocation. Splitting the data into smaller units avoids this overhead.
Each index has two divisions: the primary shard and the replica shard. The data is first stored in the primary shard, and Elasticsearch then replicates that data to the replica shard. For example, when you allot 25 GB for an index, the space is divided equally between the two divisions. In this example, the sizes of all indexes add up to a maximum of 300 GB. That is, 150 GB is for primary data and 150 GB is for replica shards. Replicating the primary data makes the data highly available.
When the data in an index exceeds a certain limit, you must roll over the index and create a new index. The acceptable size limit of an index depends on its type. Software AG recommends that you specify 25 GB (12.5 GB for each shard) for the transactional events indexes and 5 GB (2.5 GB for each shard) for tracer indexes. For the list of tracer indexes, see
List of Indexes that can be included in backup.
It is essential to monitor the transactional events indexes to prevent them from exceeding 25 GB. For information on calculating index size, see
Calculating index size.
You must roll over an index when the size of the primary shard reaches 12.5 GB. That is, if the size of the primary shard is 12.5 GB, then the size of the replica is also 12.5 GB, and the index totals 25 GB. Hence, you must monitor the size of the primary shard and perform rollovers as and when required.
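The rollover rule above can be sketched as a small check. This is only an illustration: the function names and the `RECOMMENDED_TOTAL_GB` table are hypothetical, and the sketch assumes that 1 GB means 1024^3 bytes.

```python
GIB = 1024 ** 3  # assumption: 1 GB is treated as 1024^3 bytes

# Recommended total index sizes from this section (primary plus replica), in GB.
RECOMMENDED_TOTAL_GB = {"transactional": 25.0, "tracer": 5.0}


def primary_limit_bytes(index_type: str) -> int:
    """Primary shard size at which to roll over: half of the total limit."""
    return int(RECOMMENDED_TOTAL_GB[index_type] / 2 * GIB)


def needs_rollover(primary_size_bytes: int, index_type: str) -> bool:
    """Return True when the primary size has reached the rollover limit."""
    return primary_size_bytes >= primary_limit_bytes(index_type)


if __name__ == "__main__":
    # A 3 GB primary is fine for transactional events but too large for a tracer index.
    print(needs_rollover(3 * GIB, "transactional"))
    print(needs_rollover(3 * GIB, "tracer"))
```

Comparing against the primary size rather than the total keeps the check aligned with what the size query reports.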
When you roll over an index, a new index is created with a primary and a replica for each shard. The naming convention of the new index is
Index_name_YYYYMMDDHHMM. For example,
gateway_default_analytics_transactionalEvents_YYYYMMDDHHMM. For information on creating a rollover, see
Creating Rollover of an Index.
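To make the naming convention concrete, the following sketch formats a timestamp into the YYYYMMDDHHMM suffix. The function `rollover_index_name` is hypothetical and only illustrates the pattern; it is not how the product itself generates the name.

```python
from datetime import datetime


def rollover_index_name(base_name: str, when: datetime) -> str:
    """Append a YYYYMMDDHHMM timestamp suffix to the base index name."""
    return f"{base_name}_{when.strftime('%Y%m%d%H%M')}"


if __name__ == "__main__":
    name = rollover_index_name(
        "gateway_default_analytics_transactionalEvents",
        datetime(2021, 12, 17, 10, 21),
    )
    print(name)
```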
Calculating index size
The query used to calculate the index size returns the size of the primary shard of an index. Hence, you must calculate the actual index size by multiplying the returned size by two. For example, if you want to purge indexes that are larger than 25 GB, then you must purge the indexes whose returned size is 12.5 GB or more.
1. Run the following command:
http://localhost:9240/_cat/indices/gateway_tenant_index_name?v&s=i&format=json&pretty
For example,
http://localhost:9240/_cat/indices/gateway_default_analytics_transactionalevents_1639736462002-000001?v&s=i&format=json&pretty
Sample output:
[
  {
    "health" : "yellow",
    "status" : "open",
    "index" : "gateway_default_analytics_transactionalevents_1639736462002-000001",
    "uuid" : "2tmWIIAcQ1KeSqIg9iPU0g",
    "pri" : "5",
    "rep" : "1",
    "docs.count" : "663",
    "docs.deleted" : "0",
    "store.size" : "909.8kb",
    "pri.store.size" : "909.8kb"
  }
]
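As a minimal sketch of the calculation described in this section, the following code reads a response like the sample and doubles the `pri.store.size` value of each index. The helper names `parse_size` and `actual_index_sizes` are hypothetical, and the sketch assumes that the size suffixes (kb, mb, gb, and so on) are powers of 1024.

```python
import json
from typing import Dict

# Assumption: _cat size suffixes are binary units (powers of 1024).
_UNITS = {"b": 1, "kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3,
          "tb": 1024 ** 4, "pb": 1024 ** 5}


def parse_size(text: str) -> float:
    """Convert a size string such as '909.8kb' to a number of bytes."""
    text = text.strip().lower()
    # Check two-letter suffixes first so that 'kb' is not mistaken for 'b'.
    for suffix in sorted(_UNITS, key=len, reverse=True):
        if text.endswith(suffix):
            return float(text[: -len(suffix)]) * _UNITS[suffix]
    raise ValueError(f"unrecognized size: {text!r}")


def actual_index_sizes(cat_indices_json: str) -> Dict[str, float]:
    """Map each index name to twice its primary store size, in bytes."""
    rows = json.loads(cat_indices_json)
    return {row["index"]: 2 * parse_size(row["pri.store.size"]) for row in rows}


if __name__ == "__main__":
    sample = """[{"index": "gateway_default_analytics_transactionalevents_1639736462002-000001",
                  "pri.store.size": "909.8kb"}]"""
    for name, size in actual_index_sizes(sample).items():
        print(name, round(size / 1024, 1), "KB in total")
```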
API Data Store Cluster Health
To ensure optimal health and performance of API Data Store, Software AG recommends monitoring the API Data Store cluster health regularly.
Command | Description |
curl -X GET http://localhost:9240/_cluster/health?pretty | This command retrieves API Data Store cluster health status. |
$.status | This JSON path expression retrieves the cluster health status from the response. |
$.number_of_nodes | This JSON path expression retrieves the number of nodes in the cluster from the response. |
The JSON response to the health check request contains a status field, which can have the value green, yellow, or red. The cluster health status is displayed based on the following color codes:
Status | Description |
green | Indicates that the cluster is in a healthy state. When API Data Store is handling huge data, it takes some time to display the cluster health status. |
yellow | Indicates that the cluster is not in a healthy state. Identify the cause and rectify it. During this time, API Data Store processes the requests for the indexes that are available. If there are unassigned shards, identify them, check the reason they are unallocated, and resolve the issue. Run the following command to retrieve the list of unassigned shards: curl -X GET "http://localhost:9240/_cat/shards?h=index,shard,primaryOrReplica,state,docs,store,ip,node,segments.count,unassigned.at,unassigned.details,unassigned.for,unassigned.reason,help,s=index&v" Run the following command to check why specific shards are unallocated: curl -X GET "http://localhost:9240/_cluster/allocation/explain" -d '{ "index" :"index name","primary" : "true|false","shard": "shardnumber"}' |
red | Indicates that API Data Store nodes are down or not reachable, or that the API Data Store master is not discovered. If the number of nodes does not match the number of API Data Store nodes configured, identify the node that did not join the cluster and the root cause that prevented it from joining. Based on the root cause, determine whether your API Data Store is down. If your API Data Store is down and not reachable, check the connectivity. |
A sample HTTP response is as follows:
{
  "cluster_name": "SAG_apidatastore_cluster",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 3,
  "number_of_data_nodes": 3,
  "active_primary_shards": 101,
  "active_shards": 202,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 100.0
}
The overall cluster status is green since all API Data Store nodes work as expected.
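The checks on `$.status` and `$.number_of_nodes` could be automated along the following lines. This is a sketch under stated assumptions: the function `evaluate_cluster_health` and the wording of its findings are hypothetical, and `expected_nodes` is the number of API Data Store nodes that you have configured.

```python
import json
from typing import List


def evaluate_cluster_health(response_json: str, expected_nodes: int) -> List[str]:
    """Return a list of findings; an empty list means nothing needs attention."""
    health = json.loads(response_json)
    findings = []

    status = health["status"]
    if status == "red":
        findings.append("status is red: nodes down, unreachable, or master not discovered")
    elif status == "yellow":
        findings.append("status is yellow: identify the cause and rectify it")

    nodes = health["number_of_nodes"]
    if nodes != expected_nodes:
        findings.append(f"{nodes} of {expected_nodes} nodes joined the cluster")

    unassigned = health.get("unassigned_shards", 0)
    if unassigned > 0:
        findings.append(f"{unassigned} unassigned shards: check the allocation reason")

    return findings


if __name__ == "__main__":
    sample = '{"status": "green", "number_of_nodes": 3, "unassigned_shards": 0}'
    print(evaluate_cluster_health(sample, expected_nodes=3))
```

Returning findings instead of a single flag lets a monitoring job report every deviation in one pass.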
Number of shards
To ensure proper allocation of shards to nodes, Software AG recommends monitoring the number of shards regularly.
Command | Description |
curl -X GET "http://localhost:9240/_cluster/health?pretty" | This command retrieves the number of shards on API Data Store. If the total number of active shards in the response exceeds (heap space in GB) * (number of nodes) * 20, increase the heap space of the API Data Store nodes or add a new API Data Store node. For more information on adding a new API Data Store node, see Adding New Nodes to an Elasticsearch Cluster. API Data Store considers a maximum of 20 active shards per GB of heap space as healthy. To keep the total number of active shards within this limit, scale up the API Data Store node. If you cannot scale up the API Data Store node, increase the heap size as the last option. The heap space should not be more than half of the system memory (RAM). For example, if the system memory is 16 GB, you can allocate a maximum of 8 GB of heap to API Data Store. To increase the heap space, modify the parameters -Xms2g and -Xmx2g in the jvm.options file located at SAG_Install_Directory\InternalDataStore\config. |
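The shard limit and the heap ceiling described in the table reduce to simple arithmetic, sketched below. The function names are hypothetical, and the sketch assumes that every node has the same heap size.

```python
def max_healthy_shards(heap_gb_per_node: float, nodes: int,
                       shards_per_gb: int = 20) -> int:
    """Upper bound on active shards: heap space * nodes * 20."""
    return int(heap_gb_per_node * nodes * shards_per_gb)


def exceeds_shard_limit(active_shards: int, heap_gb_per_node: float,
                        nodes: int) -> bool:
    """Return True when the cluster holds more active shards than the bound."""
    return active_shards > max_healthy_shards(heap_gb_per_node, nodes)


def max_heap_gb(system_memory_gb: float) -> float:
    """Heap should not be more than half of the system memory."""
    return system_memory_gb / 2


if __name__ == "__main__":
    # Hypothetical cluster: 3 nodes with a 2 GB heap each and 202 active shards.
    print(max_healthy_shards(2, 3))
    print(exceeds_shard_limit(202, 2, 3))
    print(max_heap_gb(16))
```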
Garbage Collection (GC) Monitoring
The GC metric provides the GC run-time in seconds. You must check GC run-time once every five minutes. The average GC run-time should not exceed one second.
Metric | Description |
elasticsearch_jvm_gc_collection_seconds_sum elasticsearch_jvm_gc_collection_seconds_count | Dividing elasticsearch_jvm_gc_collection_seconds_sum by elasticsearch_jvm_gc_collection_seconds_count gives the average GC run time. If the quotient is more than 1 second, GC is taking too long to run, which slows down API Data Store request processing. You must collect the logs and get the mapping of the API index and the transaction index. |
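The quotient can be computed as in the following sketch. The function names are hypothetical. The second helper computes the quotient over the change between two readings, for example readings taken five minutes apart; that interval-based variant is an assumption about how you might apply the check, not something this section prescribes.

```python
from typing import Tuple


def average_gc_seconds(seconds_sum: float, collection_count: float) -> float:
    """Average GC run time: total GC seconds divided by number of collections."""
    if collection_count == 0:
        return 0.0  # no collections recorded, so there is nothing to average
    return seconds_sum / collection_count


def average_gc_seconds_between(previous: Tuple[float, float],
                               current: Tuple[float, float]) -> float:
    """Average over the interval between two (sum, count) readings."""
    return average_gc_seconds(current[0] - previous[0], current[1] - previous[1])


def is_gc_slow(average_seconds: float, threshold_seconds: float = 1.0) -> bool:
    """Return True when the average GC run time is more than the threshold."""
    return average_seconds > threshold_seconds


if __name__ == "__main__":
    # Hypothetical readings: (seconds_sum, seconds_count) at two points in time.
    avg = average_gc_seconds_between((12.0, 48), (18.0, 52))
    print(avg, is_gc_slow(avg))
```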