Prometheus Integration
Prometheus (https://prometheus.io/) is an open-source systems monitoring and alerting toolkit that can be used alongside products like Grafana (https://grafana.com/) for interactive data visualization and analytics. Terracotta exposes a list of key Terracotta metrics in Prometheus-compatible format over HTTP on the following TMS (Terracotta Management Server) endpoint:
http(s)://[host]:[port]/actuator/prometheus
For example, if the Terracotta Management Server (TMS) is available at http://localhost:9480, and you have configured a cluster connection within its interface, then the Prometheus metrics can be accessed at http://localhost:9480/actuator/prometheus.
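As a minimal sketch of the URL pattern above, the following Python helpers build the endpoint address from its parts and read the raw response. The function names metrics_url and fetch_metrics are hypothetical and only serve this illustration; the sketch assumes a TMS with security disabled.

```python
from urllib.parse import urlunsplit
from urllib.request import urlopen


def metrics_url(host: str, port: int, secure: bool = False) -> str:
    """Build the TMS Prometheus endpoint URL (hypothetical helper)."""
    scheme = "https" if secure else "http"
    return urlunsplit((scheme, f"{host}:{port}", "/actuator/prometheus", "", ""))


def fetch_metrics(url: str, timeout: float = 10.0) -> str:
    """Return the raw text served by the endpoint.

    Assumes the endpoint is reachable without authentication.
    """
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling metrics_url("localhost", 9480) yields the address used in the example above, which can then be passed to fetch_metrics.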
Available metrics
All the available Terracotta metrics are prefixed with sag_tc_
Server Side Metrics
These are the same metrics as you would find under the Using the Server Tab section.
Server Specific Resource Usage Metrics
| Prometheus Metric Name | Metric Description | Type |
|---|---|---|
| sag_tc_server_dataroot_total_disk_usage_bytes | Dataroot total disk usage in bytes. | Gauge |
| sag_tc_server_offheap_allocated_memory_bytes | Offheap memory allocated in bytes. | Gauge |
| sag_tc_server_restartable_store_total_usage_bytes | FRS usage in bytes. | Gauge |
Server-side Caching Specific Resource Usage Metrics
| Prometheus Metric Name | Metric Description | Type |
|---|---|---|
| sag_tc_server_caching_pool_allocated_size_bytes | Caching pool allocated size in bytes. | Gauge |
| sag_tc_server_caching_store_allocated_memory_bytes | Caching store allocated memory in bytes. | Gauge |
| sag_tc_server_caching_store_data_size_bytes | Caching store data size in bytes. | Gauge |
| sag_tc_server_caching_store_entries_count | Number of caching store entries. | Gauge |
Server-side Store Specific Resource Usage Metrics
| Prometheus Metric Name | Metric Description | Type |
|---|---|---|
| sag_tc_server_dataset_main_record_occupied_storage_bytes | Total storage occupied by the dataset, in bytes. This is the sum of the 3 dataset_occupied metrics below. | Gauge |
| sag_tc_server_dataset_occupied_primary_key_bytes | | Gauge |
| sag_tc_server_dataset_occupied_persistent_support_bytes | | Gauge |
| sag_tc_server_dataset_occupied_heap_bytes | | Gauge |
| sag_tc_server_dataset_allocated_memory_bytes | Total storage allocated by the dataset, in bytes. This is the sum of the 4 dataset_allocated metrics below. | Gauge |
| sag_tc_server_dataset_allocated_primary_key_bytes | | Gauge |
| sag_tc_server_dataset_allocated_persistent_support_bytes | | Gauge |
| sag_tc_server_dataset_allocated_heap_bytes | | Gauge |
| sag_tc_server_dataset_allocated_index_bytes | | Gauge |
| sag_tc_server_dataset_index_access_count | Dataset index access count. | Counter |
| sag_tc_server_dataset_index_occupied_storage_bytes | Dataset index occupied storage in bytes. | Gauge |
| sag_tc_server_dataset_index_record_count | Dataset index record count. | Gauge |
| sag_tc_server_dataset_record_count | Dataset record count. | Gauge |
All the exposed server metrics have the following labels:
| Server Metric Label | Label Description |
|---|---|
| alias | This label can represent a server-side cache resource name, an offheap resource name, a dataroot name, or a dataset name. |
| connection_name | Name of the connection set by the user in the TMC web application. |
| entity_name | A technical attribute that represents the server-side entity name. |
| entity_type | A technical attribute that represents the server-side entity type. |
| server | Represents the server name. |
| stripe | Represents the stripe name. |
| cluster_tier_manager | For caching resources only. Matches the alias of the entity given by the user when connecting to a clustered cache (e.g. terracotta://myhost:9410/anEntity). |
Example
sag_tc_server_caching_pool_allocated_size_bytes{alias="cache1",
cluster_tier_manager="CacheManager1", connection_name="MyCluster",
entity_name="CacheManager1$cache1", entity_type="cache_cluster_tier",
instance="localhost:9480", job="terracotta", server="stripe-1-server-1",
stripe="stripe-1"} 2228224
This example reports the server-side allocated size, in bytes, for the cache "cache1" under the cluster_tier_manager named "CacheManager1", on the server named "stripe-1-server-1", for the connection called "MyCluster".
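To make the anatomy of such a sample explicit, here is a rough Python sketch that splits one labelled sample into its metric name, label map, and value. The function parse_sample and both regular expressions are assumptions made for this illustration, not part of Terracotta or of any Prometheus client library. The sketch collapses whitespace (so a sample wrapped over several lines, as above, still parses) and does not unescape label values.

```python
import re

# metric_name{label="value", ...} sample_value
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
# One label="value" pair; escaped characters inside the value are kept verbatim.
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(text: str):
    """Split one labelled sample into (name, labels, value)."""
    # Collapse line breaks and repeated spaces so wrapped samples parse too.
    flattened = " ".join(text.split())
    match = SAMPLE_RE.match(flattened)
    if match is None:
        raise ValueError(f"not a labelled sample: {text!r}")
    labels = dict(LABEL_RE.findall(match.group("labels")))
    return match.group("name"), labels, float(match.group("value"))
```

Applied to the example above, the labels dictionary would map "alias" to "cache1" and "server" to "stripe-1-server-1", and the value would be 2228224.0.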
Cache Metrics
These are the same metrics as you would find under the Using the Ehcache Tab section.
| Prometheus Metric Name | Metric Description | Type |
|---|---|---|
| sag_tc_cache_get_hit_latency_100_percentile | Latency maxima of successful cache.get(key) operations (hits). | Gauge |
| sag_tc_cache_get_hit_latency_95_percentile | 95th percentile of latencies of successful cache.get(key) operations (hits). | Gauge |
| sag_tc_cache_get_hit_latency_99_percentile | 99th percentile of latencies of successful cache.get(key) operations (hits). | Gauge |
| sag_tc_cache_get_miss_latency_100_percentile | Latency maxima of misses of cache.get(key) operations. | Gauge |
| sag_tc_cache_get_miss_latency_95_percentile | 95th percentile of latencies of misses of cache.get(key) operations. | Gauge |
| sag_tc_cache_get_miss_latency_99_percentile | 99th percentile of latencies of misses of cache.get(key) operations. | Gauge |
| sag_tc_cache_put_latency_100_percentile | Latency maxima of successful cache.put(key, val) operations. | Gauge |
| sag_tc_cache_put_latency_95_percentile | 95th percentile of latencies of successful cache.put(key, val) operations. | Gauge |
| sag_tc_cache_put_latency_99_percentile | 99th percentile of latencies of successful cache.put(key, val) operations. | Gauge |
| sag_tc_cache_remove_latency_100_percentile | Latency maxima of successful cache.remove(key) operations. | Gauge |
| sag_tc_cache_remove_latency_95_percentile | 95th percentile of latencies of successful cache.remove(key) operations. | Gauge |
| sag_tc_cache_remove_latency_99_percentile | 99th percentile of latencies of successful cache.remove(key) operations. | Gauge |
| sag_tc_cache_hit_count_total | Total times a get command returned a value. | Counter |
| sag_tc_cache_miss_count_total | Total times a get command did not return a value. | Counter |
| sag_tc_cache_put_count_total | Total number of puts to the cache. | Counter |
| sag_tc_cache_removal_count_total | Total number of removes from the cache. | Counter |
| sag_tc_clustered_hit_count_total | Total number of get commands that returned a value from the cluster tier. | Counter |
| sag_tc_clustered_miss_count_total | Total number of get commands that failed to return a value from the cluster tier. | Counter |
| sag_tc_clustered_put_count_total | Total number of puts to the cluster tier. | Counter |
| sag_tc_clustered_removal_count_total | Total number of removes from the cluster tier. | Counter |
| Cache Metric Label | Label Description |
|---|---|
| cache | Name of the cache. |
| cache_manager | Name of the cache manager. |
| client | Information about the client (e.g. 32164@127.0.0.1:Ehcache:CacheManager1). |
| client_address | Address part of client (e.g. 127.0.0.1). |
| client_name | Name part of client (e.g. Ehcache:CacheManager1). |
| client_pid | PID part of client (e.g. 32164). |
| connection_name | Name of the connection set by the user in the TMC web application. |
| clustered | `Y` if cache is clustered, `N` if not clustered. |
| instance_id | Unique ID representing the client. |
Example
sag_tc_cache_get_hit_latency_95_percentile{cache="cache1",
cache_manager="CacheManager1",
client="32164@127.0.0.1:Ehcache:CacheManager1",
client_address="127.0.0.1", client_name="Ehcache:CacheManager1",
client_pid="32164", clustered="Y", connection_name="MyCluster",
instance="localhost:9480", instance_id="84bd0e20-26ff-4b9f-ae6c-90622eb48c74",
job="terracotta"} 2110940
Store Metrics
These are the same metrics as you would find under the Using the TCStore Tab section.
| Prometheus Metric Name | Metric Description | Type |
|---|---|---|
| sag_tc_dataset_add_latency_100_percentile | Latency maxima of dataset add operations. | Gauge |
| sag_tc_dataset_add_latency_95_percentile | 95th percentile of latencies of dataset add operations. | Gauge |
| sag_tc_dataset_add_latency_99_percentile | 99th percentile of latencies of dataset add operations. | Gauge |
| sag_tc_dataset_delete_latency_100_percentile | Latency maxima of dataset delete operations. | Gauge |
| sag_tc_dataset_delete_latency_95_percentile | 95th percentile of latencies of dataset delete operations. | Gauge |
| sag_tc_dataset_delete_latency_99_percentile | 99th percentile of latencies of dataset delete operations. | Gauge |
| sag_tc_dataset_get_latency_100_percentile | Latency maxima of dataset read operations. | Gauge |
| sag_tc_dataset_get_latency_95_percentile | 95th percentile of latencies of dataset read operations. | Gauge |
| sag_tc_dataset_get_latency_99_percentile | 99th percentile of latencies of dataset read operations. | Gauge |
| sag_tc_dataset_update_latency_100_percentile | Latency maxima of dataset update operations. | Gauge |
| sag_tc_dataset_update_latency_95_percentile | 95th percentile of latencies of dataset update operations. | Gauge |
| sag_tc_dataset_update_latency_99_percentile | 99th percentile of latencies of dataset update operations. | Gauge |
| sag_tc_dataset_add_already_exists_total | The number of Add:AlreadyExists operations. | Counter |
| sag_tc_dataset_add_failure_total | The number of Add:Failure operations. | Counter |
| sag_tc_dataset_add_success_total | The number of Add:Success operations. | Counter |
| sag_tc_dataset_delete_failure_total | The number of Delete:Failure operations. | Counter |
| sag_tc_dataset_delete_not_found_total | The number of Delete:NotFound operations. | Counter |
| sag_tc_dataset_delete_success_total | The number of Delete:Success operations. | Counter |
| sag_tc_dataset_get_failure_total | The number of Get:Failure operations. | Counter |
| sag_tc_dataset_get_not_found_total | The number of Get:NotFound operations. | Counter |
| sag_tc_dataset_get_success_total | The number of Get:Success operations. | Counter |
| sag_tc_dataset_stream_failure_total | The number of Stream:Failure operations. | Counter |
| sag_tc_dataset_stream_request_total | The number of Stream:Request operations. | Counter |
| sag_tc_dataset_update_failure_total | The number of Update:Failure operations. | Counter |
| sag_tc_dataset_update_not_found_total | The number of Update:NotFound operations. | Counter |
| sag_tc_dataset_update_success_total | The number of Update:Success operations. | Counter |
| Store Metric Label | Label Description |
|---|---|
| dataset | Name of the dataset. |
| dataset_manager | Name of the dataset manager. |
| dataset_instance | Name of the dataset instance. |
| client | Information about the client (e.g. 32164@127.0.0.1:Store:TinyPounderDataset). |
| client_address | Address part of client (e.g. 127.0.0.1). |
| client_name | Name part of client (e.g. Store:TinyPounderDataset). |
| client_pid | PID part of client (e.g. 32164). |
| connection_name | Name of the connection set by the user in the TMC web application. |
| instance_id | Unique ID associated with each client. |
Example
sag_tc_dataset_add_latency_95_percentile{client="32164@127.0.0.1:Store:TinyPounderDataset",
client_address="127.0.0.1", client_name="Store:TinyPounderDataset", client_pid="32164",
connection_name="MyCluster", dataset="dataset1", dataset_instance="dataset1-1", dataset_manager="TinyPounderDataset",
instance="localhost:9480", instance_id="1UnvihEwPjFnfjnvGO_MoA", job="terracotta"}
Connecting to Prometheus with Security disabled in TMS
Follow these steps to connect to the Prometheus endpoint with Security disabled in TMS:
1. Navigate to the TMC web application and create a connection to the TSA cluster.
The name of the connection will be the value of the connection_name label. Once the connection is created, /actuator/prometheus will start returning Terracotta metrics in Prometheus format.
2. Use the following sample configuration to add Terracotta as a target in the prometheus.yml configuration file.
For more details, refer to the https://prometheus.io/docs/prometheus/latest/configuration/configuration/ page.
global:
  scrape_interval: 30s
scrape_configs:
  - job_name: 'terracotta'
    metrics_path: /actuator/prometheus
    static_configs:
      - targets: ['localhost:9480']
Connecting to Prometheus with Security enabled in TMS
When user authentication is enabled, all endpoints on TMS become password protected. TMS supports the basic authentication scheme for access to the /actuator/prometheus endpoint. For Prometheus to access the metrics, you need to provide TMS credentials in the basic_auth key of the Prometheus configuration file. Follow these steps to connect to the Prometheus endpoint with Security enabled in TMS:
1. Navigate to the TMC web application and create a connection to the TSA cluster.
The name of the connection will be the value of the connection_name label.
2. If SSL is enabled, export the SSL certificate and provide it to the Prometheus configuration file in a key called ca_file. To export the certificate, you can use either a command line tool or a graphical tool, like https://keystore-explorer.org.
For example:
keytool -exportcert -alias <tms-alias> -keystore <tms-keystore> -rfc -file <tms-cert>
3. Use the following sample configuration to add Terracotta as a target in the prometheus.yml configuration file.
<username> and <password> are the user's username and password, and <path-to-tms-certificate> is a valid file path to the TMS certificate.
global:
  scrape_interval: 30s
scrape_configs:
  - job_name: 'terracotta'
    scheme: https
    metrics_path: /actuator/prometheus
    static_configs:
      - targets: ['localhost:9480']
    basic_auth:
      username: <username>
      password: <password>
    tls_config:
      ca_file: <path-to-tms-certificate>
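Before pointing Prometheus at a secured TMS, it can help to check the credentials and the exported certificate by hand. The following Python sketch shows one way to do that using only the standard library. The names basic_auth_header and fetch_secured_metrics are hypothetical helpers for this illustration, and the sketch assumes the exported certificate file is in PEM format (as produced by the -rfc option above).

```python
import base64
import ssl
from urllib.request import Request, urlopen


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def fetch_secured_metrics(url, username, password, ca_file, timeout=10.0):
    """Read the endpoint over HTTPS, trusting the exported TMS certificate.

    ca_file plays the same role as tls_config.ca_file in prometheus.yml.
    """
    context = ssl.create_default_context(cafile=ca_file)
    request = Request(
        url, headers={"Authorization": basic_auth_header(username, password)}
    )
    with urlopen(request, context=context, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

If this call fails with an HTTP 401 error, the credentials are the likely cause; a certificate verification error instead points at the ca_file or at the host name used in the URL.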
Rates and Ratios
The Terracotta Prometheus endpoint exposes counter metrics from which associated rates and ratios can be calculated with the help of PromQL (https://prometheus.io/docs/prometheus/latest/querying/basics/). For example:
* You can derive the cache hit rate from the sag_tc_cache_hit_count_total counter by using the PromQL rate function.
rate(sag_tc_cache_hit_count_total[2m])
This calculates the average cache hit rate (hits/second) measured over a 2-minute window.
* You can calculate the cache hit ratio with the following query (range = 2 minutes).
(rate(sag_tc_cache_hit_count_total[2m]) /
(rate(sag_tc_cache_hit_count_total[2m]) +
rate(sag_tc_cache_miss_count_total[2m])))
Note:
Collector Interval controls how frequently the statistics are collected in TMS. To achieve graph trends similar to TMC, you can set scrape_interval in prometheus.yml to be equal to the collector interval. The default value of the collector interval is 30 seconds. The range value in the rate function should be changed based on the value of scrape_interval.
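The arithmetic behind the two queries above can be sketched in a few lines of Python. This is a simplification: the real PromQL rate function also handles counter resets and extrapolates to the window boundaries, which this sketch does not. The function names and the sample values in the usage note are made up for illustration.

```python
def per_second_rate(first, last):
    """Average per-second increase between two (timestamp, counter) samples.

    Simplified view of rate(): no counter-reset handling, no extrapolation.
    """
    (t0, v0), (t1, v1) = first, last
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    return (v1 - v0) / (t1 - t0)


def hit_ratio(hit_rate, miss_rate):
    """hits / (hits + misses); NaN when there was no traffic at all."""
    total = hit_rate + miss_rate
    return hit_rate / total if total else float("nan")
```

With hypothetical samples taken 120 seconds apart, a hit counter going from 1000 to 1900 gives 7.5 hits/second, a miss counter going from 200 to 500 gives 2.5 misses/second, and the resulting hit ratio is 0.75.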
Getting the metrics directly from the servlet
You can read the metrics using any HTTP client, such as curl:
curl http(s)://[host]:[port]/actuator/prometheus
You can craft a regular expression to search for specific metrics:
curl -s http(s)://[host]:[port]/actuator/prometheus |
grep "sag_tc_server.*{.*} .*"
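The same filtering can be done on text that has already been fetched. The Python sketch below mirrors the grep pattern above, with the braces escaped for Python's regular expression syntax; server_samples is a hypothetical helper name.

```python
import re

# Same pattern as the grep example, with the braces escaped.
SERVER_SAMPLE = re.compile(r"sag_tc_server.*\{.*\} .*")


def server_samples(exposition: str):
    """Keep only the lines that hold a labelled sag_tc_server sample."""
    return [line for line in exposition.splitlines() if SERVER_SAMPLE.search(line)]
```

Like grep, the helper matches anywhere in a line, so comment lines that mention a server metric but carry no labelled value are skipped.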
Prometheus in Kubernetes
There are several ways to let Prometheus grab the metrics available at http(s)://[host]:[port]/actuator/prometheus.
For example, if you deployed the TMC using Kubernetes, and you created a service for it, you can simply add this YAML configuration to your service manifest to have Prometheus read the metrics regularly:
metadata:
  name: tmc
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/path: '/actuator/prometheus'
Prometheus querying
When you have successfully installed and deployed Prometheus, you'll be able to choose the Terracotta cluster metrics.
[Figure: Prometheus metrics]
You can also choose precisely which metrics you want to display using labels and PromQL. For example, if you only want the FRS usage for the "dataroot-2" dataroot on the server "terracotta-1-0", you can use the following Prometheus query:
sag_tc_server_restartable_store_total_usage_bytes{
alias=~'dataroot-2.*',server='terracotta-1-0'}
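The query above combines an equality matcher (=) with a regular expression matcher (=~). In PromQL, regular expression matchers are fully anchored, so 'dataroot-2.*' must match the whole label value. The Python sketch below imitates that selection logic on a plain dictionary of labels. The matches function is hypothetical, and Python's re module stands in for the RE2 syntax that Prometheus uses, which is close enough for simple patterns like this one.

```python
import re


def matches(labels, equals=None, regex=None):
    """Imitate PromQL '=' and '=~' matchers on a dict of labels.

    A missing label is treated as an empty string, and regular
    expressions must match the entire value (full anchoring).
    """
    equals = equals or {}
    regex = regex or {}
    if any(labels.get(key, "") != value for key, value in equals.items()):
        return False
    return all(
        re.fullmatch(pattern, labels.get(key, "")) is not None
        for key, pattern in regex.items()
    )
```

A series with alias "dataroot-2" on server "terracotta-1-0" is selected, while one with alias "my-dataroot-2" is not, because the pattern has to match from the first character of the value.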
Visualization with Grafana
Once you have deployed Prometheus, you can use Grafana for data visualizations and monitoring.
To get started, you can import a sample dashboard for Terracotta.
Prometheus Usage Notes
* If you remove a cluster connection from TMC, the Prometheus endpoint will clear all the Terracotta metrics for that connection name.
* If the Prometheus server is unable to get Terracotta metrics, make sure that the cluster is successfully connected in the TMC web UI dashboard. For additional details, check the TMC and Prometheus logs.