Monitoring

The Monitoring part of webMethods Cloud Container lets you monitor the health and availability of solutions and run-time instances, and view alerts and their statuses. You receive an email whenever a condition occurs that might affect a solution.

Monitoring Solutions

The monitoring of a new solution starts automatically 10 minutes after the creation of the solution. The data of the solution is collected and analyzed every 60 seconds.

You can access the monitoring pages from the left-side navigation menu of the Monitoring main page.

You can filter the information on most Monitoring pages based on time. To specify the time-range, select a value in the time-range selector.

The following table describes the options in the time-range selector.

Option Description
1h Displays the information for the last 1 hour.
6h Default. Displays the information for the last 6 hours.
12h Displays the information for the last 12 hours.
24h Displays the information for the last 1 day.
2d Displays the information for the last 2 days.
1w Displays the information for the last 1 week.
2w Displays the information for the last 2 weeks.
4w Displays the information for the last 4 weeks.
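
As an illustration only, the options in the table can be read as look-back windows that end at the current time. The following Python sketch maps each option to a duration and computes where the displayed period would start. The names TIME_RANGES and window_start are hypothetical and are not part of webMethods Cloud Container.

```python
from datetime import datetime, timedelta, timezone

# Look-back window for each time-range selector option, as listed in the table.
TIME_RANGES = {
    "1h": timedelta(hours=1),
    "6h": timedelta(hours=6),    # listed as the default option
    "12h": timedelta(hours=12),
    "24h": timedelta(hours=24),
    "2d": timedelta(days=2),
    "1w": timedelta(weeks=1),
    "2w": timedelta(weeks=2),
    "4w": timedelta(weeks=4),
}


def window_start(option: str, now: datetime) -> datetime:
    """Return the earliest timestamp covered by the selected option (hypothetical helper)."""
    return now - TIME_RANGES[option]


now = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
print(window_start("6h", now))  # 2024-01-15 06:00:00+00:00
```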

To navigate to the Monitoring main page, log in to webMethods Cloud Container, and select Monitoring in the webMethods Cloud Container navigation bar.

Dashboard

On the Dashboard page, you can view:

Solutions

On the Solutions page, you can check the health of the run-time instances from all the solutions. For each run-time instance, you can view the current data, and the data for the last 24 hours.

The health metrics are grouped into three categories:

Runtimes

On the Runtimes page, you can view the graphs for monitored KPIs for the selected run-time instances from all the solutions.

The example image shows the graph for the Used Memory KPI. The horizontal lines below the graph represent the severity and duration of the alerts that were raised for the KPI. The information alerts are displayed in blue, the warning alerts are in orange, and the critical alerts are in red.

The following table describes the meaning of the alert lines from the example graph for the Used Memory KPI.

Time Period Details
1 Until 2:05 h, an information alert was open.
2 At 2:05 h, the severity of the alert changed from information to warning.
3 An information alert existed during that period.
4 A warning alert existed during that period.

You can change the value in the Solutions drop-down field to load the information about the run-time instances from a specific solution.

You can use the INTEGRATION SERVER, UNIVERSAL MESSAGING, and TERRACOTTA tabs to view the information related to the selected solution and runtime.

By default, the page displays information for the last 24 hours. To view the information for a different time period, use the time-range selector. For more information about the time-range selector, see Monitoring Solutions.

The following table describes the monitored Integration Server KPIs.

Name Description
Used Memory The total used memory for the Java VM.
Service Threads The number of active service threads.
Sessions The number of active licensed sessions.
Stateful Sessions The number of the current stateful HTTP sessions.

The following table describes the monitored Universal Messaging KPIs.

Name Description
Free Memory The amount of free memory that the Realm Server has within the Java VM. This indicates the difference between what the Java VM has currently allocated and what the Realm Server has used.
Published Events Total number of events published on this realm from the time it started.
Subscribed Events Total number of events that this realm has sent to clients from the time it started.

The following table describes the monitored Terracotta KPIs.

Name Description
Off-Heap Used Memory Shows the amount of off-heap memory that is currently used.
Live Objects Shows the total number of live objects in the cluster, mirror group, server, or clients. If the trend for the total number of live objects goes up continuously, clients in the cluster will eventually run out of memory and applications might fail. Upward trends indicate a problem with application logic, garbage collection, or the tuning of one or more clients.

Viewing Adapter KPIs

On the Runtimes page, you can view the KPIs for the adapters that are installed on the Integration Server instances.

  1. Navigate to the Runtimes page.

  2. Select a solution.

  3. On the INTEGRATION SERVER tab, select an Integration Server instance.

  4. Click Connectivity KPIs.

  5. On the ADAPTERS tab, select an Adapter. The Adapter KPIs are displayed.

The following table describes the monitored Adapter KPIs.

Name Description
Connections The number of connection pools in the adapter and how many of them are currently enabled.
Notifications The number of adapter notifications (polling notifications) and how many of them are currently enabled.

Note: You can view Adapter KPIs only for the current time.

Viewing Connector KPIs

On the Runtimes page, you can view the KPIs for the connectors that are installed on the Integration Server instances.

  1. Navigate to the Runtimes page.

  2. Select a solution.

  3. On the INTEGRATION SERVER tab, select an Integration Server instance.

  4. Click Connectivity KPIs.

  5. Click the CONNECTORS tab.

  6. Select a provider.

  7. Select a connector. The Connector KPIs are displayed.

The following table describes the monitored Connector KPIs.

Name Description
Connections The number of connection pools in the connector and how many of them are currently enabled.
Listeners The number of connector listeners and how many of them are currently enabled.

Note: You can view Connector KPIs only for the current time.

Services

On the Services page, you can view the number of successful and failed service executions of the Integration Server instances from the solutions.

The Services page consists of the Service Executions pane and the History pane.

Pane Description
Service Executions Shows the following information about the service executions of the Integration Server instances for the selected time range:
  • Total number of service executions
  • The number of successful service executions
  • The number of failed service executions
  • The percentage of successful service executions, calculated as (number of successful service executions / total number of service executions) * 100
History Shows a chart with the history of successful (green) and failed (red) service executions. Hovering over a green or red bar displays the total number of successful or failed service executions, respectively.
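
The percentage formula from the Service Executions pane can be sketched in Python as follows. This is a minimal illustration, not the product's implementation. It assumes that the total is the sum of successful and failed executions, and the value returned when there are no executions is also an assumption, because the documentation does not state what the pane shows in that case.

```python
def success_percentage(successful: int, failed: int) -> float:
    """(successful / total) * 100, with total assumed to be successful + failed."""
    total = successful + failed
    if total == 0:
        # Assumption: no executions in the selected time range is reported as 0.0.
        return 0.0
    # Multiplying before dividing is mathematically the same as the documented formula.
    return 100 * successful / total


print(success_percentage(950, 50))  # 95.0
print(success_percentage(3, 1))     # 75.0
```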

The numbers of service executions on the Services page include the public and internal services of the Integration Server instance and their child services.

You can change the value in the Solutions drop-down field to view the information about a specific solution, or the information for all solutions.

By default, the page displays information for the last 24 hours. To view the information for a different time period, use the time-range selector. For more information about the time-range selector, see Monitoring Solutions.

Uptime

On the Uptime page, you can view time lines that represent the availability of all run-time instances of the solutions.

The color of the time lines changes based on the status of the run-time instances.

The following table describes the meaning of the different colors.

Time line color Indicates that
green the run-time instance was available during the indicated time period.
red the run-time instance was unavailable during the indicated time period.
grey the run-time instance did not exist during the indicated time period.
blue at least one node from the cluster is unavailable.
yellow a solution update is in progress (the solution is under maintenance).

Note: If the solution uses an Integration Server cluster, the number of Integration Server instances is indicated in brackets after the Integration Server instance name.

By default, the time line displays the availability of the instances during the last 24 hours. To view the information for a different time period, use the time-range selector. For more information about the time-range selector, see Monitoring Solutions.

Alerts

An alert is a notification that a rule has been violated.

On the Alerts page you can:

By default, the Alerts page displays the number of alerts (critical, warning, and information) for all the solutions, and detailed information about the alerts in a tabular format.

Note: If the duration of the rule violation is less than the time interval at which the rule is evaluated, the alert does not appear on the Alerts page. For more information about the interval, see Configuring the Alerts.

If you deactivate a solution, the Alerts page will not display the alerts for the solution.

If you activate a solution, the Alerts page will display both the historical alerts that were raised before the deactivation of the solution and the alerts that are raised after its activation.

When a solution update starts, the existing active alerts for the solution are set to resolved. During the update period, no alerts are generated for the solution. You can disregard any email alerts that you receive during the update period.

The following table describes the information that is displayed in the table on the Alerts page.

Column Description
Solution Name of the solution.
Runtime Run-time type:
  • Integration Server
  • Universal Messaging
  • Terracotta
Instance Name of the run-time instance.
Start Date Date and time when the alert was raised.
Resolved On Date and time when the alert was resolved. The field is empty if the alert is still active.
Message Description of the alert.
Status Status of the alert:
  • The alert is inactive.
  • The alert is active.

    Note: The Alerts page might not display the alerts for all nodes from a cluster. For example, if you monitor an Integration Server cluster with two Integration Server instances, and both instances have alerts for the same property with different severity, the Alerts page will show the alert of lower severity only, as explained in the following table.

    Integration Server instance Alert type Visibility on the Alerts page
    Integration Server instance 1 Information. Free memory is low. Yes
    Integration Server instance 2 Warning. Free memory is low. No

    You can view all alerts for all the nodes from the cluster in the email alerts.
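
The visibility rule from the cluster example can be sketched in Python as follows. This is a minimal illustration of the described behavior, not the product's implementation. The Alert class, the SEVERITY_RANK ordering, and the visible_alerts function are hypothetical names, and the handling of alerts with equal severity (the first one is kept) is an assumption.

```python
from dataclasses import dataclass

# Ordering of the three alert severities, from lowest to highest.
SEVERITY_RANK = {"information": 0, "warning": 1, "critical": 2}


@dataclass(frozen=True)
class Alert:
    instance: str   # run-time instance that raised the alert
    prop: str       # monitored property, for example "free memory"
    severity: str   # "information", "warning", or "critical"


def visible_alerts(alerts):
    """Keep, for each property, only the alert with the lowest severity."""
    lowest = {}
    for alert in alerts:
        current = lowest.get(alert.prop)
        if current is None or SEVERITY_RANK[alert.severity] < SEVERITY_RANK[current.severity]:
            lowest[alert.prop] = alert
    return list(lowest.values())


cluster_alerts = [
    Alert("Integration Server instance 1", "free memory", "information"),
    Alert("Integration Server instance 2", "free memory", "warning"),
]
for alert in visible_alerts(cluster_alerts):
    print(alert.instance, alert.severity)  # Integration Server instance 1 information
```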

    By default, the page displays information for the last 1 hour. To view the information for a different time period, use the time-range selector. For more information about the time-range selector, see Monitoring Solutions.

    Alert Types

    The following table provides more information about the alert types.

    Note: Warning alerts and information alerts are not available for KPIs that monitor the availability of a run-time instance.

    Alert Severity Description Color Coding
    Critical A condition exists that is critical for the system performance. red
    Warning A condition exists that might deteriorate the system performance. orange
    Information A condition exists that might evolve into a warning or critical alert. blue

    Configuring the Alerts

    You can change the default threshold values and the recipient email for the system alerts. Threshold values determine when a rule is violated and when the system raises an alert.
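
As a rough illustration of how boundary values determine a rule violation, the following Python sketch checks a KPI value against one range per severity and reports the most severe rule that the value violates. All numbers and names (THRESHOLDS, violated_severity) are invented for the example. Treating a value outside a range as a violation follows the description of the Threshold field, while reporting only the most severe violation is an assumption.

```python
# Hypothetical boundary values for a memory-usage KPI, in percent.
# Each severity has its own (low, high) range; the numbers are invented.
THRESHOLDS = {
    "critical": (0.0, 95.0),
    "warning": (0.0, 85.0),
    "information": (0.0, 75.0),
}


def violated_severity(value, thresholds=THRESHOLDS):
    """Return the most severe rule whose range the value falls outside of, or None."""
    for severity in ("critical", "warning", "information"):
        low, high = thresholds[severity]
        if value < low or value > high:
            return severity
    return None


print(violated_severity(70.0))  # None
print(violated_severity(80.0))  # information
print(violated_severity(97.5))  # critical
```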

    To configure the system alerts

    1. Navigate to the Alerts page.

    2. Select the CONFIGURATION tab. The Configuration page shows information about the alerts for all solutions. The following table describes the columns on the page.

    Column Description
    Name Alert Name.
    Runtime Integration Server, Universal Messaging, or Terracotta.
    Action The icon activates the configuration view for the alert.

    3. Click the Edit this rule icon in the Action column for the alert that you want to configure. A form with the configuration details for the alert rule is displayed. The following table describes the fields in the form.

    Field Description
    Threshold The KPI’s boundary values. When the value of the KPI is outside the range that is specified by these boundary values, the alert is raised.
    You can configure the threshold values of critical alerts by adjusting the ends of the red line.
    You can configure the threshold values of warning alerts by adjusting the ends of the orange line.
    You can configure the threshold values of information alerts by adjusting the ends of the blue line.
    Note: The Threshold field is read-only for KPIs that monitor the availability of a runtime instance.
    Runtime Integration Server, Universal Messaging, or Terracotta.
    Summary Summary of the alert.
    Interval The scrape interval. The scrape interval is the frequency at which the system collects the data. The scrape interval is 60 seconds for all rules. Read-only.
    Note: The alert does not appear immediately when the corresponding rule violation occurs.
    The time delay from the actual time of the rule violation to the system alert is the following:
    • up to 70 seconds for run-time availability rules of critical severity
    • up to 420 seconds for run-time availability rules of information severity
    • up to 180 seconds for the rest of the rules
    The system will not send an alert if the rule violation condition is resolved during the corresponding delay period.
    Email on alert Email of the user who will receive the alerts.
    Note: The email is used for all rules. In case there are alerts, the system sends emails once every 10 minutes. The email alerts always display UTC time.
    To configure more than one email, use comma-separated values.

    4. In the Threshold field, change the default threshold value(s) for the alert.

    5. In the Email on alert field, type the email(s) of the user(s) who will receive the email alerts for all rules.

    Note: webMethods Cloud Container stores the email(s) in the local alertManager.yaml file. When you uninstall webMethods Cloud Container or delete a tenant, the related information is deleted automatically.

    6. Click Apply.
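
The alert delays listed in the description of the Interval field can be sketched in Python as follows. This is a minimal illustration, not the product's implementation. The rule-kind labels and the function names are hypothetical. Because the documented delays are upper bounds ("up to"), the sketch only tells whether a violation lasts at least as long as the maximum delay. A shorter violation might or might not produce an alert.

```python
# Maximum delay, in seconds, between a rule violation and the system alert,
# keyed by (rule kind, severity). The rule-kind labels are hypothetical.
MAX_DELAY_SECONDS = {
    ("availability", "critical"): 70,
    ("availability", "information"): 420,
}
OTHER_RULES_MAX_DELAY_SECONDS = 180  # "the rest of the rules"


def max_alert_delay(rule_kind: str, severity: str) -> int:
    """Return the documented maximum delay for the given rule."""
    return MAX_DELAY_SECONDS.get((rule_kind, severity), OTHER_RULES_MAX_DELAY_SECONDS)


def outlasts_max_delay(rule_kind: str, severity: str, violation_seconds: int) -> bool:
    """True if the violation lasts at least as long as the maximum delay."""
    return violation_seconds >= max_alert_delay(rule_kind, severity)


print(max_alert_delay("availability", "critical"))               # 70
print(max_alert_delay("memory", "warning"))                      # 180
print(outlasts_max_delay("availability", "information", 300))    # False
```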

    Alert Actions

    You can take action to resolve the problems with the solutions that caused the alerts. The following table lists, for each alert, the probable cause of the problem and the recommended actions to resolve it.

    Alert Name Probable Cause Action to Resolve the Alert
    ISFreeMemoryLow The memory usage is reaching the configured thresholds.
    If the memory usage continuously reaches 95% or more and you do not observe any flaw in your application, another memory-intensive application is probably running.
    Allocate more memory to the solution.
    ISRuntimeSessionUsageHigh The solution uses too many sessions and there might not be free sessions for new requests. Try one of the following:
    - Stop some unnecessary services, if any
    - Increase the maximum number of active licensed sessions
    - Move some of the workload to another solution
    ISRuntimeStatefulSessionUsageHigh The number of the current stateful HTTP sessions is high. There might not be enough bandwidth for new sessions. Move some of the workload to another solution.
    ISRuntimeUnavailable Integration Server is down. Try one of the following:
    - If the Integration Server went down because of a high workload, create an Integration Server cluster.
    - If the Integration Server went down because of insufficient memory and you also get a memory alert, allocate more memory for Integration Server.
    TCOffHeapMemoryLow The off-heap memory has reached the threshold because of too much stored data. Stop adding data or delete some data from the off-heap memory.
    TCRuntimeUnavailable The Terracotta server went down, or a human error occurred (for example, somebody shut down the Terracotta server). Restart the Terracotta server. For greater safety and security, start with the server that was shut down last.
    UMRuntimeUnavailable The Universal Messaging server is down. Restart the Universal Messaging server. If the problem persists, contact Software AG Global Support.
    UMFreeMemoryLow The memory usage is reaching the configured thresholds.
    If the memory usage continuously reaches 95% or more and you do not observe any flaw in your application, another memory-intensive application is probably running.
    Increase the memory for Universal Messaging.

    Logs

    The Logs page (Monitoring > Logs) lets you download various logs of Integration Server and Universal Messaging within webMethods Cloud Container.

    There are two log views associated with webMethods Cloud Container:

    Note: By default, webMethods Cloud Container displays the logs accessed from Download.

    To view the Integration Server run-time logs for the current day, use the Integration Server admin console (Solutions > Manage > Administration).

    Viewing and Downloading the Logs for a Specific Run-Time Instance in a Solution for a Product

    The following example shows how to view and download the logs for a specific run-time instance in a solution for Integration Server.

    1. Select a time period from the time-range selector.

      By default, the page displays the logs for the last 1 hour. To view the logs for a different time period, use the time-range selector. For more information about the time-range selector, see Monitoring Solutions.

    2. Select a solution from the top drop-down list box.
      The list box displays all the active solutions available in the landscape model. In this example, the selected solution has two products: Integration Server and Universal Messaging.

    3. Select Integration Server.
      You will see all the available instances for this product in the drop-down list.

    4. Select a run-time instance and click the Download button.

      You can view log details, such as the log file name, the date when the log file was created, and the size of the log file. To download a log file, click the icon under Actions. webMethods Cloud Container downloads one file at a time. You cannot select multiple log files for download.

      You can view specific log lists by using filters. The filters are provided at the top of the log results and are based on the folders available in the product. Deselecting a filter removes the corresponding log results from the list.

      Note: By default, the retention period for old logs is 30 days or four weeks (4W), which means that you can view and download only the log files from the last 30 days.
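
As an illustration of the default retention period, the following Python sketch checks whether a log file created on a given date would still be available. The names RETENTION and is_retained are hypothetical. Counting a file that is exactly 30 days old as still retained is an assumption.

```python
from datetime import date, timedelta

# Default retention period stated in the note.
RETENTION = timedelta(days=30)


def is_retained(log_date: date, today: date) -> bool:
    """True if a log file created on log_date is still inside the retention window."""
    return today - log_date <= RETENTION


print(is_retained(date(2024, 1, 10), date(2024, 1, 31)))  # True
print(is_retained(date(2024, 1, 1), date(2024, 2, 1)))    # False
```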