Event Types and Definitions
This section describes the types of events that can be found in logs or viewed in the TMC.
*memory.longgc (Memory Manager)
*Level: WARN
*Cause: A full garbage collection (GC) longer than the configured threshold has occurred.
*Action: Reduce cache memory footprint in L1 (Terracotta client). Investigate issues with application logic and garbage creation.
*Notes: The default critical threshold is 8 seconds, but it can be reconfigured in tc.properties using longgc.threshhold (see the sketch following this entry). For information about setting tc.properties, see Terracotta Configuration Parameters.
Occurrence of this event could help diagnose certain failures. For details, see "Configuring the HealthChecker Properties" in the BigMemory Max High-Availability Guide.
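One way to set such a tc.properties value is through the <tc-properties> block at the top of tc-config.xml. The following is an illustrative sketch only: the property name is the one given in the note above, and the value and its units should be confirmed in Terracotta Configuration Parameters.
<tc-config xmlns="http://www.terracotta.org/config">
  <!-- Illustrative sketch: adjusts the long-GC event threshold.
       Confirm the expected value and units in Terracotta Configuration Parameters. -->
  <tc-properties>
    <property name="longgc.threshhold" value="10"/>
  </tc-properties>
  <!-- servers, clients, and other configuration elements omitted -->
</tc-config>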
*dgc.periodic.started (DGC)
*Level: INFO
*Cause: Periodic distributed garbage collection (DGC), which was explicitly enabled in the configuration, has started a cleanup cycle.
*Action: If periodic DGC is unneeded, disable it to improve overall cluster performance (a configuration sketch follows this entry).
*Notes: Periodic DGC, which is disabled by default, is mostly useful in the absence of automatic handling of distributed garbage.
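Periodic DGC is controlled from the <garbage-collection> element in the <servers> section of tc-config.xml. A minimal sketch, assuming that element, with periodic DGC left disabled (its default state):
<servers>
  <!-- server definitions omitted -->
  <!-- Illustrative sketch: periodic DGC disabled (the default).
       Set <enabled> to true only where distributed garbage is not handled automatically. -->
  <garbage-collection>
    <enabled>false</enabled>
    <verbose>false</verbose>
    <interval>3600</interval>
  </garbage-collection>
</servers>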
*dgc.periodic.finished (DGC)
*Level: INFO
*Cause: Periodic DGC, which was explicitly enabled in the configuration, ended a cleanup cycle.
*Action: If periodic DGC is unneeded, disable it to improve overall cluster performance.
*Notes: Event message reads "DGC[ {0} ] finished. Begin Count : {1} Collected : {2} Time Taken : {3} ms Live Objects : {4}".
*dgc.periodic.canceled (DGC)
*Level: INFO
*Cause: Periodic DGC, which was explicitly enabled in the configuration, has been canceled due to an interruption (for example, by a failover operation).
*Action: If periodic DGC is unneeded, disable it to improve overall cluster performance.
*Notes: Periodic DGC, which is disabled by default, is mostly useful in the absence of automatic handling of distributed garbage.
*dgc.inline.cleanup.started (DGC)
*Level: INFO
*Cause: L2 (Terracotta server) is starting up as ACTIVE with existing data, triggering inline distributed garbage collection (DGC).
*Action: No action necessary.
*Notes: Only seen when a server starts up as ACTIVE upon recovery, using Fast Restartability (see the configuration sketch below).
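For context, Fast Restartability is enabled in the <servers> section of tc-config.xml. A minimal sketch, assuming the <restartable> element:
<servers>
  <!-- server definitions omitted -->
  <!-- Illustrative sketch: with Fast Restartability enabled, an L2 that starts up
       as ACTIVE with existing data triggers inline DGC (dgc.inline.cleanup.started). -->
  <restartable enabled="true"/>
</servers>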
*dgc.inline.cleanup.finished (DGC)
*Level: INFO
*Cause: Inline DGC operation completed.
*Action: No action necessary.
*Notes: Event message reads "Inline DGC [ {0} ] reference cleanup finished. Begin Count : {1} Collected : {2} Time Taken : {3} ms Live Objects : {4}".
*dgc.inline.cleanup.canceled (DGC)
*Level: INFO
*Cause: Inline DGC operation interrupted.
*Action: Investigate any unusual cluster behavior or other events.
*Notes: Possibly occurs during failover, but other events should indicate the real cause.
*topology.node.joined (Cluster Topology)
*Level: INFO
*Cause: Specified node has joined the cluster.
*Action: No action necessary.
*Notes: None.
*topology.node.left (Cluster Topology)
*Level: WARN
*Cause: Specified node has left the cluster.
*Action: Check why the node has left (for example: long GC, network issues, or issues with local node resources).
*Notes: None.
*topology.node.state (Cluster Topology)
*Level: INFO
*Cause: L2 changing state (for example, from INITIALIZING to ACTIVE).
*Action: Check to see that the state change is expected.
*Notes: Event message reads "Moved to {0}", where {0} is the new state.
*topology.handshake.reject (Cluster Topology)
*Level: ERROR
*Cause: An L1 is trying to reconnect to the cluster but is rejected because it has already been expelled.
*Action: If the L1 does not go into a rejoin operation, it must be restarted manually.
*Notes: Event message reads "An {0} client {1} tried to connect to {2} server. Connection refused!!"
*topology.active.left (Cluster Topology)
*Level: WARN
*Cause: Active server left the cluster.
*Action: Check why the active L2 has left.
*Notes: None.
*topology.mirror.left (Cluster Topology)
*Level: WARN
*Cause: Mirror server left the cluster.
*Action: Check why the mirror L2 has left.
*Notes: None.
*topology.zap.received (Cluster Topology)
*Level: CRITICAL
*Cause: One L2 is trying to cause another L2 to restart ("zap").
*Action: Investigate a possible "split brain" situation (a mirror L2 behaves as the ACTIVE) if the zapped L2 does not obey the restart order.
*Notes: A "zap" operation happens only within a mirror group. Event message reads "SPLIT BRAIN, {0} and {1} are ACTIVE", where {0} and {1} are the two servers vying for the ACTIVE role.
*topology.zap.accepted (Cluster Topology)
*Level: CRITICAL
*Cause: The L2 is accepting the order to restart ("zap" order).
*Action: Check the state of the zapped L2 to ensure that it restarts as a mirror, or manually restart it.
*Notes: A "zap" order is issued only within a mirror group. Event message reads "{0} has more clients. Exiting!!", where {0} is the L2 that becomes the ACTIVE.
*topology.db.dirty (Cluster Topology)
*Level: WARN
*Cause: A mirror L2 is trying to join with data in place.
*Action: If the mirror does not automatically restart and wipe its data, its data may need to be manually wiped before it is restarted.
*Notes: Restarted mirror L2s must wipe their data to resync with the active L2. This is normally an automatic operation that should not require action. Event message reads "Started with dirty database. Exiting!! Restart {0}", where {0} is the mirror that is automatically restarting.
*topology.config.reloaded (Cluster Topology)
*Level: INFO
*Cause: Cluster configuration was reloaded.
*Action: No action necessary.
*Notes: None.
*dcv2.servermap.eviction (DCV2)
*Level: INFO
*Cause: Automatic evictions for optimizing Terracotta Server Array operations.
*Action: No action necessary.
*Notes: Event message reads "DCV2 Eviction - Time taken (msecs)={0}, Number of entries evicted={1}, Number of segments over threshold={2}, Total Overshoot={3}".
*system.time.different (System Setup)
*Level: WARN
*Cause: System clocks are not aligned.
*Action: Synchronize system clocks.
*Notes: The default tolerance is 30 seconds, but it can be reconfigured in tc.properties using time.sync.threshold (see the sketch following this entry). For information about setting tc.properties, see Terracotta Configuration Parameters.
Note that an overly large tolerance can introduce unpredictable errors and behaviors.
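Synchronizing the clocks (for example, with NTP) is the primary remedy; if the tolerance itself must be changed, the property named above can be set through the same <tc-properties> mechanism shown for memory.longgc. An illustrative sketch; confirm the value and its units in Terracotta Configuration Parameters:
<tc-properties>
  <!-- Illustrative sketch: adjusts the permitted clock difference between nodes. -->
  <property name="time.sync.threshold" value="30"/>
</tc-properties>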
*resource.capacity.near (Resource)
*Level: WARN
*Cause: L2 entered throttled mode, which could be a temporary condition (e.g., caused by bulk-loading) or could indicate insufficient allocation of memory.
*Action: See Managing Near-Memory-Full Conditions (a sizing sketch follows this entry).
*Notes: After emitting this event, the L2 can emit resource.capacity.restored (return to normal mode) or resource.capacity.full (move to restricted mode), based on resource availability. Event message reads "{0} is nearing capacity limit, performance may be degraded - {1}% usage", where {0} is the L2 identification and {1} is the % usage of the memory resources allocated to that L2.
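If this condition reflects insufficient allocation of memory rather than a temporary load spike, increasing the server's storage allocation is one possible remedy. A minimal sketch, assuming the <dataStorage> element of the server configuration; the sizes are illustrative only:
<servers>
  <server host="localhost" name="Server1">
    <!-- Illustrative sizes only: total managed data and its off-heap portion. -->
    <dataStorage size="8g">
      <offheap size="8g"/>
    </dataStorage>
  </server>
</servers>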
*resource.capacity.full (Resource)
*Level: ERROR
*Cause: L2 entered restricted mode, which could be a temporary condition (e.g., caused by bulk-loading) or could indicate insufficient allocation of memory.
*Action: See Managing Near-Memory-Full Conditions.
*Notes: After emitting this event, the L2 can emit resource.capacity.restored (return to normal mode), based on resource availability. Event message reads "{0} is at over capacity limit, no further additive operations will be accepted - {1}% usage", where {0} is the L2 identification and {1} is the % usage of the memory resources allocated to that L2.
*resource.capacity.restored (Resource)
*Level: INFO
*Cause: L2 returned to normal from throttled or restricted mode.
*Action: No action necessary.
*Notes: Event message reads "{0} capacity has been restored, performance has returned to normal - {1}% usage", where {0} is the L2 identification and {1} is the % usage of the memory resources allocated to that L2.